Hah! Got nerd-sniped again:
http://stackoverflow.com/questions/43099226/how-to-make-jooq-to-use-arrays-in-the-in-clause/43102102
A jOOQ user was wondering why jOOQ would generate an IN list for a predicate like this:
Java
COLUMN.in(1, 2, 3, 4)
SQL
COLUMN in (?, ?, ?, ?)
… when in fact the following predicate could have been generated instead:
COLUMN = any(?::int[])
In the second case, there would have been only a single bind variable instead of 4, and the SQL generation and parsing work would have been “much” less (maybe not for an IN list of size 4, but let’s imagine a list of 50 values).
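To see what the ?::int[] bind variable stands for, here is the same predicate with the array written out, once as an array literal and once with an array constructor (purely illustrative, using the values from the example above):
COLUMN = any('{1,2,3,4}'::int[])
COLUMN = any(ARRAY[1, 2, 3, 4])
Either way, the predicate matches rows whose COLUMN value equals any element of the array, just like the IN list does.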
A disclaimer
First off, a disclaimer: in databases that have a cursor cache / plan cache (e.g. Oracle or SQL Server), you should be careful with long IN lists, because they will probably trigger a hard parse every time you run them: by the time you run the exact same predicate (with 371 elements in the list) again, the execution plan will have been purged from the cache. So, you cannot really profit from the cache.
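To illustrate the point schematically (these are just predicate shapes, not complete statements): every distinct IN list length is a distinct SQL text, and thus a distinct plan cache entry, whereas the array variant always produces the same SQL text.
COLUMN in (?, ?)        -- one cache entry
COLUMN in (?, ?, ?)     -- another cache entry
COLUMN = any(?::int[])  -- a single cache entry, regardless of how many values are bound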
I’m aware of this problem, and it will be the topic of another blog post, soon. Let’s stick to PostgreSQL, whose “plan cache” isn’t really that sophisticated.
Measure, don’t guess
The question was about improving the speed of parsing a SQL statement. Parsers are really fast, so parsing shouldn’t be a problem. Generating an execution plan certainly does cost more time, but again, since PostgreSQL’s plan cache isn’t very sophisticated, this won’t play into the issue here. So the question is really:
Is an IN list really that bad in PostgreSQL?
Would an array bind variable be much better?
Since our recent post about benchmarking, we now know that we shall never guess, but always measure. I’m using again the Sakila database to run these two queries:
-- IN list
SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id IN (?, ?, ?, ?)

-- Array
SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id = ANY(?)
Let’s try lists of length 4, first. The benchmark is here:
DO $$
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT INT := 1000;
  rec RECORD;
  v_e1 INT := 1;
  v_e2 INT := 2;
  v_e3 INT := 4;
  v_e4 INT := 8;
  v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4];
BEGIN
  FOR r IN 1..5 LOOP
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM film
        JOIN film_actor USING (film_id)
        JOIN actor USING (actor_id)
        WHERE film_id IN (v_e1, v_e2, v_e3, v_e4)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts);
    v_ts := clock_timestamp();

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT *
        FROM film
        JOIN film_actor USING (film_id)
        JOIN actor USING (actor_id)
        WHERE film_id = ANY(v_any_arr)
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts);
  END LOOP;
END$$;
The result being:
INFO: Run 1, Statement 1: 00:00:00.112195
INFO: Run 1, Statement 2: 00:00:00.450461
INFO: Run 2, Statement 1: 00:00:00.109792
INFO: Run 2, Statement 2: 00:00:00.446518
INFO: Run 3, Statement 1: 00:00:00.105413
INFO: Run 3, Statement 2: 00:00:00.44298
INFO: Run 4, Statement 1: 00:00:00.108249
INFO: Run 4, Statement 2: 00:00:00.476527
INFO: Run 5, Statement 1: 00:00:00.120229
INFO: Run 5, Statement 2: 00:00:00.448214
Interesting. So, the IN list outperforms the array bind variable every time by a factor of 4 (which is the size of the array / list!). So, let’s try 8 values, then. Here are the values and the adapted query 1:
-- values
v_e1 INT := 1;
v_e2 INT := 2;
v_e3 INT := 4;
v_e4 INT := 8;
v_e5 INT := 16;
v_e6 INT := 32;
v_e7 INT := 64;
v_e8 INT := 128;
v_any_arr INT[] := ARRAY[v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8];

-- adapted query 1
...
WHERE film_id IN (v_e1, v_e2, v_e3, v_e4, v_e5, v_e6, v_e7, v_e8)
-- ...
The result is still impressive:
INFO: Run 1, Statement 1: 00:00:00.182646
INFO: Run 1, Statement 2: 00:00:00.63624
INFO: Run 2, Statement 1: 00:00:00.184814
INFO: Run 2, Statement 2: 00:00:00.685976
INFO: Run 3, Statement 1: 00:00:00.188108
INFO: Run 3, Statement 2: 00:00:00.634903
INFO: Run 4, Statement 1: 00:00:00.184933
INFO: Run 4, Statement 2: 00:00:00.626616
INFO: Run 5, Statement 1: 00:00:00.185879
INFO: Run 5, Statement 2: 00:00:00.636723
The IN list query now takes almost 2x as long (but not quite 2x), whereas the array query now takes around 1.5x as long. It looks as though arrays become the better choice when their size increases. So, let’s do this! With 32 bind variables in the IN list, or 32 array elements respectively:
INFO: Run 1, Statement 1: 00:00:00.905064
INFO: Run 1, Statement 2: 00:00:00.752819
INFO: Run 2, Statement 1: 00:00:00.760475
INFO: Run 2, Statement 2: 00:00:00.758247
INFO: Run 3, Statement 1: 00:00:00.777667
INFO: Run 3, Statement 2: 00:00:00.895875
INFO: Run 4, Statement 1: 00:00:01.308167
INFO: Run 4, Statement 2: 00:00:00.789537
INFO: Run 5, Statement 1: 00:00:00.788606
INFO: Run 5, Statement 2: 00:00:00.776159
Both are about equally fast. 64 bind values!
INFO: Run 1, Statement 1: 00:00:00.915069
INFO: Run 1, Statement 2: 00:00:01.058966
INFO: Run 2, Statement 1: 00:00:00.951488
INFO: Run 2, Statement 2: 00:00:00.906285
INFO: Run 3, Statement 1: 00:00:00.907489
INFO: Run 3, Statement 2: 00:00:00.892393
INFO: Run 4, Statement 1: 00:00:00.900424
INFO: Run 4, Statement 2: 00:00:00.903447
INFO: Run 5, Statement 1: 00:00:00.961805
INFO: Run 5, Statement 2: 00:00:00.951697
Still about the same. OK… INTERN! Get over here. I need you to “generate” 128 bind values on this query.
Yep, as expected. Finally, arrays start to outperform IN lists:
INFO: Run 1, Statement 1: 00:00:01.122866
INFO: Run 1, Statement 2: 00:00:01.083816
INFO: Run 2, Statement 1: 00:00:01.416469
INFO: Run 2, Statement 2: 00:00:01.134882
INFO: Run 3, Statement 1: 00:00:01.122723
INFO: Run 3, Statement 2: 00:00:01.087755
INFO: Run 4, Statement 1: 00:00:01.143148
INFO: Run 4, Statement 2: 00:00:01.124902
INFO: Run 5, Statement 1: 00:00:01.236722
INFO: Run 5, Statement 2: 00:00:01.113741
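As an aside: if you want to reproduce the larger runs without typing out dozens of values, the array side can generate them in SQL. This is a hypothetical shortcut, not the script used above, and it uses consecutive film_ids rather than the values from the benchmark:
SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id = ANY (ARRAY(SELECT generate_series(1, 128)));
Note that the planner then no longer sees 128 individual constants, so this is a convenience for experimenting, not an exact equivalent of the measured statement.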
Using Oracle
Oracle also has array types (you have to declare them as nominal types first, but that’s not a problem here).
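As a rough sketch (the type name and the exact query shape are assumptions, not necessarily what the benchmark used), a nominal collection type and the array variant of the query could look like this:
-- Hypothetical nominal collection type:
CREATE TYPE numbers AS TABLE OF NUMBER;
/

-- The array variant could then unnest the bound collection:
SELECT *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id IN (SELECT column_value FROM TABLE(?));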
Here are some benchmark results (as always, these aren’t actual benchmark results but anonymised units of measurement, i.e. not seconds but… Larrys):
4 bind values
Run 1, Statement 1 : 01.911000000
Run 1, Statement 2 : 02.852000000
Run 2, Statement 1 : 01.659000000
Run 2, Statement 2 : 02.680000000
Run 3, Statement 1 : 01.628000000
Run 3, Statement 2 : 02.664000000
Run 4, Statement 1 : 01.629000000
Run 4, Statement 2 : 02.657000000
Run 5, Statement 1 : 01.636000000
Run 5, Statement 2 : 02.688000000
128 bind values
Run 1, Statement 1 : 04.010000000
Run 1, Statement 2 : 06.275000000
Run 2, Statement 1 : 03.749000000
Run 2, Statement 2 : 05.440000000
Run 3, Statement 1 : 03.985000000
Run 3, Statement 2 : 05.387000000
Run 4, Statement 1 : 03.807000000
Run 4, Statement 2 : 05.688000000
Run 5, Statement 1 : 03.782000000
Run 5, Statement 2 : 05.803000000
The number of bind values doesn’t really seem to matter. There’s always a constant overhead of using the array bind variable compared to the IN list, but that might as well be a benchmarking error. For instance, when I add the /*+GATHER_PLAN_STATISTICS*/ hint to both queries, interestingly, the one with the array got significantly faster, whereas the IN list one was not affected… Weird?
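For reference, the hint goes right after the SELECT keyword in both statements, e.g.:
SELECT /*+GATHER_PLAN_STATISTICS*/ *
FROM film
JOIN film_actor USING (film_id)
JOIN actor USING (actor_id)
WHERE film_id IN (?, ?, ?, ?)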
Conclusion
This article doesn’t go into why there’s such a big difference for small lists, when the benefit of the array only becomes apparent for quite large lists.
But it has once again shown that we must not optimise prematurely in SQL, but measure, measure, measure things. IN lists in dynamic SQL queries can be a big issue in production when they lead to cursor cache / plan cache saturation and a lot of “hard parsing”. So, the benefit of using the array is much more drastic when the list is big, as we can recycle execution plans much more often than with IN lists.
But chances are that IN lists are faster for single executions.
In any case: choose carefully when following advice that you find somewhere on the Internet. Also when following this advice. I ran the benchmark on PostgreSQL 9.5 and Oracle 11gR2 XE, neither of which is the latest database version. Try to measure things again on your side, to be sure that your “improvement” is really an actual improvement! And if in doubt, don’t optimise until you’re sure you actually have a problem.
