In a previous blog post, I wrote about why you should (almost) always default to using bind variables. There are some exceptions, which I will cover in another follow-up post, but by default, bind variables are the right choice, both from a performance and from a security perspective.
In this article, I will show an example where regrettably, bind variables are not enough, and you can still run into significant performance issues in production. That’s when you create dynamic IN lists.
What’s the problem with dynamic IN lists
Again, this blog post is applicable for most databases, but I’ll use Oracle as an example, only.
An IN
predicate that takes a list is a handy abbreviation for an equivalent OR
predicate. The following two queries are semantically equivalent:
-- IN predicate SELECT * FROM actor WHERE actor_id IN (1, 2, 3) -- OR predicate SELECT * FROM actor WHERE actor_id = 1 OR actor_id = 2 OR actor_id = 3
The execution plans are, respectively:
IN list
--------------------------------------------------------- | Id | Operation | Name | Rows | --------------------------------------------------------- | 0 | SELECT STATEMENT | | 3 | | 1 | INLIST ITERATOR | | | | 2 | TABLE ACCESS BY INDEX ROWID| ACTOR | 3 | |* 3 | INDEX UNIQUE SCAN | PK_ACTOR | 3 | --------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)
OR predicate
--------------------------------------------------------- | Id | Operation | Name | Rows | --------------------------------------------------------- | 0 | SELECT STATEMENT | | 3 | | 1 | INLIST ITERATOR | | | | 2 | TABLE ACCESS BY INDEX ROWID| ACTOR | 3 | |* 3 | INDEX UNIQUE SCAN | PK_ACTOR | 3 | --------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("ACTOR_ID"=1 OR "ACTOR_ID"=2 OR "ACTOR_ID"=3)
We get the exact same plan. And the plan is also optimal. When looking for 3 actors, we’ll run 3 primary key lookups and get a cardinality estimate of 3. Perfect, what could be wrong about this?
Remember execution plan caching
In that previous blog post, we’ve seen the detrimental effects we can get under load when execution plans are no longer cached in the execution plan cache (or in Oracle: cursor cache). But even when we’re using bind variables, we can run into these issues. Consider the following dynamic SQL queries:
SELECT count(*) FROM actor WHERE actor_id IN (?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)
Now, let’s generate these SQL queries using dynamic SQL in PL/SQL. PL/SQL is not a very dynamic SQL friendly language as can be seen in the following snippet:
SET SERVEROUTPUT ON DECLARE v_count NUMBER; FUNCTION binds (n NUMBER) RETURN VARCHAR2 IS v_result VARCHAR2(1000); BEGIN FOR i IN 1..n LOOP IF v_result IS NULL THEN v_result := ':b' || i; ELSE v_result := v_result || ', :b' || i; END IF; END LOOP; RETURN v_result; END binds; FUNCTION vals (n NUMBER) RETURN VARCHAR2 IS v_result VARCHAR2(1000); BEGIN FOR i IN 1..n LOOP IF v_result IS NULL THEN v_result := i; ELSE v_result := v_result || ', ' || i; END IF; END LOOP; RETURN v_result; END vals; BEGIN FOR i IN 1..5 LOOP EXECUTE IMMEDIATE ' DECLARE v_count NUMBER; BEGIN EXECUTE IMMEDIATE '' SELECT count(*) FROM actor WHERE actor_id IN (' || binds(i) || ') '' INTO v_count USING ' || vals(i) || '; :v_count := v_count; END; ' USING OUT v_count ; dbms_output.put_line(v_count); END LOOP; END; /
Running this snippet yields, as expected:
1 2 3 4 5
There are a few noteworthy things to say about the above:
- The functions
BINDS()
andVALS()
generate strings like:b1, :b2, :b3
and1, 2, 3
. We need these to generate dynamically a list of bind variables for theIN
predicates, as well as a list of bind values for theUSING
clause of the nestedEXECUTE IMMEDIATE
statement - The
EXECUTE IMMEDIATE .. USING
clause does not take dynamically sized bind value lists, so we have to generate that as well, in a nestedEXECUTE IMMEDIATE
statement
An appropriate gif at this point when nesting EXECUTE IMMEDIATE
like we did is this one:
Or also:
So, now that we got this covered…
… let’s look at execution plans:
SELECT s.sql_id, p.* FROM v$sql s, TABLE ( dbms_xplan.display_cursor ( s.sql_id, s.child_number, 'ALLSTATS LAST' ) ) p WHERE lower(s.sql_text) LIKE '%actor_id in %' AND lower(s.sql_text) NOT LIKE '%v$sql%' AND lower(s.sql_text) NOT LIKE '%execute_immediate%';
And the result is (some additional formatting applied):
SQL_ID cwm09k8zqp2at, child number 0 ------------------------------------- SELECT count(*) FROM actor WHERE actor_id IN (:b1) Plan hash value: 1971314150 ------------------------------------------------ | Id | Operation | Name | E-Rows | ------------------------------------------------ | 0 | SELECT STATEMENT | | | | 1 | SORT AGGREGATE | | 1 | |* 2 | INDEX UNIQUE SCAN| PK_ACTOR | 1 | ------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("ACTOR_ID"=:B1) SQL_ID 1wypurx1x0mrp, child number 0 ------------------------------------- SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2) Plan hash value: 577114894 ------------------------------------------------- | Id | Operation | Name | E-Rows | ------------------------------------------------- | 0 | SELECT STATEMENT | | | | 1 | SORT AGGREGATE | | 1 | | 2 | INLIST ITERATOR | | | |* 3 | INDEX UNIQUE SCAN| PK_ACTOR | 2 | ------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2)) SQL_ID 2y06nwgpxqttn, child number 0 ------------------------------------- SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3) Plan hash value: 577114894 ------------------------------------------------- | Id | Operation | Name | E-Rows | ------------------------------------------------- | 0 | SELECT STATEMENT | | | | 1 | SORT AGGREGATE | | 1 | | 2 | INLIST ITERATOR | | | |* 3 | INDEX UNIQUE SCAN| PK_ACTOR | 3 | ------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3)) SQL_ID d4nn8qf9n22yt, child number 0 ------------------------------------- SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4) Plan hash value: 577114894 ------------------------------------------------- | Id | Operation | Name | E-Rows | ------------------------------------------------- | 0 | SELECT STATEMENT | | | | 1 | SORT AGGREGATE | | 1 | | 2 | INLIST ITERATOR | | | |* 3 | INDEX UNIQUE SCAN| PK_ACTOR | 4 | ------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR "ACTOR_ID"=:B4)) SQL_ID 4n9b4zgxr5cwj, child number 0 ------------------------------------- SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3, :b4, :b5) Plan hash value: 577114894 ------------------------------------------------- | Id | Operation | Name | E-Rows | ------------------------------------------------- | 0 | SELECT STATEMENT | | | | 1 | SORT AGGREGATE | | 1 | | 2 | INLIST ITERATOR | | | |* 3 | INDEX UNIQUE SCAN| PK_ACTOR | 5 | ------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access(("ACTOR_ID"=:B1 OR "ACTOR_ID"=:B2 OR "ACTOR_ID"=:B3 OR "ACTOR_ID"=:B4 OR "ACTOR_ID"=:B5))
So, again, we’re down to having 5 different SQL_IDs and statements in the execution plan cache / cursor cache, even if they all derive from a single logical statement. Now, in PL/SQL, this is so hard to do, you’re probably not doing it at all.
But if you’re using a query builder (or wrote your own), like JPA’s Criteria Query, or jOOQ, then you will be writing something like this (using jOOQ, because in fact, using the Criteria Query API, dynamic SQL is about as hard to do as in PL/SQL):
for (int i = 1; i <= 5; i++) { System.out.println( DSL.using(configuration) .select(count()) .from(ACTOR) .where(ACTOR.ACTOR_ID.in( IntStream.rangeClosed(1, i) .boxed() .collect(toList()) ) .fetchOne(count()) ); }
So this query will generate the same queries as the PL/SQL code, and this really wasn’t that hard. Perhaps, in this particular case, it’s not that much of a good thing that this isn’t so hard to do!
Let’s benchmark this, again
There are different ways to benchmark this problem. One mistake we could make here is to just benchmark small IN lists with large ones, but that’s not fair. If you have to send a large IN list to the server, you cannot easily change that.
Another option is to benchmark alternative approaches of generating those IN
lists
- The default way is to have a list of exactly size N
- An alternative way would be to pad the list to a size M where
M = 2 ^ X
such thatM(X) >= N
andM(X - 1) < N
The latter approach would, instead of generating the original queries:
SELECT count(*) FROM actor WHERE actor_id IN (?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?)
… generate these ones:
SELECT count(*) FROM actor WHERE actor_id IN (?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?) SELECT count(*) FROM actor WHERE actor_id IN (?, ?, ?, ?, ?, ?, ?, ?)
Where the last bind variable is always repeated (padded) until the list is filled up to a length that is a power of 2 (and in Oracle, of course, we’d split lists longer than 1000 items into several lists).
This padding is what jOOQ already offers:
https://www.jooq.org/doc/latest/manual/sql-building/dsl-context/custom-settings/settings-in-list-padding/
And maybe, a future Hibernate will offer it too, after I’ve suggested the feature to Vlad Mihalcea from the Hibernate team:
https://hibernate.atlassian.net/browse/HHH-12469
Here’s the benchmark logic. It’s a bit more complex than previously:
SET SERVEROUTPUT ON ALTER SYSTEM FLUSH SHARED_POOL; ALTER SYSTEM FLUSH BUFFER_CACHE; CREATE TABLE results ( run NUMBER(2), stmt NUMBER(2), elapsed NUMBER ); DECLARE v_ts TIMESTAMP WITH TIME ZONE; v_repeat CONSTANT NUMBER := 20000; v_first_name actor.first_name%TYPE; v_last_name actor.last_name%TYPE; FUNCTION binds (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS v_result VARCHAR2(32767); v_n_pad NUMBER(10) := power(2, ceil(ln(n)/ln(2))); BEGIN FOR i IN 1..n LOOP IF v_result IS NULL THEN v_result := ':b' || i; ELSIF mod(i, 1000) = 0 THEN v_result := v_result || ') OR actor_id IN (:b' || i; ELSE v_result := v_result || ', :b' || i; END IF; END LOOP; IF pad THEN FOR i IN n + 1 .. v_n_pad LOOP IF mod(i, 1000) = 0 THEN v_result := v_result || ') OR actor_id IN (:b' || i; ELSE v_result := v_result || ', :b' || i; END IF; END LOOP; END IF; RETURN v_result; END binds; FUNCTION vals (n NUMBER, pad BOOLEAN) RETURN VARCHAR2 IS v_result VARCHAR2(32767); v_n_pad NUMBER(10) := power(2, ceil(ln(n)/ln(2))); BEGIN FOR i IN 1..n LOOP IF v_result IS NULL THEN v_result := i; ELSE v_result := v_result || ', ' || i; END IF; END LOOP; IF pad THEN FOR i IN n + 1 .. v_n_pad LOOP v_result := v_result || ', ' || n; END LOOP; END IF; RETURN v_result; END vals; PROCEDURE run(i NUMBER, pad BOOLEAN) IS BEGIN EXECUTE IMMEDIATE ' DECLARE v_count NUMBER; BEGIN EXECUTE IMMEDIATE '' SELECT count(*) FROM actor WHERE actor_id IN (' || binds(i, pad) || ') '' INTO v_count USING ' || vals(i, pad) || '; END; '; END run; BEGIN -- Repeat the whole benchmark several times to avoid warmup penalty FOR r IN 1..5 LOOP v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP run(mod(i, 100) + 1, FALSE); END LOOP; INSERT INTO results VALUES (r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP run(mod(i, 100) + 1, TRUE); END LOOP; INSERT INTO results VALUES (r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); END LOOP; FOR rec IN ( SELECT run, stmt, CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio FROM results ) LOOP dbms_output.put_line('Run ' || rec.run || ', Statement ' || rec.stmt || ' : ' || rec.ratio); END LOOP; END; / DROP TABLE results;
Bear in mind the following:
- We run the benchmark 5 times
- In each run, we run 20000 queries with each technique
- Those 20000 queries have between 1 and 100 elements in the IN list
The results show that both approaches fare equally well:
Run 1, Statement 1 : 1.16944 Run 1, Statement 2 : 1.11204 Run 2, Statement 1 : 1.06086 Run 2, Statement 2 : 1.03591 Run 3, Statement 1 : 1.03589 Run 3, Statement 2 : 1.03605 Run 4, Statement 1 : 1.33935 Run 4, Statement 2 : 1.2822 Run 5, Statement 1 : 1 Run 5, Statement 2 : 1.04648
This means that there is no significant overhead caused by the padding in this benchmark. That’s good news.
Running the following query shows the number of executions per statement:
SELECT executions, regexp_count(sql_text, ':'), sql_text FROM v$sql WHERE lower(sql_text) LIKE '%actor_id in %' AND lower(sql_text) NOT LIKE '%v$sql%' AND lower(sql_text) NOT LIKE '%execute_immediate%' ORDER BY sql_text;
The result being:
EXECS BINDS SQL_TEXT 2000 1 SELECT count(*) FROM actor WHERE actor_id IN (:b1) 2000 2 SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2) 1000 3 SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, :b3) 3000 4 ... 1000 5 ... 1000 6 ... 1000 7 ... 5000 8 ... 1000 9 ... 1000 10 ... 1000 11 ... 1000 12 ... 1000 13 ... 1000 14 ... 1000 15 ... 9000 16 ... ... 1000 100 ... 36000 128 ...
The version that creates exactly sized IN lists is always executed 1000 times. The version that has padded IN lists sized to the next power of 2 (M = 2 ^ X) is executed 1000 * X times (plus another 1000 times because there’s also an IN list of that exact size)
At the end, there are some “lone” 128-element sized IN lists. This is expected. The number of different SQL statements is:
- 100 for the exactly sized IN lists
- 7 for the padded IN lists (128 = 2 ^ 7)
So, thus far, we haven’t gained anything, but also not lost much.
What if the lists get longer
The real problem isn’t the individual query execution time, which is fine in both cases, but the effect that dynamic IN lists have on a system. Let’s generate IN lists of up to size 4096, and bearing in mind that Oracle won’t allow this, the lists will be of the form:
SELECT count(*) FROM actor WHERE actor_id IN (:b1, :b2, ..., :b1000) OR actor_id IN (:b1001, :b1002, ..., :b2000) OR actor_id IN (:b2001, ...)
Now, with a maximum size of 4096, there are quite a few distinct statements that need to be parsed. As opposed to only 12 when we use the padded in lists. The padded size grows logarithmically, so we’ll never really run into this same problem again, at least not for a single IN list.
How does the comparison fare in another benchmark? We must be careful not to overdo things. Running 2x 20000 queries with variable sized IN lists can easily run for more than an hour on Oracle. Perhaps, it’s Oracle’s way of punishing you for doing something you shouldn’t: Having super-long IN lists. I’ve played around with different sets of parameters to the benchmark and the clear message here is that having long bind variable lists creates a bottleneck per se (i.e. the binding of the variables takes time), so the difference in parse times isn’t as drastic as it was before, for individual query executions. The overall improvement is this:
Run 1, Statement 1 : 1.42696 Run 1, Statement 2 : 1
So, we get around a 1.5x increase of performance when repeating the bind values. Yet still, the fact that we have far less cached execution plans does matter, especially when we start exhausting the cursor cache, because with this approach, at least this exhaustion is not going to happen.
Alternative options
The above shows that if you really cannot get around an IN list, a simple, viable workaround is to pad the lists. But often, you can get around them. In databases like Oracle and PostgreSQL, that offer array types, you can consider using those once your lists start growing. I’ve blogged about this in a previous blog post:
https://blog.jooq.org/2017/03/30/sql-in-predicate-with-in-list-or-with-array-which-is-faster
If your lists become huge, a bulk/batch inserting the IDs into a temporary table might be a far better option, and then semi-join that one instead:
SELECT count(*) FROM actor WHERE actor_id IN ( SELECT id FROM temp_table )
And ideally, if your list is really not dynamic, but originates from some other table, then please, stop passing around IDs and semi-join the original query that produced the IDs in the first place:
SELECT count(*) FROM actor WHERE actor_id IN ( SELECT id FROM other_table WHERE ... )
That last option should always be your first choice as it will very likely outperform all the others.