In SQL, quite often, we want to compare several values with each other. For instance, when we’re looking for a specific user by their first and last names, we’ll write a query like this one:
SELECT *
FROM customer
WHERE first_name = 'SUSAN'
AND last_name = 'WILSON';
We’re getting:
CUSTOMER_ID FIRST_NAME LAST_NAME
------------------------------------
          8 SUSAN      WILSON
Surely, everyone agrees that this is correct and perfectly fine as we probably have an index on these two columns (or on at least one of them) to speed up such queries:
CREATE INDEX idx_customer_name ON customer (last_name, first_name);
The execution plan is thus optimal, e.g. with Oracle:
-------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |       |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER          |     1 |
|*  2 |   INDEX RANGE SCAN                  | IDX_CUSTOMER_NAME |     1 |
-------------------------------------------------------------------------
But sometimes, we cannot use AND to connect two predicates. In particular, that’s not possible with an IN predicate, so people sometimes resort to string concatenation, because that seems to work and make sense.
For instance, let’s find all customers whose first and last names match those of an actor (as always, using the Sakila database):
SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)
And yes indeed, what we’re getting here is the correct answer:
CUSTOMER_ID FIRST_NAME LAST_NAME
------------------------------------
          6 JENNIFER   DAVIS
But that answer is only accidentally correct!
Because we weren’t looking for customers called
first_name = 'JENNIFER' AND last_name = 'DAVIS'
We were looking for customers called
first_name || last_name = 'JENNIFERDAVIS'
Want proof? Let’s add a new customer:
INSERT INTO customer (customer_id, first_name, last_name)
VALUES (600, 'JENNI', 'FERDAVIS');
Yeah right? No one is called FERDAVIS. Or are they? As good programmers, we closely observe Murphy’s Law (i.e. always look both left and right when crossing a street).
In any case, let’s run our query again:
SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)
And observe the result!
CUSTOMER_ID FIRST_NAME LAST_NAME
------------------------------------
          6 JENNIFER   DAVIS
        600 JENNI      FERDAVIS
Of course, because our predicate was really looking for customers called
first_name || last_name = 'JENNIFERDAVIS'
Which matches in both cases:
-- What we expected
first_name || last_name = 'JENNIFER' || 'DAVIS'

-- What we got
first_name || last_name = 'JENNI' || 'FERDAVIS'
Notice that I only added this customer to the customer table, not to the actor table. There’s no actor by the name FERDAVIS, so the result is clearly wrong.
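To try this without touching a real Sakila installation, here is a minimal sketch that inlines the three relevant rows as common table expressions. It assumes PostgreSQL syntax (on Oracle, each VALUES row would typically become a SELECT ... FROM dual), and the inline customer and actor data are stand-ins for the real tables, not the Sakila schema itself:

```sql
-- Assumption: PostgreSQL syntax. The CTEs below are hypothetical stand-ins
-- for the Sakila CUSTOMER and ACTOR tables, reduced to the relevant rows.
WITH customer (customer_id, first_name, last_name) AS (
  VALUES (6,   'JENNIFER', 'DAVIS'),
         (600, 'JENNI',    'FERDAVIS')
),
actor (first_name, last_name) AS (
  VALUES ('JENNIFER', 'DAVIS')
)
SELECT customer_id, first_name, last_name
FROM customer
-- Both customers concatenate to 'JENNIFERDAVIS', so both rows match
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)
ORDER BY customer_id;
```

Both customer 6 and customer 600 come back, even though the only actor in the inline data is JENNIFER DAVIS.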
AHA! Let’s use an “impossible” separator
So, we might proceed to fixing this as such:
SELECT *
FROM customer
WHERE first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM actor
)
And now, the result is again correct. We get only JENNIFER DAVIS because we were looking for:
first_name || '###' || last_name = 'JENNIFER###DAVIS'
This works quite well for a while, as the separator is “impossible” (i.e. improbable) to encounter in actual data. But we shouldn’t trust our judgement, because… Murphy’s Law. So you might think: better use a rarer separator, e.g. (if your database supports proper character sets):
SELECT *
FROM customer
WHERE first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)
The use of emojis should indicate what my opinion of this approach is.
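To see why no separator is truly safe, consider what happens when the separator itself shows up in the data. The following sketch assumes PostgreSQL syntax and uses deliberately contrived, hypothetical names (they are not in Sakila) where one side ends with the separator and the other starts with it:

```sql
-- Assumption: PostgreSQL syntax; c and a are hypothetical one-row stand-ins
-- for a customer and an actor whose names happen to contain '###'.
WITH c (first_name, last_name) AS (VALUES ('JENNIFER###', 'DAVIS')),
     a (first_name, last_name) AS (VALUES ('JENNIFER', '###DAVIS'))
SELECT
  -- Both sides evaluate to 'JENNIFER######DAVIS'
  (c.first_name || '###' || c.last_name)
    = (a.first_name || '###' || a.last_name) AS concat_match,
  -- Comparing the columns individually tells the two people apart
  (c.first_name = a.first_name AND c.last_name = a.last_name) AS column_match
FROM c
CROSS JOIN a;
```

Here concat_match is true while column_match is false. A rarer separator only makes the collision less likely, it does not rule it out.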
Too bad for performance, though
Remember that index we’ve created? Fact is, we also have such an index on the ACTOR table:
CREATE INDEX idx_actor_name ON actor (last_name, first_name);
And now, let’s assume our query is a bit different. We’ll be looking only for customers whose address_id is 10:
SELECT *
FROM customer
WHERE address_id = 10
AND first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)
Now, our querymoji is using the index indeed, but for an INDEX FULL SCAN, so it’s only slightly faster than scanning the entire actor table:
-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|*  1 |  HASH JOIN SEMI                      |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|   4 |   INDEX FULL SCAN                    | IDX_ACTOR_NAME             |     2 |
-----------------------------------------------------------------------------------
And what’s worse, even if all the cardinality estimates correctly indicate only 1-2 rows, we’ll perform a HASH JOIN and load the full index for it! We should be running a NESTED LOOP instead.
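If you want to inspect such a plan yourself on Oracle, one common way is EXPLAIN PLAN together with DBMS_XPLAN.DISPLAY. This is only a sketch: it assumes a Sakila schema with the indexes mentioned in this article, uses the '###' separator variant of the query, and the output may differ in detail from the plans shown here depending on version, statistics, and how the plan is retrieved:

```sql
-- Assumption: Oracle, with Sakila's CUSTOMER and ACTOR tables and the
-- indexes from this article already in place.
EXPLAIN PLAN FOR
SELECT *
FROM customer
WHERE address_id = 10
AND first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM actor
);

-- Show the plan that was just explained
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Note that EXPLAIN PLAN shows an estimated plan without executing the statement.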
Is there a better way? Yes! Use row constructors to compare several values at once:
SELECT *
FROM customer
WHERE address_id = 10
AND (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
);
Or, if your database doesn’t support this syntax (luckily, Oracle and PostgreSQL do, for instance), then you can resort to an equivalent EXISTS predicate:
SELECT *
FROM customer c
WHERE address_id = 10
AND EXISTS (
  SELECT 1
  FROM actor a
  WHERE c.first_name = a.first_name
  AND c.last_name = a.last_name
);
Both of these queries are exactly equivalent and result in a nested loop semi join, rather than the previous hash join, which is perfectly reasonable for these small tables. We can now use the IDX_ACTOR_NAME index for a quick INDEX RANGE SCAN operation:
-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|   1 |  NESTED LOOPS SEMI                   |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|*  4 |   INDEX RANGE SCAN                   | IDX_ACTOR_NAME             |     1 |
-----------------------------------------------------------------------------------
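Apart from the plan, the row constructor also repairs the correctness problem with JENNI FERDAVIS. As a small sketch, assuming PostgreSQL syntax and hypothetical inline data in place of the real Sakila tables (the address_id filter is left out to keep it short):

```sql
-- Assumption: PostgreSQL syntax. The CTEs are stand-ins for the Sakila
-- CUSTOMER and ACTOR tables, reduced to the rows discussed in this article.
WITH customer (customer_id, first_name, last_name) AS (
  VALUES (6,   'JENNIFER', 'DAVIS'),
         (600, 'JENNI',    'FERDAVIS')
),
actor (first_name, last_name) AS (
  VALUES ('JENNIFER', 'DAVIS')
)
SELECT customer_id, first_name, last_name
FROM customer
-- Each column is compared with its counterpart, nothing is glued together
WHERE (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
);
```

Only customer 6 is returned, because ('JENNI', 'FERDAVIS') and ('JENNIFER', 'DAVIS') differ in both columns.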
But let’s not trust the estimated plans. Let’s benchmark (more info about benchmarking SQL here)
SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2500;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND first_name || '###' || last_name IN (
          SELECT first_name || '###' || last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    dbms_output.put_line('Run ' || r ||', Statement 1 : ' || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND (first_name, last_name) IN (
          SELECT first_name, last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    dbms_output.put_line('Run ' || r ||', Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/
As can be seen here, the query using the row constructor is roughly 4 to 6 times faster in every run, because it can use the index as it should:
Run 1, Statement 1 : +000000000 00:00:00.374471000
Run 1, Statement 2 : +000000000 00:00:00.062830000
Run 2, Statement 1 : +000000000 00:00:00.364168000
Run 2, Statement 2 : +000000000 00:00:00.066252000
Run 3, Statement 1 : +000000000 00:00:00.359559000
Run 3, Statement 2 : +000000000 00:00:00.063898000
Run 4, Statement 1 : +000000000 00:00:00.344775000
Run 4, Statement 2 : +000000000 00:00:00.086060000
Run 5, Statement 1 : +000000000 00:00:00.394163000
Run 5, Statement 2 : +000000000 00:00:00.063176000
Now, imagine we were running this against some much more impressive data sets than the Sakila database.
Conclusion
If you’re ever thinking about concatenating two fields for a comparison, try again. There are two major caveats that should indicate you’re about to do something silly:
- There’s a major risk of your query being subtly wrong (accidental matches between JENNIFER DAVIS and JENNI FERDAVIS)
- There’s a major risk of your query being quite slow
So, as a rule of thumb, don’t use concatenation in predicates. There’s (almost) always a better way.
Read also: Why You Should (Sometimes) Avoid Expressions in SQL Predicates
