Don’t Use the String Concatenation “Trick” in SQL Predicates

In SQL, quite often, we want to compare several values with each other. For instance, when we’re looking for a specific user by their first and last names, we’ll write a query like this one:

SELECT *
FROM customer
WHERE first_name = 'SUSAN'
AND last_name = 'WILSON';

We’re getting:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          8   SUSAN        WILSON

Surely, everyone agrees that this is correct and perfectly fine as we probably have an index on these two columns (or on at least one of them) to speed up such queries:

CREATE INDEX idx_customer_name ON customer (last_name, first_name);

The execution plan is thus optimal, e.g. with Oracle:

-------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |       |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER          |     1 |
|*  2 |   INDEX RANGE SCAN                  | IDX_CUSTOMER_NAME |     1 |
-------------------------------------------------------------------------

But sometimes, we cannot use AND to connect two predicates. In particular, that’s not possible with an IN predicate, so people sometimes resort to using string concatenation, because that seems to work and make sense.

For instance, let’s find all customers whose first and last names match those of an actor (as always, using the Sakila database):

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And yes indeed, what we’re getting here is the correct answer:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS

But that answer is only accidentally correct!

Because we weren’t looking for customers called

first_name = 'JENNIFER' AND last_name = 'DAVIS'

We were looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Want proof? Let’s add a new customer:

INSERT INTO customer (customer_id, first_name, last_name )
VALUES               (600        , 'JENNI'   , 'FERDAVIS');

Yeah right? No one is called FERDAVIS. Or are they? As good programmers, we closely observe Murphy’s Law (i.e. always look both left and right when crossing a street).

In any case, let’s run our query again:

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And observe the result!

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS
        600   JENNI        FERDAVIS

Of course, because our predicate was really looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Which matches in both cases:

-- What we expected
first_name || last_name = 'JENNIFER' || 'DAVIS'

-- What we got
first_name || last_name = 'JENNI' || 'FERDAVIS'

Notice that I only added this customer to the customer table, not to the actor table. There’s no actor by the name FERDAVIS, so the result is clearly wrong.
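If you don’t have an Oracle instance with Sakila at hand, the effect can be reproduced with a minimal sketch using Python’s built-in sqlite3 module. The two tables below are heavily reduced stand-ins for the Sakila tables (only the name columns, three hand-picked rows, and an arbitrary actor_id), so treat this as an illustration under those assumptions, not as the real schema:

```python
import sqlite3

# In-memory stand-in for the two Sakila tables; only the columns
# needed for the name comparison are modelled (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor    (actor_id    INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);

    INSERT INTO actor    VALUES (4,   'JENNIFER', 'DAVIS');
    INSERT INTO customer VALUES (6,   'JENNIFER', 'DAVIS');
    INSERT INTO customer VALUES (600, 'JENNI',    'FERDAVIS');
""")

# Same predicate as in the article: compare the concatenated names.
concat_matches = conn.execute("""
    SELECT customer_id, first_name, last_name
    FROM customer
    WHERE first_name || last_name IN (
      SELECT first_name || last_name
      FROM actor
    )
    ORDER BY customer_id
""").fetchall()

print(concat_matches)
# [(6, 'JENNIFER', 'DAVIS'), (600, 'JENNI', 'FERDAVIS')]
```

Customer 600 shows up even though the actor table contains no FERDAVIS, because both customers concatenate to the same string, 'JENNIFERDAVIS'.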

AHA! Let’s use an “impossible” separator

So, we might try to fix this as follows:

SELECT *
FROM customer
WHERE first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM actor
)

And now, the result is again correct. We get only JENNIFER DAVIS because we were looking for:

first_name || '###' || last_name = 'JENNIFER###DAVIS'

This works quite well for a while, as the separator is quite “impossible” (i.e. improbable) in actual data. But we shouldn’t trust our judgement, because… Murphy’s Law. So you might think: better use an even rarer separator, e.g. (if your database supports proper character sets):

SELECT *
FROM customer
WHERE first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

The use of emojis should indicate what my opinion of this approach is.
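To see why no separator is ever truly “impossible”, here is a small sketch, again using Python’s sqlite3 module with made-up tables. The names 'A###B', 'C', 'A' and 'B###C' and the customer ID 601 are purely hypothetical. All it takes is one row whose data contains the separator:

```python
import sqlite3

SEP = "###"  # the supposedly "impossible" separator

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor    (first_name TEXT, last_name TEXT);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);

    INSERT INTO actor    VALUES ('JENNIFER', 'DAVIS');
    INSERT INTO actor    VALUES ('A###B', 'C');        -- hypothetical: separator inside the data
    INSERT INTO customer VALUES (6,   'JENNIFER', 'DAVIS');
    INSERT INTO customer VALUES (601, 'A', 'B###C');   -- hypothetical: a different person
""")

# Both 'A###B' || '###' || 'C' and 'A' || '###' || 'B###C'
# produce the same string: 'A###B###C'.
separator_matches = conn.execute("""
    SELECT customer_id
    FROM customer
    WHERE first_name || ? || last_name IN (
      SELECT first_name || ? || last_name
      FROM actor
    )
    ORDER BY customer_id
""", (SEP, SEP)).fetchall()

print(separator_matches)
# [(6,), (601,)]
```

Customer 601 matches an actor with a different first and last name. A rarer separator only makes the collision less likely. It does not remove it.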

Too bad for performance, though

Remember that index we’ve created? Fact is, we also have such an index on the ACTOR table:

CREATE INDEX idx_actor_name ON actor (last_name, first_name);

And now, let’s assume our query is a bit different. We’ll be looking only for customers whose address_id is 10:

SELECT *
FROM customer
WHERE address_id = 10
AND first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

Now, our querymoji is using the index indeed, but for an INDEX FULL SCAN, so it’s only slightly faster than scanning the entire actor table:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|*  1 |  HASH JOIN SEMI                      |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|   4 |   INDEX FULL SCAN                    | IDX_ACTOR_NAME             |     2 |
-----------------------------------------------------------------------------------

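And what’s worse, even though all the cardinality estimates correctly indicate only 1-2 rows, we’ll perform a HASH JOIN and load the full index for it! The index is on the last_name and first_name columns, not on the concatenated expression, so the database cannot look up individual actors by name in it. We should be running a NESTED LOOP instead.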

Is there a better way? Yes! Use row constructors to compare several values at once:

SELECT *
FROM customer
WHERE address_id = 10
AND (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
);

Or, if your database doesn’t support this syntax (luckily, Oracle and PostgreSQL do, for instance), then you can resort to an equivalent EXISTS predicate:

SELECT *
FROM customer c
WHERE address_id = 10
AND EXISTS (
  SELECT 1
  FROM actor a
  WHERE c.first_name = a.first_name
  AND c.last_name = a.last_name
);

Both of these queries are exactly equivalent. They result in a nested loop semi join rather than the previous hash join, and a nested loop is the sensible choice for these small tables. We can now use the IDX_ACTOR_NAME for a quick INDEX RANGE SCAN operation:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|   1 |  NESTED LOOPS SEMI                   |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|*  4 |   INDEX RANGE SCAN                   | IDX_ACTOR_NAME             |     1 |
-----------------------------------------------------------------------------------
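The correctness side of this can also be checked without Oracle. The following is a minimal sketch using Python’s sqlite3 module, assuming an SQLite version of 3.15 or later (which is when row values were added) and a reduced, made-up version of the two tables. It runs the row constructor query and the EXISTS query against the same data, including the troublesome JENNI FERDAVIS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced stand-ins for the Sakila tables (an assumption, not the real schema)
    CREATE TABLE actor    (actor_id    INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
                           address_id  INTEGER);
    CREATE INDEX idx_actor_name ON actor (last_name, first_name);

    INSERT INTO actor    VALUES (4,   'JENNIFER', 'DAVIS');
    INSERT INTO customer VALUES (6,   'JENNIFER', 'DAVIS',    10);
    INSERT INTO customer VALUES (600, 'JENNI',    'FERDAVIS', 10);
""")

# Row constructor ("row value" in SQLite terminology)
row_value_result = conn.execute("""
    SELECT customer_id
    FROM customer
    WHERE address_id = 10
    AND (first_name, last_name) IN (
      SELECT first_name, last_name
      FROM actor
    )
    ORDER BY customer_id
""").fetchall()

# Equivalent correlated EXISTS predicate
exists_result = conn.execute("""
    SELECT customer_id
    FROM customer c
    WHERE address_id = 10
    AND EXISTS (
      SELECT 1
      FROM actor a
      WHERE c.first_name = a.first_name
      AND c.last_name = a.last_name
    )
    ORDER BY customer_id
""").fetchall()

print(row_value_result, exists_result)
# [(6,)] [(6,)]
```

Both forms compare the columns individually, so customer 600 no longer matches. The execution plans SQLite chooses will of course differ from the Oracle plans shown in this article.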

But let’s not trust the estimated plans. Let’s benchmark:

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2500;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND first_name || '###' || last_name IN (
          SELECT first_name || '###' || last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    dbms_output.put_line('Run ' || r ||', Statement 1 : '
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND (first_name, last_name) IN (
          SELECT first_name, last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    dbms_output.put_line('Run ' || r ||', Statement 2 : '
      || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

The benchmark shows that the query using the row constructor is roughly four to six times faster in every run, as it can use the index properly:

Run 1, Statement 1 : +000000000 00:00:00.374471000
Run 1, Statement 2 : +000000000 00:00:00.062830000
Run 2, Statement 1 : +000000000 00:00:00.364168000
Run 2, Statement 2 : +000000000 00:00:00.066252000
Run 3, Statement 1 : +000000000 00:00:00.359559000
Run 3, Statement 2 : +000000000 00:00:00.063898000
Run 4, Statement 1 : +000000000 00:00:00.344775000
Run 4, Statement 2 : +000000000 00:00:00.086060000
Run 5, Statement 1 : +000000000 00:00:00.394163000
Run 5, Statement 2 : +000000000 00:00:00.063176000

Now, imagine we were running this against much more impressive data sets than the Sakila database.

Conclusion

If you’re ever thinking about concatenating two fields for a comparison, think again. There are two major caveats that should indicate you’re about to do something silly:

There’s a major risk of your query being subtly wrong (accidental matches between JENNIFER DAVIS and JENNI FERDAVIS)
There’s a major risk of your query being quite slow

So, as a rule of thumb, don’t use concatenation in predicates. There’s (almost) always a better way.

Filed under: sql Tagged: Concatenation, Indexes, performance, predicates, sql