Channel: sql – Java, SQL and jOOQ.

Don’t Use the String Concatenation “Trick” in SQL Predicates

In SQL, quite often, we want to compare several values with each other. For instance, when we’re looking for a specific user by their first and last names, we’ll write a query like this one:

SELECT *
FROM customer
WHERE first_name = 'SUSAN'
AND last_name = 'WILSON';

We’re getting:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          8   SUSAN        WILSON

Surely, everyone agrees that this is correct and perfectly fine as we probably have an index on these two columns (or on at least one of them) to speed up such queries:

CREATE INDEX idx_customer_name ON customer (last_name, first_name);

The execution plan is thus optimal, e.g. with Oracle:

-------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |       |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER          |     1 |
|*  2 |   INDEX RANGE SCAN                  | IDX_CUSTOMER_NAME |     1 |
-------------------------------------------------------------------------

But sometimes, we cannot use AND to connect two predicates. In particular, that’s not possible with an IN predicate, so people sometimes resort to using string concatenation, because that seems to work and make sense.

For instance, let’s find all customers whose first and last names match those of an actor (as always, using the Sakila database):

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And yes indeed, what we’re getting here is the correct answer:

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS

But that answer is only accidentally correct!

Because we weren’t looking for customers called

first_name = 'JENNIFER' AND last_name = 'DAVIS'

We were looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Want proof? Let’s add a new customer:

INSERT INTO customer (customer_id, first_name, last_name )
VALUES               (600        , 'JENNI'   , 'FERDAVIS');

Yeah right? No one is called FERDAVIS. Or are they? As good programmers, we closely observe Murphy’s Law (i.e. always look both left and right when crossing a street).

In any case, let’s run our query again:

SELECT *
FROM customer
WHERE first_name || last_name IN (
  SELECT first_name || last_name
  FROM actor
)

And observe the result!

CUSTOMER_ID   FIRST_NAME   LAST_NAME
------------------------------------
          6   JENNIFER     DAVIS
        600   JENNI        FERDAVIS

Of course, because our predicate was really looking for customers called

first_name || last_name = 'JENNIFERDAVIS'

Which matches in both cases:

-- What we expected
first_name || last_name = 'JENNIFER' || 'DAVIS'

-- What we got
first_name || last_name = 'JENNI' || 'FERDAVIS'

Notice that I only added this customer to the customer table, not to the actor table. There’s no actor by the name FERDAVIS, so the result is clearly wrong.
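The false match is easy to reproduce outside of Oracle as well. Here is a minimal sketch in Python, using the standard library's sqlite3 module and an in-memory database. The two tiny tables are stand-ins for the Sakila customer and actor tables (not the real data set), the actor id is arbitrary, and the helper name find_by_concatenation is made up for this illustration:

```python
import sqlite3


def find_by_concatenation():
    """Run the concatenation query against a tiny stand-in for Sakila."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (
          customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
        CREATE TABLE actor (
          actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);

        -- Only one actor exists: JENNIFER DAVIS (the id is arbitrary)
        INSERT INTO actor VALUES (1, 'JENNIFER', 'DAVIS');

        INSERT INTO customer VALUES (6, 'JENNIFER', 'DAVIS');
        INSERT INTO customer VALUES (8, 'SUSAN', 'WILSON');
        -- The extra customer whose names merely concatenate to the same string
        INSERT INTO customer VALUES (600, 'JENNI', 'FERDAVIS');
    """)
    rows = conn.execute("""
        SELECT customer_id, first_name, last_name
        FROM customer
        WHERE first_name || last_name IN (
          SELECT first_name || last_name
          FROM actor
        )
        ORDER BY customer_id
    """).fetchall()
    conn.close()
    return rows


print(find_by_concatenation())
# [(6, 'JENNIFER', 'DAVIS'), (600, 'JENNI', 'FERDAVIS')]
```

Customer 600 shows up even though the stand-in actor table contains no such actor, which is the same wrong result as in the Oracle example.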

AHA! Let’s use an “impossible” separator

So, we might try to fix this as follows:

SELECT *
FROM customer
WHERE first_name || '###' || last_name IN (
  SELECT first_name || '###' || last_name
  FROM actor
)

And now, the result is again correct. We get only JENNIFER DAVIS because we were looking for:

first_name || '###' || last_name = 'JENNIFER###DAVIS'

This works quite well for a while, as the separator is “impossible” (i.e. improbable) in actual data. But we shouldn’t trust our judgement, because… Murphy’s Law. So you might think: better use an even rarer separator, e.g. (if your database supports proper character sets):

SELECT *
FROM customer
WHERE first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

The use of emojis should indicate what my opinion of this approach is.
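A separator only moves the problem: as soon as one of the columns can contain the separator itself, two different name pairs collapse into the same string again. The following is a small sketch of that effect, again in Python with sqlite3. The helper concat_matches and the sample values are hypothetical and exist only for this illustration:

```python
import sqlite3

SEPARATOR = "###"


def concat_matches(customer, actor):
    """Return True if the separator-based predicate treats two
    (first_name, last_name) pairs as equal."""
    conn = sqlite3.connect(":memory:")
    # Same comparison as in the query above, applied to bind values
    (matched,) = conn.execute(
        "SELECT (? || ? || ?) = (? || ? || ?)",
        (customer[0], SEPARATOR, customer[1],
         actor[0], SEPARATOR, actor[1]),
    ).fetchone()
    conn.close()
    return bool(matched)


# The separator fixes the original false match...
print(concat_matches(("JENNI", "FERDAVIS"), ("JENNIFER", "DAVIS")))  # False

# ...but two different pairs still collide once the data contains '###'
print(concat_matches(("X###Y", "Z"), ("X", "Y###Z")))  # True
```

Both pairs in the last call concatenate to 'X###Y###Z', so the predicate cannot tell them apart, no matter how exotic the separator is.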

Too bad for performance, though

Remember the index we created earlier? In fact, we also have such an index on the ACTOR table:

CREATE INDEX idx_actor_name ON actor (last_name, first_name);

And now, let’s assume our query is a bit different. We’ll be looking only for customers whose address_id is 10:

SELECT *
FROM customer
WHERE address_id = 10
AND first_name || '🙈🙉🙊' || last_name IN (
  SELECT first_name || '🙈🙉🙊' || last_name
  FROM actor
)

Now, our querymoji does use the index, but only for an INDEX FULL SCAN, so it’s only slightly faster than scanning the entire actor table:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|*  1 |  HASH JOIN SEMI                      |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|   4 |   INDEX FULL SCAN                    | IDX_ACTOR_NAME             |     2 |
-----------------------------------------------------------------------------------

And what’s worse, even if all the cardinality estimates correctly indicate only 1-2 rows, we’ll perform a HASH JOIN and load the full index for it! We should be running a NESTED LOOP instead.

Is there a better way? Yes! Use row constructors to compare several values at once:

SELECT *
FROM customer
WHERE address_id = 10
AND (first_name, last_name) IN (
  SELECT first_name, last_name
  FROM actor
);

Or, if your database doesn’t support this syntax (luckily, Oracle and PostgreSQL do, for instance), you can resort to an equivalent EXISTS predicate:

SELECT *
FROM customer c
WHERE address_id = 10
AND EXISTS (
  SELECT 1
  FROM actor a
  WHERE c.first_name = a.first_name
  AND c.last_name = a.last_name
);
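To check that both forms compare the columns individually rather than as one merged string, here is a hedged sketch in Python with sqlite3 (row values in IN need SQLite 3.15 or later, which current Python builds normally bundle). The schema and rows are a made-up miniature of Sakila, and the names ROW_VALUE_QUERY, EXISTS_QUERY and run are assumptions of this example:

```python
import sqlite3

ROW_VALUE_QUERY = """
    SELECT customer_id
    FROM customer
    WHERE address_id = 10
    AND (first_name, last_name) IN (
      SELECT first_name, last_name
      FROM actor
    )
    ORDER BY customer_id
"""

EXISTS_QUERY = """
    SELECT customer_id
    FROM customer c
    WHERE address_id = 10
    AND EXISTS (
      SELECT 1
      FROM actor a
      WHERE c.first_name = a.first_name
      AND c.last_name = a.last_name
    )
    ORDER BY customer_id
"""


def run(query):
    """Execute a query against a fresh in-memory stand-in for Sakila."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (
          customer_id INTEGER PRIMARY KEY, first_name TEXT,
          last_name TEXT, address_id INTEGER);
        CREATE TABLE actor (
          actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
        CREATE INDEX idx_actor_name ON actor (last_name, first_name);

        INSERT INTO actor VALUES (1, 'JENNIFER', 'DAVIS');

        -- Both JENNIFER DAVIS and JENNI FERDAVIS live at address 10 here
        INSERT INTO customer VALUES (6, 'JENNIFER', 'DAVIS', 10);
        INSERT INTO customer VALUES (600, 'JENNI', 'FERDAVIS', 10);
        INSERT INTO customer VALUES (8, 'SUSAN', 'WILSON', 12);
    """)
    ids = [row[0] for row in conn.execute(query)]
    conn.close()
    return ids


print(run(ROW_VALUE_QUERY))  # [6]
print(run(EXISTS_QUERY))     # [6]
```

Neither query returns customer 600, because 'JENNI' is compared with 'JENNIFER' and 'FERDAVIS' with 'DAVIS' column by column.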

Both of these queries are exactly equivalent. Instead of the previous hash join, they result in a nested loop semi join, which is perfectly reasonable for these small tables. We can now use the IDX_ACTOR_NAME for a quick INDEX RANGE SCAN operation:

-----------------------------------------------------------------------------------
| Id  | Operation                            | Name                       | Rows  |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                            |       |
|   1 |  NESTED LOOPS SEMI                   |                            |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMER                   |     1 |
|*  3 |    INDEX RANGE SCAN                  | IDX_CUSTOMER_FK_ADDRESS_ID |     1 |
|*  4 |   INDEX RANGE SCAN                   | IDX_ACTOR_NAME             |     1 |
-----------------------------------------------------------------------------------

But let’s not trust the estimated plans. Let’s benchmark (more info about benchmarking SQL here):

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP WITH TIME ZONE;
  v_repeat CONSTANT NUMBER := 2500;
BEGIN

  -- Repeat benchmark several times to avoid warmup penalty
  FOR r IN 1..5 LOOP
    v_ts := SYSTIMESTAMP;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND first_name || '###' || last_name IN (
          SELECT first_name || '###' || last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    dbms_output.put_line('Run ' || r ||', Statement 1 : '
      || (SYSTIMESTAMP - v_ts));
    v_ts := SYSTIMESTAMP;

    FOR i IN 1..v_repeat LOOP
      FOR rec IN (
        SELECT first_name, last_name
        FROM customer
        WHERE address_id = 10
        AND (first_name, last_name) IN (
          SELECT first_name, last_name
          FROM actor
        )
      ) LOOP
        NULL;
      END LOOP;
    END LOOP;

    dbms_output.put_line('Run ' || r ||', Statement 2 : '
      || (SYSTIMESTAMP - v_ts));
  END LOOP;
END;
/

As can be seen here, the query using the row constructor is roughly 4 to 6 times faster in every run (about 0.06 to 0.09 seconds versus about 0.34 to 0.39 seconds for 2500 executions), because it can use the index as it should:

Run 1, Statement 1 : +000000000 00:00:00.374471000
Run 1, Statement 2 : +000000000 00:00:00.062830000
Run 2, Statement 1 : +000000000 00:00:00.364168000
Run 2, Statement 2 : +000000000 00:00:00.066252000
Run 3, Statement 1 : +000000000 00:00:00.359559000
Run 3, Statement 2 : +000000000 00:00:00.063898000
Run 4, Statement 1 : +000000000 00:00:00.344775000
Run 4, Statement 2 : +000000000 00:00:00.086060000
Run 5, Statement 1 : +000000000 00:00:00.394163000
Run 5, Statement 2 : +000000000 00:00:00.063176000

Now, imagine we were running this against a much more impressive data set than the Sakila database.

Conclusion

If you’re ever thinking about concatenating two fields for a comparison, think again. There are two major caveats that should indicate you’re about to do something silly:

  • There’s a major risk of your query being subtly wrong (accidental matches between JENNIFER DAVIS and JENNI FERDAVIS)
  • There’s a major risk of your query being quite slow

So, as a rule of thumb, don’t use concatenation in predicates. There’s (almost) always a better way.

Read also: Why You Should (Sometimes) Avoid Expressions in SQL Predicates


Filed under: sql Tagged: Concatenation, Indexes, performance, predicates, sql
