Quantcast
Channel: sql – Java, SQL and jOOQ.
Viewing all articles
Browse latest Browse all 426

The Many Flavours of the Arcane SQL MERGE Statement

$
0
0

The SQL MERGE statement is a device whose mystery is only exceeded by its power.

A simple example shows its full power according to standard SQL. Imagine you have a production table for product prices, and a staging table from which you want to load the latest prices. For once, I’m using the Db2 LuW MERGE syntax, because that’s the most standards compliant syntax out there (among the dialects we support in jOOQ):

DROP TABLE IF EXISTS prices;
DROP TABLE IF EXISTS staging;

CREATE TABLE prices (
  product_id BIGINT NOT NULL PRIMARY KEY,
  price DECIMAL(10, 2) NOT NULL,
  price_date DATE NOT NULL,
  update_count BIGINT NOT NULL
);

CREATE TABLE staging (
  product_id BIGINT NOT NULL PRIMARY KEY,
  price DECIMAL(10, 2) NOT NULL
);

DELETE FROM prices;
DELETE FROM staging;
INSERT INTO staging 
VALUES (1, 100.00),
       (2, 125.00),
       (3, 150.00);

So, we’ve loaded a few records in our staging table, which we now want to merge over to the prices table. We could just insert them, easily, but we’ll stage more prices later on, e.g. these ones:

DELETE FROM staging;
INSERT INTO staging 
VALUES (1, 100.00),
       (2,  99.00),
       (4, 300.00);

We want our logic to be this:

  • All new IDs in the staging table should just be inserted into the prices table.
  • Existing IDs should be updated if and only if the price has changed. In that case, the update_count should increase.
  • Prices that are no longer encountered in the staging table should be deleted from the prices table, to implement a full sync, rather than a delta sync, for the sake of the example. We could also add a “command” column that contains the instruction on whether data should be updated or deleted, to implement a delta sync.

So, this is the Db2 (and standards compliant) MERGE statement that we use for the job:

MERGE INTO prices AS p
USING (
  SELECT COALESCE(p.product_id, s.product_id) AS product_id, s.price
  FROM prices AS p
  FULL JOIN staging AS s ON p.product_id = s.product_id
) AS s
ON (p.product_id = s.product_id)
WHEN MATCHED AND s.price IS NULL THEN DELETE
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = CURRENT_DATE,
  update_count = update_count + 1
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

Easy eh?

Hm, not so simple if you haven’t written MERGE statements all your life. If so, don’t be scared. As most of SQL, the scary part is syntax (keywords, UPPER CASE, etc.). The underlying concepts are simpler than they may seem at first. Let’s go through it step by step. It has 4 parts:

1. The target table

Just like with INSERT statements, we can define where we want to MERGE the data INTO. This is the simple part:

MERGE INTO prices AS p
-- ...

2. The source table

The USING keyword wraps a source table that we want to merge. We could have just placed the staging table here, but I wanted to enrich the source data with some additional data first. I’m using a FULL JOIN to produce the matching between old data (prices) and new data (staging). If, after filling the staging table for the second time, but before running the MERGE statement, we ran the USING clause alone (with some minor modifications for illustration):

SELECT 
  COALESCE(p.product_id, s.product_id) AS product_id, 
  p.price AS old_price, 
  s.price AS new_price
FROM prices AS p
FULL JOIN staging AS s ON p.product_id = s.product_id
ORDER BY product_id

Then we would get this:

PRODUCT_ID|OLD_PRICE|NEW_PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- same price
         2|   125.00|    99.00| <-- updated price
         3|   150.00|         | <-- deleted price
         4|         |   300.00| <-- added price

Neat!

3. The ON clause

Next, we RIGHT JOIN the target table and source tables using an ON clause, just like with an ordinary JOIN:

ON (p.product_id = s.product_id)

MERGE always uses RIGHT JOIN semantics, which is why I placed a FULL JOIN in the source table, the USING clause. It is totally possible to write things a bit differently, such that we can avoid accessing the prices table twice, but I want to show the full power of this statement. Notice, SQL Server joins source and target tables using a FULL JOIN, as I’ll explain further down. I’ll also explain why RIGHT JOIN, right away.

4. The WHEN clauses

Now comes the interesting part! There can either be a match between the two tables (target and source), like when we get the result of an INNER JOIN, or no such match because the source table contains a row that is not matched by the target table (the RIGHT JOIN semantics). In our example, PRODUCT_ID IN (1, 2, 3) will produce a match (contained in both source and target tables), whereas PRODUCT_ID = 4 will not produce a match (not contained in the target table yet).

Colouring our source data set:

PRODUCT_ID|OLD_PRICE|NEW_PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- matched (but ignored)
         2|   125.00|    99.00| <-- matched
         3|   150.00|         | <-- matched
         4|         |   300.00| <-- not matched

The following are a sequence of matching instructions that will be executed in the order of appearance, for each row resulting from the previous RIGHT JOIN:

-- With my FULL JOIN, I've produced NULL price values
-- whenever a PRODUCT_ID is in the target table, but not
-- in the source table. These rows, we want to DELETE
WHEN MATCHED AND s.price IS NULL THEN DELETE

-- When there is a price change (and only then), we 
-- want to update the price information in the target table.
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = CURRENT_DATE,
  update_count = update_count + 1

-- Finally, when we don't have a match, i.e. a row is
-- in the source table, but not in the target table, then
-- we simply insert it into the target table.
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

It’s not too complicated, just a lot of keywords and syntax. So, after running this MERGE on the second set of data of the staging table, we’ll get this result in the price table:

PRODUCT_ID|PRICE |PRICE_DATE|UPDATE_COUNT|
----------|------|----------|------------|
         1|100.00|2020-04-09|           0|
         2| 99.00|2020-04-09|           1|
         4|300.00|2020-04-09|           0|

The way I expressed this MERGE statement, it is idempotent, i.e. I can run it again on the same staging table content, and it won’t modify any data in the price table – because none of the WHEN clauses applies. Idempotence is not a property of MERGE, I just wrote my statement this way.

Dialect specifics

A few dialects support MERGE. Among the ones that jOOQ 3.13 supports, there are at least:

  • Db2
  • Derby
  • Firebird
  • H2
  • HSQLDB
  • Oracle
  • SQL Server
  • Sybase SQL Anywhere
  • Teradata
  • Vertica

For once, regrettably, this list does not include PostgreSQL. But even the dialects in this list do not all agree on what MERGE really is. The SQL standard specifies 3 features, each one optional:

  • F312 MERGE statement
  • F313 Enhanced MERGE statement
  • F314 MERGE statement with DELETE branch

But instead of looking at the standards and what they require, let’s look at what the dialects offer, and how it can be emulated if something is not available.

The AND clause

As you may have noticed, this post uses the syntax:

WHEN MATCHED AND <some predicate> THEN

It’s also possible to specify

WHEN NOT MATCHED AND <some predicate> THEN

With the exception of Teradata, most dialects support these AND clauses (Oracle has a specific syntax using WHERE for this, which I’ll get to later).

The point of these clauses is to be able to have several of these WHEN MATCHED or WHEN NOT MATCHED clauses, in fact an arbitrary number of them. Unfortunately, this isn’t supported by all dialects. Some dialects only support one clause of each type (INSERT, UPDATE, DELETE). It isn’t strictly necessary to support several clauses, but it is a lot more convenient as we’ll see below.

These dialects do not support multiple WHEN MATCHED or WHEN NOT MATCHED clauses:

  • HSQLDB
  • Oracle
  • SQL Server
  • Teradata

If a dialect does not support AND, or if it does not support multiple WHEN MATCHED clauses, simply translate the clauses to case expressions. Instead of our previous WHEN clauses, we’d get:

-- The DELETE clause doesn't make much sense without AND,
-- So there's not much we can do about this emulation in Teradata.
WHEN MATCHED AND s.price IS NULL THEN DELETE

-- Repeat the AND clause in every branch of the CASE
-- Expression where it applies
WHEN MATCHED THEN UPDATE SET 
  price = CASE

    -- Update the price if the AND clause applies
    WHEN p.price != s.price THEN s.price,

    -- Otherwise, leave it untouched
    ELSE p.price
  END

  -- Repeat for all columns
  price_date = CASE
    WHEN p.price != s.price THEN CURRENT_DATE
    ELSE p.price_date
  END,
  update_count = CASE
    WHEN p.price != s.price THEN update_count + 1
    ELSE p.update_count
  END

-- Unchanged, in this case
WHEN NOT MATCHED THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, CURRENT_DATE, 0);

The formalism is this one:

If there is no AND, add AND

These are the same:

WHEN MATCHED THEN [ UPDATE | DELETE ]
WHEN MATCHED AND 1 = 1 THEN [ UPDATE | DELETE ]

This replacement may be needed in Firebird (which has a bug in this area) and SQL Server (which does not allow for WHEN MATCHED clauses after a WHEN MATCHED clause without AND clause, which is kind of a linting error). Instead of emulating stuff, you can just skip all the subsequent WHEN MATCHED branches, because they will not apply. Every row is updated only once, i.e. by only one WHEN clause:

Every row is updated only once

Make sure no row is updated more than once in the emulation, as required by the standard. When writing this:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND p2 THEN DELETE
WHEN MATCHED AND p3 THEN UPDATE SET c1 = 3, c2 = 3
WHEN MATCHED AND p4 THEN DELETE

This really means the same as:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND NOT p1 AND p2 THEN DELETE
WHEN MATCHED AND NOT p1 AND NOT p2 AND p3 THEN UPDATE SET c1 = 3,c2 = 3
WHEN MATCHED AND NOT p1 AND NOT p2 AND NOT p3 AND p4 THEN DELETE

To emulate the above, write this instead:

WHEN MATCHED AND 
  p1 OR
  NOT p1 AND NOT p2 AND p3
THEN UPDATE SET 
  c1 = CASE 
    WHEN p1                       THEN 1
    WHEN NOT p1 AND NOT p2 AND p3 THEN 3
                                  ELSE c1
  END,
  c2 = CASE
    WHEN NOT p1 AND NOT p2 AND p3 THEN 3
                                  ELSE c2
  END
WHEN MATCHED AND 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4
THEN DELETE

Quite laborious, but that’s how it is.

H2 and HSQLDB

Notice that both H2 and HSQLDB do not follow the “every row is updated only once” rule. I’ve reported this to H2 already: https://github.com/h2database/h2database/issues/2552. If you want to be standards compliant (jOOQ 3.14 will emulate this for you, don’t worry), then you have to do the above CASE expression madness in these dialects, or, in H2 (HSQLDB doesn’t support multiple WHEN MATCHED clauses of the same type) enhance all the WHEN MATCHED AND clauses as I’ve illustrated before:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND NOT p1 AND p2 THEN DELETE
WHEN MATCHED AND NOT p1 AND NOT p2 AND p3 THEN UPDATE SET c1 = 3,c2 = 3
WHEN MATCHED AND NOT p1 AND NOT p2 AND NOT p3 AND p4 THEN DELETE

Oracle

Oracle doesn’t support AND here, but some interesting vendor specific syntax. It looks reasonable at first, but it’s really funky.

  • After UPDATE, you can add a WHERE clause, which is the same thing as AND. So far so good.
  • You can also add a DELETE WHERE clause, but only together with an UPDATE. So you cannot DELETE without updating. Fine, we weren’t going to, in our example.
  • However, the interesting thing is that the UPDATE / DELETE command is executed together, and DELETE happens after UPDATE. So the same row gets processed twice. If you use WHERE in UPDATE, only rows included in the UPDATE can also be included in DELETE. I mean, why would you update the rows first, prior to deletion?

This means that our standard clauses:

WHEN MATCHED AND p1 THEN UPDATE SET c1 = 1
WHEN MATCHED AND p2 THEN DELETE
WHEN MATCHED AND p3 THEN UPDATE SET c1 = 3, c2 = 3
WHEN MATCHED AND p4 THEN DELETE

Would need to be emulated like this:

WHEN MATCHED 
THEN UPDATE SET 
  c1 = CASE 
    WHEN p1 THEN 1  -- Normal update for WHEN MATCHED AND p1 clause
    WHEN p2 THEN c1 -- "Touch" record for later deletion
    WHEN p3 THEN 3  -- Normal update for WHEN MATCHED AND p3 clause
    WHEN p4 THEN c1 -- "Touch" record for later deletion
            ELSE c1
  END,
  c2 = CASE
    WHEN p1 THEN c2 -- p1 is not affecting c2
    WHEN p2 THEN c2 -- "Touch" record for later deletion
    WHEN p3 THEN 3  -- Normal update for WHEN MATCHED AND p3 clause
    WHEN p4 THEN c2 -- "Touch" record for later deletion
            ELSE c2
  END

-- Any predicate from any AND clause, regardless if UPDATE or DELETE
WHERE p1 OR p2 OR p3 OR p4

-- Repeat the predicates required for deletion
DELETE WHERE 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4

It was just a simple MERGE statement in standard SQL syntax!

There is an additional level of trickiness here, which I will not cover in this blog post (but we might handle it in jOOQ). In Oracle, the DELETE WHERE clause can already see the updates performed by the UPDATE clause. This means that if, for example, p2 depends on the value of c1:

  c1 = CASE 
    ...
    WHEN p2 THEN c1 -- "Touch" record for later deletion
    ...
  END,

Then the evaluation of p2 in DELETE WHERE will be affected by this

DELETE WHERE 
  NOT p1 AND p2 OR
  NOT p1 AND NOT p2 AND NOT p3 AND p4

The c1 in these p2 expressions will not be the same c1 as the c1 in the UPDATE clause. It’s obviously possible to manage this as well to some extent, through variable substitution.

SQL Server BY SOURCE and BY TARGET

SQL Server has a useful extension to the WHEN NOT MATCHED clause, which I think belongs in the SQL standard!

With this extension, you can specify whether you want to perform INSERT actions WHEN NOT MATCHED [ BY TARGET ] (the default that everyone else supports as well), or WHEN NOT MATCHED BY SOURCE (in case of which you can perform another UPDATE or DELETE action.

The BY TARGET clause means that we found a row in the source table but not in the target table. The BY SOURCE clause means that we found a row in the target table but not in the source table. This means that in SQL Server, the target and source tables are FULL OUTER JOINed, not RIGHT OUTER JOINed, which would mean that our original statement can be greatly simplified:

MERGE INTO prices AS p
USING staging AS s
ON (p.product_id = s.product_id)
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN MATCHED AND p.price != s.price THEN UPDATE SET 
  price = s.price,
  price_date = getdate(),
  update_count = update_count + 1
WHEN NOT MATCHED BY TARGET THEN INSERT 
  (product_id, price, price_date, update_count)
VALUES 
  (s.product_id, s.price, getdate(), 0);

We could colour again the rows encountered here:

PRODUCT_ID|  P.PRICE|  S.PRICE|
----------|---------|---------|
         1|   100.00|   100.00| <-- matched (but ignored)
         2|   125.00|    99.00| <-- matched
         3|   150.00|         | <-- not matched by source
         4|         |   300.00| <-- not matched by target

As can be seen, this is really just how a FULL OUTER JOIN works.

The emulation of these clauses back into standard SQL are laborious too, as we’d have to emulate this FULL OUTER JOIN explicitly. I think it’s possible, but we might not implement it in jOOQ.

Vertica

Only Vertica seems to not support the DELETE branch, meaning you cannot use a MERGE statement to DELETE data from your target table. You can use it only to INSERT or UPDATE data, which is good enough in almost all cases. Curiously, Teradata supports DELETE, but not AND, which seems kinda pointless, as DELETE and UPDATE cannot be combined this way.

Conclusion

The MERGE statement is a device whose mystery is only exceeded by its power. In a simple form (no AND or WHERE clauses, no DELETE clauses), all dialects pretty much agree, and that’s already a very useful feature set, which jOOQ has supported for a long time. Starting from jOOQ 3.14, we’re tackling also all the other features listed in this article to help you write complex, vendor agnostic MERGE statements and emulate them on all the dialects that have some MERGE support.

Want to play around with it already now? Check out our free online SQL translation tool.


Viewing all articles
Browse latest Browse all 426

Trending Articles