The SQL language has one great advantage over procedural, object oriented, and “ordinary” functional programming languages. The fact that it is truly declarative (i.e. a 4GL / fourth generation programming language) means that a sophisticated optimiser can easily transform one SQL expression into another, equivalent SQL expression, which might be faster to execute.
How does this work?
Expression Tree Equivalence
Here’s an introduction to equivalent expression trees from my SQL Masterclass: Let’s assume we want to evaluate the following expression:
A + (A - B)
Now in maths, it can be proven trivially that by the laws of associativity, the above expression is really the same as this one:
(A + A) - B
We didn’t really win anything yet, but we can equally trivially turn the above addition into a multiplication that can be proven to be exactly equivalent:
(2 * A) - B
Now, imagine that A
is an extremely “expensive” value, e.g. a value that we have to fetch from the disk, or worse, from the network. The cost of accessing A is thus very high and if we can avoid one access to A by the above transformation, then the resulting expression is much faster to evaluate than the original one, even if mathematically speaking, it does not matter at all.
That’s what optimisers do all the time. They transform expressions into equivalent expressions which are faster to execute. And they constantly adapt, so if the DBA chooses to move A from a remote server to the local server, thus reducing the cost of access to A, perhaps, suddenly, the original plan will be better again, because of the cost of multiplication (just an example).
A SQL Example
An equally trivial SQL example from my SQL Masterclass shows that it really doesn’t matter mathematically if we run this query:
SELECT first_name, last_name FROM customer WHERE first_name = 'JAMIE'
Or this one:
SELECT * FROM ( SELECT first_name, last_name FROM customer ) WHERE first_name = 'JAMIE'
With the SQL language, it may be a bit harder to see that these are exactly equivalent SQL statements, but if we translate the above queries to relational algebra, it may become more visible:
Selection before projection
… or WHERE
before SELECT
:
Projection then selection
… or SELECT
before WHERE
:
Don’t be fooled by relational algebra‘s term “selection”. It does not correspond to the SELECT
clause, but to the WHERE
clause!
We can prove (let’s leave the exercise to the reader), that both expressions are exactly equivalent, so optimisers can pick whichever they have a more efficient matching algorithm for:
- Ordinary row stores will probably apply the selection first, before projecting, as the expensive operation is accessing rows and if we can filter rows early (e.g. through indexes) then the whole query is less expensive)
- A column store might (just a hypothesis) apply the projection first, as the expensive operation is accessing the columns. We then might have to traverse the entire column anyway to filter out the rows
Let’s Talk About JOINs (and Their Elimination)
JOIN elimination is one of the most basic and yet most powerful SQL transformations, which is implemented by most modern databases in one way or another. It is so powerful, because we’re writing (potentially useless) JOINs all the time, when writing a bit more complex queries. See also our article about JOINs for a general overview.
Now, consider the following simplified schema, taken from the Sakila database:
CREATE TABLE address ( address_id INT NOT NULL, address VARCHAR(50) NOT NULL, CONSTRAINT pk_address PRIMARY KEY (address_id) ); CREATE TABLE customer ( customer_id INT NOT NULL, first_name VARCHAR(45) NOT NULL, last_name VARCHAR(45) NOT NULL, address_id INT NOT NULL, CONSTRAINT pk_customer PRIMARY KEY (customer_id), CONSTRAINT fk_customer_address FOREIGN KEY (address_id) REFERENCES address(address_id) );
Let’s ignore indexing and other useful features for this example.
INNER JOIN Elimination
The following query shows a common JOIN use-case, joining a parent table (ADDRESS
) to a child table (CUSTOMER
) in a to-one relationship:
SELECT c.* FROM customer c JOIN address a ON c.address_id = a.address_id
We intended to fetch all customers and their addresses. But observe: We project only columns from the CUSTOMER
table and we don’t have any predicates at all, specifically not predicates using the ADDRESS
table. So, we’re completely ignoring any contributions from the ADDRESS
table. We never really needed the JOIN in the first place!
And in fact, the optimiser can prove this too, because of the FOREIGN KEY
constraint on C.ADDRESS_ID
, which guarantees that every CUSTOMER
record has exactly one corresponding ADDRESS
record. The JOIN does not duplicate, nor remove any CUSTOMER
rows, so it is unneeded and thus eliminated (by some, not all databases, will list each database at the end of the article).
So, the database can rewrite the SQL statement to the following, equivalent SQL statement in the presence of said FOREIGN KEY:
SELECT * FROM customer c
Now, quite obviously, this query will be faster than the previous one, if the entire JOIN can be avoided, and thus the entire access to the ADDRESS
table. Neat, huh? Who would have thought that FOREIGN KEY
s can be so useful in terms of performance.
OUTER JOIN Elimination
A LEFT [ OUTER ] JOIN
will JOIN the right table to the left table but keep rows from the left table if there is no match (again, an explanation of joins can be seen here). When we apply LEFT JOIN
to the previous query…
SELECT c.* FROM customer c LEFT JOIN address a ON c.address_id = a.address_id
… then we’ll fetch all rows from CUSTOMER
regardless if that customer has any ADDRESS
. This is useful if the FOREIGN KEY
is optional (nullable), or completely absent, e.g. through:
ALTER TABLE customer DROP CONSTRAINT fk_customer_address
OUTER JOIN
is even easier to eliminate, as it doesn’t require any FOREIGN KEY
constraint for the database to prove that it is unneeded. A UNIQUE
constraint on the parent table (here: ADDRESS.ADDRESS_ID
) is sufficient to show that for every CUSTOMER
there can be at most one ADDRESS
, so the LEFT JOIN
won’t duplicate any CUSTOMER
rows (Unlike INNER JOIN
, OUTER JOIN
never remove rows).
Hence, the above query can again be rewritten to the more optimal:
SELECT * FROM customer c
OUTER JOIN Elimination with DISTINCT
Another interesting case of OUTER JOIN
elimination is the following one, which unfortunately didn’t work on Oracle for a customer of ours, recently, in a complex query that ran rogue. Let’s look at some other tables of the Sakila database, which expose a to-many relationship:
CREATE TABLE actor ( actor_id numeric NOT NULL , first_name VARCHAR(45) NOT NULL, last_name VARCHAR(45) NOT NULL, CONSTRAINT pk_actor PRIMARY KEY (actor_id) ); CREATE TABLE film ( film_id int NOT NULL, title VARCHAR(255) NOT NULL, CONSTRAINT pk_film PRIMARY KEY (film_id) ); CREATE TABLE film_actor ( actor_id INT NOT NULL, film_id INT NOT NULL, CONSTRAINT pk_film_actor PRIMARY KEY (actor_id, film_id), CONSTRAINT fk_film_actor_actor FOREIGN KEY (actor_id) REFERENCES actor (actor_id), CONSTRAINT fk_film_actor_film FOREIGN KEY (film_id) REFERENCES film (film_id) );
Now, consider this query:
SELECT DISTINCT first_name, last_name FROM actor a LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
We’re looking for all actors and their films, but then we project only distinct actors. Again, the JOIN
to FILM_ACTOR
doesn’t contribute anything to the result, but because we’re joining a to-many relationship here (from parent table ACTOR
to child table FILM_ACTOR
), the JOIN
is producing duplicate rows. Without DISTINCT
, we’d get something like:
FIRST_NAME LAST_NAME ---------------------- ... PENELOPE GUINESS PENELOPE GUINESS PENELOPE GUINESS NICK WAHLBERG NICK WAHLBERG ...
But thanks to the DISTINCT
keyword, the result is (provably) no different from the result of this much simpler query:
SELECT DISTINCT first_name, last_name FROM actor a
(Note, DISTINCT
cannot be eliminated, unless we already have a UNIQUE
constraint on (FIRST_NAME, LAST_NAME)
).
Why not Just Refactor the SQL Manually?
Of course, all of this shouldn’t be needed if our SQL were perfect. In the above trivial examples, the SQL can (and should) be re-written manually to improve quality. But note that:
- Developers make mistakes, and those mistakes may be very subtle when queries get more complex. I’ll show an example below.
- The presence of this feature actually encourages writing more complex SQL, especially when using reusable views. I’ll show another example below.
- Finally, I’ve previously advocated avoiding needless, mandatory work, like
SELECT *
. Such work is mandatory because the optimiser cannot prove its needlessness. In the case of theseJOINs
, the optimiser can prove the needlessness, so the work is no longer mandatory. It can be eliminated.
Here are some complex examples as promised, where this optimiser feature really shines:
Subtle Mistakes
Let’s consider the following query (in PostgreSQL syntax):
SELECT c.name, count(*) FROM actor a JOIN film_actor fa USING (actor_id) JOIN film f USING (film_id) JOIN film_category fc USING (film_id) JOIN category c USING (category_id) WHERE actor_id = 1 GROUP BY c.name ORDER BY count(*) DESC
What does it do? For ACTOR_ID = 1
(Penelope Guiness), we’re looking for all the different film categories she played in, and the number of films per category. This is easier to understand when we look at the result:
NAME COUNT Horror 3 Family 2 New 2 Classics 2 Games 2 Music 1 Sci-Fi 1 Animation 1 Sports 1 Children 1 Comedy 1 Documentary 1 Foreign 1
Now, can you spot the unneeded JOINs? In fact, we never needed ACTOR
, nor did we need FILM
SELECT c.name, count(*) FROM film_actor fa JOIN film_category fc USING (film_id) JOIN category c USING (category_id) WHERE actor_id = 1 GROUP BY c.name ORDER BY count(*) DESC
Cool, eh? The JOINs can be eliminated (again, in some databases, see below) and our “mistake” is no longer relevant to the query. The mistake could have also snuck (or sneaked?) in from a previous query version, which may have looked like this, projecting also the actor information and the list of films per category, in case of which the additional JOIN
are needed:
SELECT c.name, count(*), a.first_name, a.last_name, array_agg(f.title ORDER BY f.title) FROM actor a JOIN film_actor fa USING (actor_id) JOIN film f USING (film_id) JOIN film_category fc USING (film_id) JOIN category c USING (category_id) WHERE actor_id = 1 GROUP BY c.name, a.first_name, a.last_name ORDER BY count(*) DESC
The result being:
NAME COUNT FIRST_NAME LAST_NAME FILMS Horror 3 PENELOPE GUINESS {"ELEPHANT TROJAN","LADY STAGE","RULES HUMAN"} Family 2 PENELOPE GUINESS {"KING EVOLUTION","SPLASH GUMP"} New 2 PENELOPE GUINESS {"ANGELS LIFE","OKLAHOMA JUMANJI"} Classics 2 PENELOPE GUINESS {"COLOR PHILADELPHIA","WESTWARD SEABISCUIT"} Games 2 PENELOPE GUINESS {"BULWORTH COMMANDMENTS","HUMAN GRAFFITI"} Music 1 PENELOPE GUINESS {"WIZARD COLDBLOODED"} Sci-Fi 1 PENELOPE GUINESS {"CHEAPER CLYDE"} Animation 1 PENELOPE GUINESS {"ANACONDA CONFESSIONS"} Sports 1 PENELOPE GUINESS {"GLEAMING JAWBREAKER"} Children 1 PENELOPE GUINESS {"LANGUAGE COWBOY"} Comedy 1 PENELOPE GUINESS {"VERTIGO NORTHWEST"} Documentary 1 PENELOPE GUINESS {"ACADEMY DINOSAUR"} Foreign 1 PENELOPE GUINESS {"MULHOLLAND BEAST"}
As you can see, this optimisation can be very useful on your legacy SQL, because if we maintain a complex query, we might not always be able to see all the JOINs that are really needed.
Reusable Views
Sometimes, we simply add additional JOINs
for convenience, when building complex queries from simpler ones, e.g. by using views (which is a completely underrated RDBMS feature! You should all write more views).
Consider this view:
CREATE VIEW v_customer AS SELECT c.first_name, c.last_name, a.address, ci.city, co.country FROM customer c JOIN address a USING (address_id) JOIN city ci USING (city_id) JOIN country co USING (country_id)
It’s not unlikely that we will write a view like this, simply because we’re incredibly bored to constantly join all these tables all the time. Every time we do something with customers and addresses, we need the CITY
and COUNTRY
table as well.
From now on (with this view), we can simply select from the view and “forget” about how it came to be. Now, let’s consider we completely forget about the underlying table, because the view was so useful all the time. We could think about doing this:
SELECT first_name, last_name FROM v_customer
What do you think will happen? Exactly. JOIN
elimination. A view isn’t really anything special, just a “macro” of some stored SQL (beware of some databases, where this isn’t always the case, e.g. MySQL, which Bill Karwin was kind enough to hint me at). So the above statement will be transformed into:
SELECT first_name, last_name FROM ( SELECT c.first_name, c.last_name, a.address, ci.city, co.country FROM customer c JOIN address a USING (address_id) JOIN city ci USING (city_id) JOIN country co USING (country_id) ) v_customer
… which can be transformed into this (we don’t need all columns in the nested select):
SELECT first_name, last_name FROM ( SELECT c.first_name, c.last_name FROM customer c JOIN address a USING (address_id) JOIN city ci USING (city_id) JOIN country co USING (country_id) ) v_customer
… which can be transformed into this (JOINs can be eliminated):
SELECT first_name, last_name FROM ( SELECT c.first_name, c.last_name FROM customer c ) v_customer
… and finally (the subquery is not useful):
SELECT first_name, last_name FROM customer
The view is even very useful for this particular query, thanks to JOIN elimination!
Note, the SQL transformations exposed above are simply educational. Actual optimisers may perform transformations in an entirely differently, or in a different order. This is just to show what’s possible, and what kind of stuff is being done.
Cool, So Can My Database Do It?
Perhaps! Let’s look at the three different types of JOIN
elimination in the context of these databases:
- DB2 LUW 10.5
- MySQL 8.0.2
- Oracle 12.2.0.1
- PostgreSQL 9.6
- SQL Server 2014
INNER JOIN Elimination
Remember, this depends on the presence (and usefulness) of a FOREIGN KEY
constraint. The SQL statement we’re using here is:
SELECT first_name, last_name FROM customer c JOIN address a ON c.address_id = a.address_id
We’re hoping to get:
SELECT first_name, last_name FROM customer c
DB2 LUW
The following execution plan (created with Markus Winand’s cool utility) shows that this works in DB2, there’s no access to the ADDRESS
table:
Explain Plan | -----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 61 | 2 | FETCH CUSTOMER | 599 of 599 (100.00%) | 61 | 3 | IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) | 20 |
MySQL
MySQL 8, apart from finally introducing CTE and window functions (yay), has a lot of new optimiser features, read Morgan Tocker’s useful optimiser guide for details. Unfortunately, INNER JOIN
elimination is not implemented:
ID TABLE TYPE REF ROWS EXTRA 1 c ALL 599 1 a eq_ref sakila.c.address_id 1 Using index
Not only is the JOIN
executed, but it is executed using a nested loop with 599 index lookups, as MySQL still only supports NESTED LOOP JOINs, not HASH JOINs.
Bummer.
Oracle
No problem at all for Oracle:
------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 599 | 28752 | 5 (0)| 00:00:01 | | 1 | TABLE ACCESS FULL| CUSTOMER | 599 | 28752 | 5 (0)| 00:00:01 | ------------------------------------------------------------------------------
… the JOIN
is eliminated as expected.
PostgreSQL
Unfortunately, PostgreSQL cannot eliminate INNER JOIN:
Hash Join (cost=19.57..42.79 rows=599 width=13) Hash Cond: (c.address_id = a.address_id) -> Seq Scan on customer c (cost=0.00..14.99 rows=599 width=15) -> Hash (cost=12.03..12.03 rows=603 width=4) -> Seq Scan on address a (cost=0.00..12.03 rows=603 width=4)
Not as bad as in MySQL, though, as PostgreSQL chose to use a HASH JOIN to combine the two tables.
SQL Server
No problemo for SQL Server, the ADDRESS
table access is gone!
|--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))
Excellent. What about…
OUTER JOIN Elimination
This one is a bit easier to prove for the database, remember? We don’t rely on any FOREIGN KEY
anymore. A UNIQUE
key in the parent table is sufficient to eliminate an OUTER JOIN
. We can safely expect that if the INNER JOIN
could be eliminated (DB2, Oracle, SQL Server), then an OUTER JOIN
can be eliminated, too.
Here’s the query:
SELECT first_name, last_name FROM customer c LEFT JOIN address a ON c.address_id = a.address_id
And the outcome:
DB2 LUW
Good
Explain Plan | -----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 61 | 2 | FETCH CUSTOMER | 599 of 599 (100.00%) | 61 | 3 | IXSCAN IDX_CUSTOMER_FK_ADDRESS_ID | 599 of 599 (100.00%) | 20 |
MySQL
Still nope:
ID TABLE TYPE REF ROWS EXTRA 1 c ALL 599 1 a eq_ref sakila.c.address_id 1 Using index
Oracle
Perfect:
------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 599 | 28752 | 5 (0)| 00:00:01 | | 1 | TABLE ACCESS FULL| CUSTOMER | 599 | 28752 | 5 (0)| 00:00:01 | ------------------------------------------------------------------------------
PostgreSQL
Unlike INNER JOIN
elimination, this works. Great!
Seq Scan on customer c (cost=0.00..14.99 rows=599 width=13)
SQL Server
As expected, good:
|--Table Scan(OBJECT:([sakila].[dbo].[customer] AS [c]))
Finally…
OUTER JOIN Elimination with DISTINCT
Remember, the query here navigates a to-many relationship, producing duplicate records of the parent table, but then removes all those duplicates again by
- Ignoring contributions from the child table
- Removing duplicates with
DISTINCT
It’s easy to prove that this:
SELECT DISTINCT first_name, last_name FROM actor a LEFT JOIN film_actor fa ON a.actor_id = fa.actor_id
Is equivalent to this:
SELECT DISTINCT first_name, last_name FROM actor a
Remember also, this only works with OUTER JOIN
, not with INNER JOIN
, as the latter might remove rows, so we have to execute it to see if it does.
DB2 LUW
Cool, this actually works!
Explain Plan | ----------------------------------------------------------------------| ID | Operation | Rows | Cost | 1 | RETURN | | 20 | 2 | TBSCAN | 200 of 200 (100.00%) | 20 | 3 | SORT (UNIQUE) | 200 of 200 (100.00%) | 20 | 4 | TBSCAN ACTOR | 200 of 200 (100.00%) | 20 |
There’s no access to the FILM_ACTOR
table, nor to its indexes. Very nice.
MySQL
As this is a more sophisticated transformation than the previous ones, we don’t have high hopes here.
ID TABLE TYPE REF ROWS EXTRA 1 a ALL 200 Using temporary 1 fa ref sakila.a.actor_id 27 Using index; Distinct
This has become a rather expensive query, again because of the lack of HASH JOIN support in MySQL!
Oracle
I’m very surprised to see that Oracle doesn’t support this optimisation, we’re executing the full query:
--------------------------------------------------------------- | Id | Operation | Name | Rows | --------------------------------------------------------------- | 0 | SELECT STATEMENT | | 5462 | | 1 | HASH UNIQUE | | 5462 | | 2 | NESTED LOOPS OUTER| | 5462 | | 3 | TABLE ACCESS FULL| ACTOR | 200 | |* 4 | INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR | 27 | ---------------------------------------------------------------
Curiously, Oracle also chose a NESTED LOOP JOIN in this case, even if we could have loaded the entire index on the FILM_ACTOR
table into memory and then HASH JOINed it to ACTOR
. Note that the cardinality estimate of the resulting query is quite off, too, despite the DISTINCT
operation. This can lead to significant effects in an upstream query, which selects from this query (e.g. if stored as a view) – which is what happened to our customer.
PostgreSQL
PostgreSQL also doesn’t support this elimination, but at least gets cardinality estimates much more accurately and chooses a HASH JOIN operation:
HashAggregate (cost=193.53..194.81 rows=128 width=13) Group Key: a.first_name, a.last_name -> Hash Right Join (cost=6.50..166.22 rows=5462 width=13) Hash Cond: (fa.actor_id = a.actor_id) -> Seq Scan on film_actor fa (cost=0.00..84.62 rows=5462 width=2) -> Hash (cost=4.00..4.00 rows=200 width=17) -> Seq Scan on actor a (cost=0.00..4.00 rows=200 width=17)
SQL Server
The pleasant surprise came from SQL Server, which does support this optimisation too:
|--Sort(DISTINCT ORDER BY:([a].[first_name] ASC, [a].[last_name] ASC)) |--Table Scan(OBJECT:([sakila].[dbo].[actor] AS [a]))
As you can see, no access to any FILM_ACTOR
related objects!
Summary
Here’s a summary of what databases can eliminate:
Database | INNER JOIN: to-one |
OUTER JOIN: to-one |
OUTER JOIN DISTINCT: to-many |
---|---|---|---|
DB2 LUW 10.5 | Yep | Yep | Yep |
MySQL 8.0.2 | Nope | Nope | Nope |
Oracle 12.2.0.1 | Yep | Yep | Nope |
PostgreSQL 9.6 | Nope | Yep | Nope |
SQL Server 2014 | Yep | Yep | Yep |
Conclusion
JOIN
elimination is a very simple to understand, yet incredibly powerful feature that modern databases support to help developers build and maintain complex SQL queries without worrying too much about performance side effects.
It is possible because SQL is a 4GL (fourth-generation programming language), i.e. a declarative language whose parsed expression trees can “easily” be transformed into equivalent, simpler, and faster to execute expression trees. This work is constantly done by the optimiser behind the scenes, often without you even noticing (see example where unnecessary ACTOR
and FILM
tables were removed).
Currenly, DB2 and SQL Server are the leaders here with Oracle being a close runner-up, at least in my investigation. There’s hope for PostgreSQL, and a bit less hope for MySQL right now. There are tons of other interesting SQL transformations, which I’ll blog about in future blog posts, which may make the difference in other kinds of complex queries.
If you were intrigued by this sort of functionality, do have a look at my most recent SQL talk, which helps developers understand the real value of SQL in the context of the optimiser:
Filed under: sql Tagged: DB2, JOIN elimination, mysql, Optimisation, Oracle, PostgreSQL, sql, SQL Optimisation, SQL Server, SQL transformations
