What’s a good natural key?
This is a very difficult question for most entities when you design your schema. In some rare cases, there seems to be an “obvious” candidate, such as the codes defined by ISO standards like ISO 3166 (countries), ISO 639 (languages), or ISO 4217 (currencies).
But even in those cases, there might be exceptions and the worst thing that can happen is a key change. Most database designs play it safe and use surrogate keys instead. Nothing wrong with that. But…
Relationship tables
There is one exception where a surrogate key is never really required. Those are relationship tables. For example, in the Sakila database, all relationship tables lack a surrogate key and use their respective foreign keys as a compound “natural” primary key instead:
So, the FILM_ACTOR table, for example, is defined as such:
CREATE TABLE film_actor (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,
  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);
There is really no point in adding another column FILM_ACTOR_ID or ID for an individual row in this table, even if a lot of ORMs and non-ORM-defined schemas will do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).
Now, the presence or absence of such a surrogate key is usually not too relevant in everyday work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t. You just never use that additional column.
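To illustrate, here is a typical query through the relationship table. It is only a sketch, assuming the standard Sakila columns (ACTOR.FIRST_NAME, ACTOR.LAST_NAME, FILM.TITLE) and an arbitrary actor ID. Notice that it would read exactly the same whether or not FILM_ACTOR had an extra ID column:

```sql
-- List the films of one actor by navigating the relationship table.
-- Only the two foreign key columns of FILM_ACTOR are ever referenced.
SELECT a.first_name, a.last_name, f.title
FROM actor AS a
JOIN film_actor AS fa ON fa.actor_id = a.actor_id
JOIN film AS f ON f.film_id = fa.film_id
WHERE a.actor_id = 1  -- arbitrary example value
ORDER BY f.title;
```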
But in terms of performance, it might make a huge difference!
Clustered indexes
In many RDBMS, when creating a table, you get to choose whether to use a “clustered index” or a “non clustered index” table layout. The main difference is:
Clustered index
… is a primary key index that “clusters” together the data that belongs together. In other words:
- All the index column values are contained in the index tree structure
- All the other column values are contained in the index leaf nodes
The benefit of this table layout is that primary key lookups can be much faster because your entire row is located in the index, which requires less disk I/O than the non clustered index for primary key lookups. The price for this is slower secondary index searches (e.g. searching for last names). The algorithmic complexities are:
- O(log N) for primary key lookups
- O(log N) for secondary key lookups plus O(M log N) for projections of non-secondary-key columns (quite a high price to pay)
… where
- N is the size of the table
- M is the number of rows that are searched in secondary keys
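The following sketch shows where these costs come from in MySQL’s InnoDB. The CUSTOMER_CLUSTERED table, its columns, and the searched values are hypothetical and only serve as an illustration:

```sql
-- Hypothetical table: InnoDB stores the rows inside the primary key index.
CREATE TABLE customer_clustered (
  customer_id int NOT NULL,
  first_name varchar(50) NOT NULL,
  last_name varchar(50) NOT NULL,
  PRIMARY KEY (customer_id),
  KEY customer_clustered_last_name (last_name)
) ENGINE = InnoDB;

-- Primary key lookup: one descent of the clustered index finds the whole row.
SELECT first_name, last_name
FROM customer_clustered
WHERE customer_id = 42;

-- Secondary key search: the LAST_NAME index only yields primary key values,
-- so each of the M matches needs another descent of the clustered index
-- to fetch FIRST_NAME.
SELECT first_name, last_name
FROM customer_clustered
WHERE last_name = 'Smith';

-- No extra descent is needed here: InnoDB secondary indexes contain the
-- primary key columns, so this query is covered by the LAST_NAME index.
SELECT customer_id
FROM customer_clustered
WHERE last_name = 'Smith';
```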
OLTP usage often profits from clustered indexes.
Non clustered index
… is a primary key index that resides “outside” of the table structure, which is a heap table. In other words:
- All the index column values are contained in the index tree structure
- All the index column values and other column values are contained in the heap table
The benefit of this table layout is that all lookups are equally fast, regardless of whether you’re using a primary key lookup or a secondary key search. There’s always an additional, constant time heap table lookup. The algorithmic complexities are:
- O(log N) for primary key lookups plus O(1) for projections of non-primary-key columns (a moderate price to pay)
- O(log N) for secondary key lookups plus O(M) for projections of non-secondary-key columns (a moderate price to pay)
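As a sketch, this is what a heap table could look like in MySQL when using MyISAM. The CUSTOMER_HEAP table, its columns, and the searched values are hypothetical:

```sql
-- Hypothetical table: MyISAM keeps the rows in a heap, and both indexes
-- store pointers into that heap.
CREATE TABLE customer_heap (
  customer_id int NOT NULL,
  first_name varchar(50) NOT NULL,
  last_name varchar(50) NOT NULL,
  PRIMARY KEY (customer_id),
  KEY customer_heap_last_name (last_name)
) ENGINE = MyISAM;

-- Primary key lookup: index descent, then one jump into the heap.
SELECT first_name, last_name
FROM customer_heap
WHERE customer_id = 42;

-- Secondary key search: index descent, then one jump into the heap
-- per matching row. The primary key index is not involved at all.
SELECT first_name, last_name
FROM customer_heap
WHERE last_name = 'Smith';
```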
OLAP usage definitely profits from heap tables.
Defaults
- MySQL’s InnoDB offers clustered indexes only.
- MySQL’s MyISAM offers heap tables only.
- Oracle offers both and defaults to heap tables.
- PostgreSQL offers heap tables only (its CLUSTER command reorders a table once, but the order is not maintained afterwards).
- SQL Server offers both and defaults to clustered indexes.
Note that Oracle calls clustered indexes “index organised tables”
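For illustration, this is roughly how you would opt out of the respective default. The table and constraint names are made up, the foreign keys are left out for brevity, and each statement is meant for the RDBMS named in its comment only:

```sql
-- Oracle: an index organised table, clustered by its primary key
CREATE TABLE film_actor_iot (
  actor_id NUMBER(10) NOT NULL,
  film_id NUMBER(10) NOT NULL,
  CONSTRAINT film_actor_iot_pk PRIMARY KEY (actor_id, film_id)
) ORGANIZATION INDEX;

-- SQL Server: a heap table with a non clustered primary key index
CREATE TABLE film_actor_heap (
  actor_id INT NOT NULL,
  film_id INT NOT NULL,
  CONSTRAINT film_actor_heap_pk PRIMARY KEY NONCLUSTERED (actor_id, film_id)
);

-- MySQL: the layout follows from the storage engine
CREATE TABLE film_actor_myisam (
  actor_id INT NOT NULL,
  film_id INT NOT NULL,
  PRIMARY KEY (actor_id, film_id)
) ENGINE = MyISAM;
```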
Performance
With the algorithmic complexities above, we can easily guess what I’m trying to hint at here. In the presence of a clustered index, we should avoid expensive secondary key searches when possible. Of course, these searches cannot always be avoided, but if we review the alternative design of these two tables:
CREATE TABLE film_actor_surrogate (
  id int NOT NULL,
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,
  CONSTRAINT film_actor_surrogate_pkey PRIMARY KEY (id)
);

CREATE TABLE film_actor_natural (
  actor_id int NOT NULL REFERENCES actor,
  film_id int NOT NULL REFERENCES film,
  CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id)
);
… we can see that if we’re using a clustered index here, the clustering will be made based on either:
- FILM_ACTOR_SURROGATE.ID, which is a very useless clustering
- (FILM_ACTOR_NATURAL.ACTOR_ID, FILM_ACTOR_NATURAL.FILM_ID), which is a very useful clustering
In the latter case, whenever we look up an actor’s films, we can use the clustering index as a covering index, regardless of whether we project anything additional from that table or not.
In the former case, we have to rely on an additional secondary key index that contains (ACTOR_ID, FILM_ID), and chances are that secondary index is not covering if we have additional projections.
The surrogate key clustering is really useless, because we never use the table this way.
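If you want to see this for yourself, the following sketch compares the two lookups. It assumes the two tables above were created in MySQL with InnoDB; the index name and the actor ID are made up:

```sql
-- The surrogate design needs this extra secondary index to serve
-- lookups by actor at all (and to prevent duplicate pairs).
CREATE UNIQUE INDEX film_actor_surrogate_uk
  ON film_actor_surrogate (actor_id, film_id);

-- Natural key design: the lookup runs directly on the clustered primary key.
EXPLAIN
SELECT film_id
FROM film_actor_natural
WHERE actor_id = 1;

-- Surrogate key design: the lookup has to go through the secondary index.
EXPLAIN
SELECT film_id
FROM film_actor_surrogate
WHERE actor_id = 1;
```

The key and Extra columns of the EXPLAIN output tell you which index each query uses. These two tables have no additional columns, so the difference only becomes expensive once a query projects columns that are not part of the secondary index.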
Does it matter?
We can easily design a benchmark for this case. You can find the complete benchmark code here on GitHub, to validate the results on your environment. The benchmark uses this database design:
create table parent_1 (id int not null primary key);
create table parent_2 (id int not null primary key);

create table child_surrogate (
  id int auto_increment,
  parent_1_id int not null references parent_1,
  parent_2_id int not null references parent_2,
  payload_1 int,
  payload_2 int,
  primary key (id),
  unique (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;

create table child_natural (
  parent_1_id int not null references parent_1,
  parent_2_id int not null references parent_2,
  payload_1 int,
  payload_2 int,
  primary key (parent_1_id, parent_2_id)
) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */
;
Unlike in the Sakila database, we’re now adding some “payload” to the relationship table, which is not unlikely. Recent versions of MySQL will default to InnoDB, which only supports a clustered index layout. You can uncomment the ENGINE storage clause to see how this would perform with MyISAM, which only supports heap tables.
The benchmark adds:
- 10 000 rows in PARENT_1
- 100 rows in PARENT_2
- 1 000 000 rows in both CHILD tables (just a cross join of the above)
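One possible way to generate this data is sketched below. This is not necessarily how the benchmark on GitHub does it: the script assumes MySQL 8.0 or later (for recursive common table expressions), and the constant payload values are an arbitrary choice:

```sql
-- The default recursion limit is 1000, which is too low for 10 000 rows.
SET SESSION cte_max_recursion_depth = 20000;

-- 10 000 rows in PARENT_1
INSERT INTO parent_1 (id)
WITH RECURSIVE seq (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 10000
)
SELECT n FROM seq;

-- 100 rows in PARENT_2
INSERT INTO parent_2 (id)
WITH RECURSIVE seq (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 100
)
SELECT n FROM seq;

-- 1 000 000 rows in both CHILD tables: the cross join of the two parents.
-- The payload values (1 and 2) are placeholders.
INSERT INTO child_surrogate (parent_1_id, parent_2_id, payload_1, payload_2)
SELECT p1.id, p2.id, 1, 2
FROM parent_1 AS p1
CROSS JOIN parent_2 AS p2;

INSERT INTO child_natural (parent_1_id, parent_2_id, payload_1, payload_2)
SELECT p1.id, p2.id, 1, 2
FROM parent_1 AS p1
CROSS JOIN parent_2 AS p2;
```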
And then, it runs 5 iterations of 10000 repetitions of the following two queries, following our standard SQL benchmark technique:
-- Query 1
SELECT c.payload_1 + c.payload_2 AS a
FROM parent_1 AS p1
JOIN child_surrogate AS c ON p1.id = c.parent_1_id
WHERE p1.id = 4;

-- Query 2
SELECT c.payload_1 + c.payload_2 AS a
FROM parent_1 AS p1
JOIN child_natural AS c ON p1.id = c.parent_1_id
WHERE p1.id = 4;
Notice that MySQL does not implement join elimination; otherwise, the useless join to PARENT_1 would be eliminated. The benchmark results are very clear:
Using InnoDB (clustered indexes)
Run 0, Statement 1 : 3104
Run 0, Statement 2 : 1910
Run 1, Statement 1 : 3097
Run 1, Statement 2 : 1905
Run 2, Statement 1 : 3045
Run 2, Statement 2 : 2276
Run 3, Statement 1 : 3589
Run 3, Statement 2 : 1910
Run 4, Statement 1 : 2961
Run 4, Statement 2 : 1897
Using MyISAM (heap tables)
Run 0, Statement 1 : 3473
Run 0, Statement 2 : 3288
Run 1, Statement 1 : 3328
Run 1, Statement 2 : 3341
Run 2, Statement 1 : 3674
Run 2, Statement 2 : 3307
Run 3, Statement 1 : 3373
Run 3, Statement 2 : 3275
Run 4, Statement 1 : 3298
Run 4, Statement 2 : 3322
You shouldn’t read this as a comparison between InnoDB and MyISAM in general, but as a comparison of the different table structures within the boundaries of the same engine. Very obviously, the additional search complexity of the badly clustered index in CHILD_SURROGATE makes this type of query roughly 60% slower in most runs (about 3100 vs. 1900), without gaining anything.
In the case of the heap table, the additional surrogate key column did not have any significant effect.
Again, the full benchmark can be found here on GitHub, if you want to repeat it.
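To give a rough idea of what such a benchmark loop can look like, here is a minimal sketch as a MySQL stored procedure. It is not the benchmark code from GitHub: the procedure and variable names are made up, the query is wrapped in SUM() so that its result can be assigned to a variable, and the elapsed time is reported in milliseconds, which may not be the unit used above:

```sql
DELIMITER //  -- mysql command line client syntax

CREATE PROCEDURE benchmark_sketch()
BEGIN
  DECLARE v_run INT DEFAULT 0;
  DECLARE v_rep INT;
  DECLARE v_start DATETIME(6);
  DECLARE v_sink BIGINT;

  WHILE v_run < 5 DO
    -- Statement 1: surrogate key table
    SET v_rep = 0;
    SET v_start = SYSDATE(6);
    WHILE v_rep < 10000 DO
      SELECT SUM(c.payload_1 + c.payload_2) INTO v_sink
      FROM parent_1 AS p1
      JOIN child_surrogate AS c ON p1.id = c.parent_1_id
      WHERE p1.id = 4;
      SET v_rep = v_rep + 1;
    END WHILE;
    SELECT v_run AS run_no, 1 AS statement_no,
           TIMESTAMPDIFF(MICROSECOND, v_start, SYSDATE(6)) / 1000 AS elapsed_ms;

    -- Statement 2: natural key table
    SET v_rep = 0;
    SET v_start = SYSDATE(6);
    WHILE v_rep < 10000 DO
      SELECT SUM(c.payload_1 + c.payload_2) INTO v_sink
      FROM parent_1 AS p1
      JOIN child_natural AS c ON p1.id = c.parent_1_id
      WHERE p1.id = 4;
      SET v_rep = v_rep + 1;
    END WHILE;
    SELECT v_run AS run_no, 2 AS statement_no,
           TIMESTAMPDIFF(MICROSECOND, v_start, SYSDATE(6)) / 1000 AS elapsed_ms;

    SET v_run = v_run + 1;
  END WHILE;
END //

DELIMITER ;

CALL benchmark_sketch();
```

SYSDATE(6) is used rather than NOW(6) because it returns the time at which it is evaluated, not the time at which the surrounding statement started.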
Conclusion
Not everyone agrees on what is generally better: clustered or non clustered indexes. Not everyone agrees on the utility of surrogate keys on every table. These are both quite opinionated discussions.
But this article clearly showed that on relationship tables, which have a very clear candidate key, namely the set of outgoing foreign keys that defines the many-to-many relationship, the surrogate key not only doesn’t add value, but it actively hurts your performance on a set of queries when your table is using a clustered index.
MySQL’s InnoDB and SQL Server use clustered indexes by default, so if you’re using either of those RDBMS, do check if you have room for significant improvement by dropping the surrogate keys on your relationship tables.