We programmers keep cargo culting these wrong ideas. Recently, we said “NO” to Venn diagrams. Today we’re going to say no to surrogate keys.
The surrogate keys vs. natural keys non-debate is one of the most overheated debates in data architecture, and I don’t get why everyone is so emotional. Both sides claim to hold the ultimate truth (just like in the tabs vs. spaces non-debate), and in doing so, they distract from adding value where it really matters.
This article sheds some light on why you shouldn’t be so dogmatic all the time. By the end of the article, you’ll agree, promised.
What’s so great about surrogate keys?
In case your SQL-speak is a bit rusty, a surrogate key is an artificial identifier. Or, as Wikipedia puts it:
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.
There are three clear advantages that surrogate keys have:
- You know every table has one, and it probably has the same name everywhere: ID.
- It is relatively small (e.g. a BIGINT)
- You don’t have to go through the “hassle” of designing your table for 30 seconds longer, trying to find a natural key
There are more advantages (check out the Wikipedia article), and of course, there are disadvantages. Today, I’d like to talk about architectures where EVERY table has a surrogate key, even when it makes absolutely no sense.
When do surrogate keys make absolutely no sense?
I’m currently helping a customer improve performance on their queries that run against a log database. The log database essentially contains relevant information about sessions, and some transactions related to those sessions. (In case you’re jumping to the conclusion “hey, don’t use SQL for logs”: Yes, they also use Splunk for unstructured logs. But these here are structured, and they run a lot of SQL-style analytics against them.)
Here’s what you can imagine is stored in that sessions table (using Oracle):
CREATE TABLE sessions (

  -- Useful values
  sess         VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address   VARCHAR2(15 CHAR) NOT NULL,
  tracking     VARCHAR2(50 CHAR) NOT NULL,

  -- and much more info here
  ...
)
Except, though, that’s not how the schema was designed. It was designed like this:
CREATE TABLE sessions (

  -- Meaningless surrogate key
  id           NUMBER(18) NOT NULL PRIMARY KEY,

  -- Useful values
  sess         VARCHAR2(50 CHAR) NOT NULL,

  -- Meaningless foreign keys
  ip_id        NUMBER(18) NOT NULL,
  tracking_id  NUMBER(18) NOT NULL,
  ...

  FOREIGN KEY (ip_id)
    REFERENCES ip_addresses (ip_id),
  FOREIGN KEY (tracking_id)
    REFERENCES tracking (tracking_id),
  ...

  -- Non-primary UNIQUE key
  UNIQUE (sess)
)
So, this so-called “fact table” (it’s a star schema after all) contains nothing useful, only a set of surrogate keys that reference the interesting values, which are all located in other tables. For instance, if you want to count all sessions for a given IP address, you already need to run a JOIN:
SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
GROUP BY ip_address
After all, we need the join because what the user really likes to see is the IP address, not the surrogate key. Duh, right? Here’s the execution plan:
------------------------------------------------------
| Operation            | Name         | Rows  | Cost |
------------------------------------------------------
| SELECT STATEMENT     |              |  9999 |  108 |
|  HASH GROUP BY       |              |  9999 |  108 |
|   HASH JOIN          |              | 99999 |  104 |
|    TABLE ACCESS FULL | IP_ADDRESSES |  9999 |    9 |
|    TABLE ACCESS FULL | SESSIONS     | 99999 |   95 |
------------------------------------------------------
Perfect hash join with two full table scans. In my example database, I have 100k sessions and 10k IP addresses.
Obviously, there’s an index on the IP_ADDRESS column, because I want to be able to filter by it. Something meaningful, like:
SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address
Obviously, the plan is a bit better, because we’re returning less data. Here’s the result:
----------------------------------------------------------------
| Operation                      | Name         | Rows  | Cost |
----------------------------------------------------------------
| SELECT STATEMENT               |              |     1 |   99 |
|  HASH GROUP BY                 |              |     1 |   99 |
|   HASH JOIN                    |              |    25 |   98 |
|    TABLE ACCESS BY INDEX ROWID | IP_ADDRESSES |     1 |    3 |
|     INDEX RANGE SCAN           | I_IP         |     1 |    2 |
|    TABLE ACCESS FULL           | SESSIONS     | 99999 |   95 |
----------------------------------------------------------------
Interesting. We can now use our index statistics to estimate that our predicate will return only one row from the ip_addresses table. Yet, we still get a hash join with significant cost, for what now appears to be a rather trivial query.
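If you want to reproduce plans like the ones above, the following is a minimal sketch using Oracle’s EXPLAIN PLAN statement and the DBMS_XPLAN package. It assumes the surrogate key schema described in this article. The output of DBMS_XPLAN.DISPLAY contains more columns than the abbreviated plans printed here, and it shows optimizer estimates, not measured run times.

```sql
-- Store the optimizer's estimated plan for the filtered query
-- in the plan table (nothing is executed yet)
EXPLAIN PLAN FOR
SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address;

-- Render the most recently explained plan as text
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```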
What would the world look like without surrogate keys?
Easy. We no longer need the join every time we want something IP address related from the sessions table.
Our two queries become, trivially:
-- All counts
SELECT ip_address, count(*)
FROM sessions
GROUP BY ip_address

-- Filtered counts
SELECT ip_address, count(*)
FROM sessions
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address
The first query yields a simpler execution plan, with around the same cost estimate.
--------------------------------------------------
| Operation           | Name      | Rows  | Cost |
--------------------------------------------------
| SELECT STATEMENT    |           |   256 |  119 |
|  HASH GROUP BY      |           |   256 |  119 |
|   TABLE ACCESS FULL | SESSIONS2 | 99999 |  116 |
--------------------------------------------------
We don’t seem to gain that much here, but what happens with the filtered query?
------------------------------------------------
| Operation             | Name  | Rows  | Cost |
------------------------------------------------
| SELECT STATEMENT      |       |     1 |    4 |
|  SORT GROUP BY NOSORT |       |     1 |    4 |
|   INDEX RANGE SCAN    | I_IP2 |   391 |    4 |
------------------------------------------------
OMG! Where did our costs go? Huh, this seems to be extremely fast! Let’s benchmark!
SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM ip_addresses
      JOIN sessions USING (ip_id)
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Surrogate: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM sessions2
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Natural  : '
    || (SYSTIMESTAMP - v_ts));
END;
/
Surrogate: +000000000 00:00:03.453000000
Natural  : +000000000 00:00:01.227000000
The improvement is significant, and we don’t have a lot of data here. Now, when you think about it, it is kind of obvious, no? For the semantically exact same query, we either run a JOIN, or we don’t. And this is using very very little data. The actual customer database has hundreds of millions of sessions, where all these JOINs waste valuable resources for nothing but an artificial surrogate key which was introduced… because that’s how things were always done.
And, as always, don’t trust your execution plan costs. Measure. Benchmarking 100 iterations of the unfiltered query (the one that produced a hash join) yields:
Surrogate: +000000000 00:00:06.053000000
Natural  : +000000000 00:00:02.408000000
Still obvious, when you think of it.
Do note that we’re not denormalising yet. There’s still an IP_ADDRESSES table, but now it contains the business key as the primary key (the address), not the surrogate key. In rarer query use cases, we’ll still join the table in order to get IP address related info (such as country).
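A possible shape for that natural key schema is sketched below. Only SESSIONS2 and the index I_IP2 appear in the plans above. The table name ip_addresses2 and the country column are assumptions made for illustration, not the customer’s actual schema.

```sql
-- Hypothetical natural key variant of the IP address table:
-- the address itself is the primary key, no surrogate ip_id
CREATE TABLE ip_addresses2 (
  ip_address VARCHAR2(15 CHAR) NOT NULL PRIMARY KEY,
  country    VARCHAR2(50 CHAR)  -- assumed attribute
);

-- The sessions table stores the address directly and still
-- references the lookup table, so the model stays normalised
CREATE TABLE sessions2 (
  sess       VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address VARCHAR2(15 CHAR) NOT NULL,
  tracking   VARCHAR2(50 CHAR) NOT NULL,
  FOREIGN KEY (ip_address)
    REFERENCES ip_addresses2 (ip_address)
);

-- Index used by the filtered count query
CREATE INDEX i_ip2 ON sessions2 (ip_address);

-- The join is only needed when we want additional
-- attributes of the address, such as the country
SELECT s.ip_address, a.country, count(*)
FROM sessions2 s
JOIN ip_addresses2 a ON a.ip_address = s.ip_address
WHERE s.ip_address LIKE '192.168.%'
GROUP BY s.ip_address, a.country;
```

With this layout, the common case (filter or group by address) touches only sessions2, and the join is paid for only by the queries that really need the extra columns.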
Taking it to the extreme
The database at this customer site was designed by someone who appreciates purity, and I can certainly relate to that sometimes. But in this case, it clearly went wrong, because there were many of these queries that were actually filtering for a “natural key” (i.e. a business value), but needed to add yet another JOIN just to do that.
There were more of these cases, for instance:
- ISO 639 language codes
- ISIN security numbers
- Account numbers, which had a formal identifier from an external system
and many more. Each and every time, there were dozens of JOINs required just to get that additional mapping between surrogate key and natural key.
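To illustrate with the language code case, here is a minimal sketch in which the ISO 639-1 code itself is the key. The tables and columns (languages, documents, language_code, document_id) are hypothetical and only serve as an example.

```sql
-- Hypothetical lookup table keyed by the two-letter
-- ISO 639-1 code instead of a surrogate language_id
CREATE TABLE languages (
  language_code CHAR(2 CHAR)      NOT NULL PRIMARY KEY,
  name          VARCHAR2(50 CHAR) NOT NULL
);

-- Hypothetical referencing table: it keeps its own surrogate
-- key, but stores the language code as a natural foreign key
CREATE TABLE documents (
  document_id   NUMBER(18)   NOT NULL PRIMARY KEY,
  language_code CHAR(2 CHAR) NOT NULL,
  FOREIGN KEY (language_code)
    REFERENCES languages (language_code)
);

-- Filtering by language needs no JOIN, because the
-- business value is stored right in the table
SELECT count(*)
FROM documents
WHERE language_code = 'de';
```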
Sometimes, surrogate keys are nice. They do use a little less disk space. Consider the following query (credits go to Stack Overflow user WW.)
SELECT
  owner,
  table_name,
  TRUNC(SUM(bytes)/1024) kb,
  ROUND(ratio_to_report(SUM(bytes)) OVER() * 100) Percent
FROM (
  SELECT
    segment_name table_name,
    owner,
    bytes
  FROM dba_segments
  WHERE segment_type IN (
    'TABLE',
    'TABLE PARTITION',
    'TABLE SUBPARTITION'
  )
  UNION ALL
  SELECT
    i.table_name,
    i.owner,
    s.bytes
  FROM dba_indexes i, dba_segments s
  WHERE s.segment_name = i.index_name
  AND s.owner = i.owner
  AND s.segment_type IN (
    'INDEX',
    'INDEX PARTITION',
    'INDEX SUBPARTITION'
  )
)
WHERE owner = 'TEST'
AND table_name IN (
  'SESSIONS',
  'IP_ADDRESSES',
  'SESSIONS2'
)
GROUP BY table_name, owner
ORDER BY SUM(bytes) DESC;
This will show us the disk space used by each object:
TABLE_NAME                   KB    PERCENT
-------------------- ---------- ----------
SESSIONS2                 12288         58
SESSIONS                   8192         39
IP_ADDRESSES                768          4
Yes. We use a little more disk space, because now our primary key in the sessions table is a VARCHAR2(50) rather than a NUMBER(18). But disk space is extremely cheap, whereas wall clock time performance is essential. By removing just a little complexity, we’ve greatly increased performance already for a simple query.
Conclusion
Surrogate keys or natural keys? I say both. Surrogate keys do have advantages. You generally don’t want a 5 column composite primary key. And even less so, you don’t want a 5 column composite foreign key. In these cases, using a surrogate key can add value by removing complexity (and probably increasing performance again). But choose wisely. Every so often, blindly using surrogate keys will go very wrong, and you’ll pay for it dearly when it comes to querying your data.
Further reading:
- How Adding a UNIQUE Constraint on a OneToOne Relationship Helps Performance
- “What Java ORM do You Prefer, and Why?” – SQL of Course!
- Correlated Subqueries are Evil and Slow. Or are They?
