Say NO to Excessive Use of Surrogate Keys if Performance Really Matters to You

We programmers keep cargo culting these wrong ideas. Recently, we said “NO” to Venn diagrams. Today we’re going to say no to surrogate keys.

The surrogate keys vs. natural keys non-debate is one of the most overheated debates in data architecture, and I don’t get why everyone is so emotional. Both sides claim to hold the ultimate truth (just like in the tabs vs. spaces non-debate) and prevent added value where it really matters.

This article sheds some light into why you shouldn’t be so dogmatic all the time. By the end of the article, you’ll agree, promised.

What’s so great about surrogate keys?

In case your SQL-speak is a bit rusty, a surrogate key is an artificial identifier. Or how Wikipedia puts it:

A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data, unlike a natural (or business) key which is derived from application data.

There are three clear advantages that surrogate keys have:

You know every table has one, and it probably has the same name everywhere: ID.
It is relatively small (e.g. a BIGINT)
You don’t have to go through the “hassle” of designing your table for 30 seconds longer, trying to find a natural key

There are more advantages (check out the wikipedia article), and of course, there are disadvantages. Today, I’d like to talk about architectures where EVERY table has a surrogate key, even when it makes absolutely no sense.

When do surrogate keys make absolutely no sense?

I’m currently helping a customer improve performance on their queries that run against a log database. The log database essentially contains relevant information about sessions, and some transactions related to those sessions. (In case you’re jumping to the conclusion “hey don’t use SQL for logs”: Yes, they also use Splunk for unstructured logs. But these here are structured, and they run a lot of SQL style analytics against them).

Here’s what you can imagine is stored in that sessions table (using Oracle):

CREATE TABLE sessions (

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL PRIMARY KEY,
  ip_address  VARCHAR2(15 CHAR) NOT NULL,
  tracking    VARCHAR2(50 CHAR) NOT NULL,

  -- and much more info here
  ...
)

Except, though, that’s not how the schema was designed. It was designed like this:

CREATE TABLE sessions (

  -- Meaningless surrogate key
  id NUMBER(18) NOT NULL PRIMARY KEY,

  -- Useful values
  sess        VARCHAR2(50 CHAR) NOT NULL,

  -- Meaningless foreign keys
  ip_id       NUMBER(18) NOT NULL,
  tracking_id NUMBER(18) NOT NULL,
  ...

  FOREIGN KEY (ip_id)
    REFERENCES ip_addresses (ip_id),
  FOREIGN KEY (tracking_id)
    REFERENCES tracking (tracking_id),
  ...

  -- Non-primary UNIQUE key
  UNIQUE (sess),
)

So, this so called “fact table” (it’s a star schema after all) contains nothing useful, only a set of surrogate keys that contain references to the interesting values, which are all located in other tables. For instance, if you want to count all session for a given IP address, you already need to run a JOIN:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
GROUP BY ip_address

After all, we need the join because what the user really likes to see is the IP address, not the surrogate key. Duh, right? Here’s the execution plan:

------------------------------------------------------
| Operation           | Name         | Rows  | Cost  |
------------------------------------------------------
| SELECT STATEMENT    |              |  9999 |   108 |
|  HASH GROUP BY      |              |  9999 |   108 |
|   HASH JOIN         |              | 99999 |   104 |
|    TABLE ACCESS FULL| IP_ADDRESSES |  9999 |     9 |
|    TABLE ACCESS FULL| SESSIONS     | 99999 |    95 |
------------------------------------------------------

Perfect hash join with two full table scans. In my example database, I have 100k sessions and 10k IP addresses.

Obviously, there’s an index on the IP_ADDRESS column, because I want to be able to filter by it. Something meaningful, like:

SELECT ip_address, count(*)
FROM ip_addresses
JOIN sessions USING (ip_id)
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

Obviously, the plan is a bit better, because we’re returning less data. Here’s there result:

----------------------------------------------------------------
| Operation                     | Name         | Rows  | Cost  |
----------------------------------------------------------------
| SELECT STATEMENT              |              |     1 |    99 |
|  HASH GROUP BY                |              |     1 |    99 |
|   HASH JOIN                   |              |    25 |    98 |
|    TABLE ACCESS BY INDEX ROWID| IP_ADDRESSES |     1 |     3 |
|     INDEX RANGE SCAN          | I_IP         |     1 |     2 |
|    TABLE ACCESS FULL          | SESSIONS     | 99999 |    95 |
----------------------------------------------------------------

Intersting. We can now use our index statistics to estimate that our predicate will return only one row from the ip_address table. Yet, we still get a hash join with significant cost, for what now appears to be a rather trivial query.

What would the world look like without surrogate keys?

Easy. We no longer need the join, every time we need something IP address related from the sessions table.

Our two queries become, trivially:

-- All counts
SELECT ip_address, count(*)
FROM sessions
GROUP BY ip_address

-- Filtered counts
SELECT ip_address, count(*)
FROM sessions USING
WHERE ip_address LIKE '192.168.%'
GROUP BY ip_address

The first query yields a simpler execution plan, with around the same cost estimate.

--------------------------------------------------
| Operation          | Name      | Rows  | Cost  |
--------------------------------------------------
| SELECT STATEMENT   |           |   256 |   119 |
|  HASH GROUP BY     |           |   256 |   119 |
|   TABLE ACCESS FULL| SESSIONS2 | 99999 |   116 |
--------------------------------------------------

We don’t seem to gain that much here, but what happens with the filtered query?

------------------------------------------------
| Operation            | Name  | Rows  | Cost  |
------------------------------------------------
| SELECT STATEMENT     |       |     1 |     4 |
|  SORT GROUP BY NOSORT|       |     1 |     4 |
|   INDEX RANGE SCAN   | I_IP2 |   391 |     4 |
------------------------------------------------

OMG! Where did our costs go? Huh, this seems to be extremely fast! Let’s benchmark!

SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM ip_addresses
      JOIN sessions USING (ip_id)
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Surrogate: '
    || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      SELECT ip_address, count(*)
      FROM sessions2
      WHERE ip_address LIKE '192.168.%'
      GROUP BY ip_address
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Natural  : '
    || (SYSTIMESTAMP - v_ts));
END;
/

Surrogate: +000000000 00:00:03.453000000
Natural  : +000000000 00:00:01.227000000

The improvement is significant, and we don’t have a lot of data here. Now, when you think about it, it is kind of obvious, no? For the semantically exact same query, we either run a JOIN, or we don’t. And this is using very very little data. The actual customer database has hundreds of millions of sessions, where all these JOINs waste valuable resources for nothing but an artificial surrogate key which was introduced… because that’s how things were always done.

And, as always, don’t trust your execution plan costs. Measure. Benchmarking 100 iterations of the unfiltered query (the one that produced a hash join) yields:

Surrogate: +000000000 00:00:06.053000000
Natural  : +000000000 00:00:02.408000000

Still obvious, when you think of it.

Do note that we’re not denormalising yet. There’s still an IP_ADDRESS table, but now it contains the business key as the primary key (the address), not the surrogate key. In more rare querying use-cases, we’ll still join the table, in order to get IP-address related info (such as country).

Taking it to the extreme

The database at this customer site was designed by someone who appreciates purity, and I can certainly relate to that some times. But in this case, it went clearly wrong, because there were many of these queries that were actually filtering for a “natural key” (i.e. a business value), but needed to add yet another JOIN just to do that.

There were more of these cases, for instance:

ISO 639 language codes
ISIN security numbers
Account numbers, which had a formal identifier from an external system

and many more. Each and every time, there were dozens of JOINs required just to get that additional mapping between surrogate key and natural key.

Sometimes, surrogate keys are nice. They do use a little less disk space. Consider the following query (credits go to Stack Overflow user WW.)

SELECT
  owner,
  table_name,
  TRUNC(SUM(bytes)/1024) kb,
  ROUND(ratio_to_report(SUM(bytes)) OVER() * 100) Percent
FROM (
  SELECT
    segment_name table_name,
    owner,
    bytes
  FROM dba_segments
  WHERE segment_type IN (
    'TABLE',
	'TABLE PARTITION',
	'TABLE SUBPARTITION'
  )
  UNION ALL
  SELECT
    i.table_name,
    i.owner,
    s.bytes
  FROM dba_indexes i,
    dba_segments s
  WHERE s.segment_name = i.index_name
  AND s.owner          = i.owner
  AND s.segment_type  IN (
    'INDEX',
	'INDEX PARTITION',
	'INDEX SUBPARTITION'
  )
)
WHERE owner = 'TEST'
AND table_name IN (
  'SESSIONS',
  'IP_ADDRESSES',
  'SESSIONS2'
 )
GROUP BY table_name, owner
ORDER BY SUM(bytes) DESC;

This will show us the disk space used by each object:

TABLE_NAME                   KB    PERCENT
-------------------- ---------- ----------
SESSIONS2                 12288         58
SESSIONS                   8192         39
IP_ADDRESSES                768          4

Yes. We use a little more disk space, because now our primary key in the sessions table is a VARCHAR2(50) rather than a NUMBER(18). But disk space is extremely cheap, whereas wall clock time performance is essential. By removing just a little complexity, we’ve greatly increased performance already for a simple query.

Conclusion

Surrogate keys or natural keys? I say both. Surrogate keys do have advantages. You generally don’t want a 5 column composite primary key. And even less so, you don’t want a 5 column composite foreign key. In these cases, using a surrogate key can add value by removing complexity (and probably increasing performance again). But choose wisely. Ever so often, blindly using surrogate keys will go very wrong, and you’ll pay for it dearly when it comes to querying your data.

Say NO to Excessive Use of Surrogate Keys if Performance Really Matters to You

What’s so great about surrogate keys?

When do surrogate keys make absolutely no sense?

What would the world look like without surrogate keys?

Taking it to the extreme

Conclusion

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112