The following two indexes are redundant in most SQL databases:
CREATE INDEX i_actor_1 ON actor (last_name); CREATE INDEX i_actor_2 ON actor (last_name, first_name);
It is usually safe to drop the first index, because all queries that query the LAST_NAME
column only can still profit from the second index I_ACTOR_2
. The reason being that LAST_NAME
is the first column of the composite index I_ACTOR_2
(it would be a different story, if it weren’t the first column).
Note: It is usually safe to drop the first index, because the benefits probably outweigh the cost:
Benefits of dropping
Costs of dropping
- Querying a composite index can be slightly slower as can be seen in the below benchmark
Let’s see the costs of dropping the index below for Oracle, PostgreSQL, and SQL Server
Oracle
Preparation:
CREATE TABLE t ( a NUMBER(10) NOT NULL, b NUMBER(10) NOT NULL ); INSERT INTO t (a, b) SELECT level, level FROM dual CONNECT BY level <= 100000; CREATE INDEX i1 ON t(a); CREATE INDEX i2 ON t(a, b); EXEC dbms_stats.gather_table_stats('TEST', 'T');
Benchmark:
SET SERVEROUTPUT ON CREATE TABLE results ( run NUMBER(2), stmt NUMBER(2), elapsed NUMBER ); DECLARE v_ts TIMESTAMP WITH TIME ZONE; v_repeat CONSTANT NUMBER := 2000; BEGIN -- Repeat benchmark several times to avoid warmup penalty FOR r IN 1..5 LOOP v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT /*+INDEX(t i1)*/ * FROM t WHERE a = 1 ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, 1, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); v_ts := SYSTIMESTAMP; FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT /*+INDEX(t i2)*/ * FROM t WHERE a = 1 ) LOOP NULL; END LOOP; END LOOP; INSERT INTO results VALUES (r, 2, SYSDATE + ((SYSTIMESTAMP - v_ts) * 86400) - SYSDATE); END LOOP; FOR rec IN ( SELECT run, stmt, CAST(elapsed / MIN(elapsed) OVER() AS NUMBER(10, 5)) ratio FROM results ) LOOP dbms_output.put_line('Run ' || rec.run || ', Statement ' || rec.stmt || ' : ' || rec.ratio); END LOOP; END; / DROP TABLE results;
The result being:
Run 1, Statement 1 : 1.4797 Run 1, Statement 2 : 1.45545 Run 2, Statement 1 : 1.1997 Run 2, Statement 2 : 1.01121 Run 3, Statement 1 : 1.13606 Run 3, Statement 2 : 1 Run 4, Statement 1 : 1.13455 Run 4, Statement 2 : 1.00242 Run 5, Statement 1 : 1.13303 Run 5, Statement 2 : 1.00606
Some notes on benchmarks here.
The fastest query execution in the above result yields 1, the other executions are multiples of 1. Yes, there’s a 10% difference in this case, so as you can see. The benefits (faster insertions) certainly should outweight the cost (slower queries), so, don’t apply this advice in a read-heavy / write-rarely database.
PostgreSQL
A similar difference can be seen in a PostgreSQL benchmark. No hints can be used to choose indexes, so we’re simply creating two tables:
CREATE TABLE t1 ( a INT NOT NULL, b INT NOT NULL ); CREATE TABLE t2 ( a INT NOT NULL, b INT NOT NULL ); INSERT INTO t1 (a, b) SELECT s, s FROM generate_series(1, 100000) AS s(s); INSERT INTO t2 (a, b) SELECT s, s FROM generate_series(1, 100000) AS s(s); CREATE INDEX i1 ON t1(a); CREATE INDEX i2 ON t2(a, b); ANALYZE t1; ANALYZE t2;
Benchmark:
DO $$ DECLARE v_ts TIMESTAMP; v_repeat CONSTANT INT := 10000; rec RECORD; BEGIN -- Repeat benchmark several times to avoid warmup penalty FOR r IN 1..5 LOOP v_ts := clock_timestamp(); FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT * FROM t1 WHERE a = 1 ) LOOP NULL; END LOOP; END LOOP; RAISE INFO 'Run %, Statement 1: %', r, (clock_timestamp() - v_ts); v_ts := clock_timestamp(); FOR i IN 1..v_repeat LOOP FOR rec IN ( SELECT * FROM t2 WHERE a = 1 ) LOOP NULL; END LOOP; END LOOP; RAISE INFO 'Run %, Statement 2: %', r, (clock_timestamp() - v_ts); RAISE INFO ''; END LOOP; END$$;
Result:
INFO: Run 1, Statement 1: 00:00:00.071891 INFO: Run 1, Statement 2: 00:00:00.080833 INFO: Run 2, Statement 1: 00:00:00.076329 INFO: Run 2, Statement 2: 00:00:00.079772 INFO: Run 3, Statement 1: 00:00:00.073137 INFO: Run 3, Statement 2: 00:00:00.079483 INFO: Run 4, Statement 1: 00:00:00.073456 INFO: Run 4, Statement 2: 00:00:00.081508 INFO: Run 5, Statement 1: 00:00:00.077148 INFO: Run 5, Statement 2: 00:00:00.083535
SQL Server
Preparation:
CREATE TABLE t ( a INT NOT NULL, b INT NOT NULL ); WITH s(s) AS ( SELECT 1 UNION ALL SELECT s + 1 FROM s WHERE s < 100 ) INSERT INTO t SELECT TOP 100000 row_number() over(ORDER BY (SELECT 1)), row_number() over(ORDER BY (select 1)) FROM s AS s1, s AS s2, s AS s3; CREATE INDEX i1 ON t(a); CREATE INDEX i2 ON t(a, b); UPDATE STATISTICS t;
Benchmark:
DECLARE @ts DATETIME; DECLARE @repeat INT = 2000; DECLARE @r INT; DECLARE @i INT; DECLARE @dummy VARCHAR; DECLARE @s1 CURSOR; DECLARE @s2 CURSOR; DECLARE @results TABLE ( run INT, stmt INT, elapsed DECIMAL ); SET @r = 0; WHILE @r < 5 BEGIN SET @r = @r + 1 SET @s1 = CURSOR FOR SELECT b FROM t WITH (INDEX (i1)) WHERE a = 1; SET @s2 = CURSOR FOR SELECT b FROM t WITH (INDEX (i2)) WHERE a = 1; SET @ts = current_timestamp; SET @i = 0; WHILE @i < @repeat BEGIN SET @i = @i + 1 OPEN @s1; FETCH NEXT FROM @s1 INTO @dummy; WHILE @@FETCH_STATUS = 0 BEGIN FETCH NEXT FROM @s1 INTO @dummy; END; CLOSE @s1; END; DEALLOCATE @s1; INSERT INTO @results VALUES (@r, 1, DATEDIFF(ms, @ts, current_timestamp)); SET @ts = current_timestamp; SET @i = 0; WHILE @i < @repeat BEGIN SET @i = @i + 1 OPEN @s2; FETCH NEXT FROM @s2 INTO @dummy; WHILE @@FETCH_STATUS = 0 BEGIN FETCH NEXT FROM @s2 INTO @dummy; END; CLOSE @s2; END; DEALLOCATE @s2; INSERT INTO @results VALUES (@r, 1, DATEDIFF(ms, @ts, current_timestamp)); END; SELECT 'Run ' + CAST(run AS VARCHAR) + ', Statement ' + CAST(stmt AS VARCHAR) + ': ' + CAST(CAST(elapsed / MIN(elapsed) OVER() AS DECIMAL(10, 5)) AS VARCHAR) FROM @results;
Result:
Run 1, Statement 1: 1.22368 Run 1, Statement 1: 1.09211 Run 2, Statement 1: 1.05263 Run 2, Statement 1: 1.09211 Run 3, Statement 1: 1.00000 Run 3, Statement 1: 1.05263 Run 4, Statement 1: 1.05263 Run 4, Statement 1: 1.00000 Run 5, Statement 1: 1.09211 Run 5, Statement 1: 1.05263
As can be seen, predictably, in all databases the smaller non-composite index is slightly faster for this type of query than the composite index. Yet both indexes can be used for the query in a reasonable way, so if disk space / insertion speed is an issue, the redundant single-column index can be dropped.
How to find such indexes
The following query will help you detect such indexes in Oracle, PostgreSQL, and SQL Server:
Oracle
WITH indexes AS ( SELECT i.owner, i.index_name, i.table_name, listagg(c.column_name, ', ') WITHIN GROUP (ORDER BY c.column_position) AS columns FROM all_indexes i JOIN all_ind_columns c ON i.owner = c.index_owner AND i.index_name = c.index_name GROUP BY i.owner, i.table_name, i.index_name, i.leaf_blocks ) SELECT i.owner, i.table_name, i.index_name AS "Deletion candidate index", i.columns AS "Deletion candidate columns", j.index_name AS "Existing index", j.columns AS "Existing columns" FROM indexes i JOIN indexes j ON i.owner = j.owner AND i.table_name = j.table_name AND j.columns LIKE i.columns || ',%'
Result:
TABLE_NAME delete index columns existing index columns ------------------------------------------------------------------------- T I1 A I2 A, B
In short, it lists all the indexes whose columns are a prefix of another index’s columns
PostgreSQL
Get ready for a really nifty query. Here’s how to discover redundant indexes in PostgreSQL, which unfortunately doesn’t seem to have an easy, out-of-the-box dictionary view to discover index columns:
WITH indexes AS ( SELECT trel.relnamespace AS schema_name, trel.relname AS table_name, irel.relname AS index_name, string_agg(a.attname, ', ' ORDER BY c.ordinality) AS columns FROM pg_index AS i JOIN pg_class AS trel ON trel.oid = i.indrelid JOIN pg_namespace AS tnsp ON trel.relnamespace = tnsp.oid JOIN pg_class AS irel ON irel.oid = i.indexrelid JOIN pg_attribute AS a ON trel.oid = a.attrelid JOIN LATERAL unnest(i.indkey) WITH ORDINALITY AS c(colnum, ordinality) ON a.attnum = c.colnum GROUP BY i, trel.relnamespace, trel.relname, irel.relname ) SELECT i.table_name, i.index_name AS "Deletion candidate index", i.columns AS "Deletion candidate columns", j.index_name AS "Existing index", j.columns AS "Existing columns" FROM indexes i JOIN indexes j ON i.schema_name = j.schema_name AND i.table_name = j.table_name AND j.columns LIKE i.columns || ',%';
This is a really nice case of lateral unnesting with ordinality, which you should definitely add to your PostgreSQL tool chain.
SQL Server
Now, SQL Server doesn’t have a nice STRING_AGG
function (yet), but we can work around this using STUFF
and XML
to get the same query.
Of course, there are other solutions using recursive SQL, but I’m too lazy to translate the simple string pattern-matching approach to something recursive.
WITH i AS ( SELECT s.name AS schema_name, t.name AS table_name, i.name AS index_name, c.name AS column_name, ic.index_column_id FROM sys.indexes i JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id JOIN sys.tables t ON i.object_id = t.object_id JOIN sys.schemas s ON t.schema_id = s.schema_id ), indexes AS ( SELECT schema_name, table_name, index_name, STUFF(( SELECT ',' + j.column_name FROM i j WHERE i.table_name = j.table_name AND i.index_name = j.index_name FOR XML PATH('') -- Yay, XML in SQL! ), 1, 1, '') columns FROM i GROUP BY schema_name, table_name, index_name ) SELECT i.schema_name, i.table_name, i.index_name AS "Deletion candidate index", i.columns AS "Deletion candidate columns", j.index_name AS "Existing index", j.columns AS "Existing columns" FROM indexes i JOIN indexes j ON i.schema_name = j.schema_name AND i.table_name = j.table_name AND j.columns LIKE i.columns + ',%';
Conclusion
Now go run the above query on your production database and… Very carefully and reasonably think about whether you really want to drop those indexes
Filed under: sql Tagged: Indexing, Oracle, performance, PostgreSQL, Redundancy, Redundant indexes, sql, SQL Server
