Using Partial Indexes to Add a Unique Index
I inherited a PostgreSQL database that suffered from a lack of rigorous schema design. Column types were used and abused: Unix time values stored in varchar columns, nullable columns for required data, multiple comma-separated ints packed into a single text column. Naming conventions were obscure, and magic integer values mixed with empty strings and the occasional special text value, none of it documented.
One of these legacy tables supported a central feature of the site the database backed, and it was quite large. It had no natural key, just a primary key on a sequence value. And yes, the table had duplicates. Many of them. The rate of inserts was too high to remove the duplicates and then add a unique index concurrently: by the time the CREATE UNIQUE INDEX CONCURRENTLY build finished, new duplicates had already been inserted, and the index was left invalid. And taking the site offline for the several hours a fix would take was out of the question.
Here’s an example of the schema (after I had removed the unnecessary sequence):
CREATE TABLE links
(
    source_id  BIGINT      NOT NULL,
    target_id  BIGINT      NOT NULL,
    created_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX links_source_id_target_id_idx ON links (source_id, target_id);
CREATE INDEX links_target_id_source_id_idx ON links (target_id, source_id);
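Given that schema, a grouping query along these lines (illustrative) surfaces the duplicated pairs:
-- Surface (source_id, target_id) pairs that appear more than once.
SELECT source_id, target_id, count(*) AS copies
FROM links
GROUP BY source_id, target_id
HAVING count(*) > 1;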
This was an issue I had thought about over the years. Fortunately we never observed any functional problems caused by the duplicates (though that didn't mean there were none). There were plenty of other tasks to work on, so this festered in the back of my brain. How to get rid of the duplicates, and keep them from coming back, on this huge, active table? If I could just work on it a bit at a time and prevent duplicates on that bit.
If only I could enforce uniqueness over just part of the table. Bam. Once I had phrased the problem that way, the solution was immediately clear: remove the duplicates from a portion of the table, then use a partial unique index to enforce uniqueness over the subset I had just cleaned up.
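The dedupe itself was the unglamorous part: keep one row of each duplicated pair and delete the rest. A sketch of the idea (keeping the earliest row and addressing the rest by ctid, since the table has no key to anchor on):
-- Dedupe one range: keep the earliest row of each duplicated
-- (source_id, target_id) pair and delete the rest by ctid.
DELETE FROM links
WHERE ctid IN (
    SELECT ctid
    FROM (SELECT ctid,
                 row_number() OVER (PARTITION BY source_id, target_id
                                    ORDER BY created_at) AS rn
          FROM links
          WHERE source_id >= 0 AND source_id < 1000000) ranked
    WHERE rn > 1
);
With the first range clean, a partial unique index locks it down: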
CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key_1000000
ON links (source_id, target_id)
WHERE source_id >= 0 AND source_id < 1000000;
And on to the next portion:
CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key_2000000
ON links (source_id, target_id)
WHERE source_id >= 1000000 AND source_id < 2000000;
I continued this process until the entire table was covered by partial unique indexes.
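Typing out a statement per range gets old quickly. One way to mechanize it (a sketch; any scripting approach works) is to generate the remaining DDL and run it with psql's \gexec, which executes each result row as its own statement in autocommit mode, as CREATE INDEX CONCURRENTLY requires. Each range still needs its dedupe pass first.
-- Emit CREATE UNIQUE INDEX statements for the remaining ranges;
-- in psql, follow this query with \gexec to execute each row.
SELECT format(
    'CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key_%s '
    'ON links (source_id, target_id) '
    'WHERE source_id >= %s AND source_id < %s',
    (n + 1) * 1000000, n * 1000000, (n + 1) * 1000000)
FROM generate_series(2, (SELECT max(source_id) FROM links) / 1000000) AS n;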
A few times, a duplicate was inserted into a portion I had deduped before its unique index build completed, so the CREATE INDEX CONCURRENTLY failed and left an invalid index behind. This was easily fixed: drop the invalid index, re-dedupe the range, and create the index again.
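Spotting the casualties is easy: a failed CONCURRENTLY build leaves its index flagged as invalid in pg_index. For example:
-- List indexes left invalid by failed CONCURRENTLY builds.
SELECT c.relname
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;
-- Drop the failed index (name illustrative), re-dedupe its range, retry.
DROP INDEX CONCURRENTLY links_source_id_target_id_key_2000000;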
To finish up, I created unique indexes across the entire table, one per column order, to take over from the original non-unique indexes.
CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key
ON links (source_id, target_id);
CREATE UNIQUE INDEX CONCURRENTLY links_target_id_source_id_key
ON links (target_id, source_id);
Then I removed the partial indexes and the original, non-unique indexes.
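With the names from the examples above, that cleanup looks something like this:
DROP INDEX CONCURRENTLY links_source_id_target_id_key_1000000;
DROP INDEX CONCURRENTLY links_source_id_target_id_key_2000000;
-- ...and so on, one per range...
DROP INDEX CONCURRENTLY links_source_id_target_id_idx;
DROP INDEX CONCURRENTLY links_target_id_source_id_idx;
Voilà! The table was deduped and uniqueness enforced: a great way to take advantage of PostgreSQL's partial index feature to resolve a thorny issue on a very large, active table. And the entire process was performed while the database stayed online!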