I inherited a PostgreSQL database that suffered from a lack of rigorous schema design. Column types were used and abused, such as Unix time values in varchar columns. Nullable columns for required data. Obscure naming conventions and magic integer values mixed with empty spaces and sometimes special text values, none of which were documented. Multiple comma-separated ints in text columns.

One of these legacy tables supported a central feature of the site the database backed, and it was quite large. It had no natural key, just a primary key on a sequence value. And yes, the table had duplicates. Many of them. The rate of inserts on the table prevented me from removing the duplicates and adding a unique index concurrently: by the time the concurrent index creation attempted to complete, new duplicates had been inserted and the index was left invalid. And taking the site offline for the multiple hours it would take to fix it was out of the question.

Here’s an example of the schema (after I had removed the unnecessary sequence):

CREATE TABLE links
(
  source_id BIGINT NOT NULL,
  target_id BIGINT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL
);

CREATE INDEX links_source_id_target_id_idx ON links (source_id, target_id);
CREATE INDEX links_target_id_source_id_idx ON links (target_id, source_id);
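
To make the failure mode concrete, here is roughly what happened whenever a straightforward concurrent build was attempted (the duplicated key values shown are illustrative):

CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key
  ON links (source_id, target_id);
-- ERROR:  could not create unique index "links_source_id_target_id_key"
-- DETAIL:  Key (source_id, target_id)=(123, 456) is duplicated.

-- A failed concurrent build leaves an INVALID index behind,
-- which has to be dropped before trying again:
DROP INDEX CONCURRENTLY links_source_id_target_id_key;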

This was an issue I had thought about over the years. Fortunately we didn’t observe any functionality issues due to the duplicates (though that didn’t mean they weren’t there). There were plenty of other tasks to work on, so this festered around in the back of my brain. How to get rid of the duplicates and prevent them from happening again on this huge, active table? If I could just work on it a bit at a time, and prevent duplicates on that bit as I went.

If only I could just enforce uniqueness over a part of the table. Bam. Once I had phrased the problem correctly, the solution became immediately clear: remove duplicates for a portion of the table and use partial unique indexes to enforce uniqueness over the subset I had just cleaned up.
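
The deduplication itself can be done a range at a time. Since the table no longer had a surrogate key, one way to do it (a sketch, not necessarily the exact statement I ran) is to use the system column ctid to tell otherwise-identical rows apart:

-- Delete all but one physical row for each duplicated
-- (source_id, target_id) pair within the first range.
DELETE FROM links a
  USING links b
  WHERE a.source_id = b.source_id
    AND a.target_id = b.target_id
    AND a.ctid > b.ctid
    AND a.source_id >= 0 AND a.source_id < 1000000;

This keeps an arbitrary survivor per pair (the one at the lowest physical position); if it matters which created_at survives, the predicate can be tightened accordingly. With the range clean, the matching partial unique index locks it down: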

CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key_1000000
  ON links (source_id, target_id)
  WHERE source_id >= 0 AND source_id < 1000000;

And on to the next portion:

CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key_2000000
  ON links (source_id, target_id)
  WHERE source_id >= 1000000 AND source_id < 2000000;

I continued this process until the entire table was covered by partial unique indexes.
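
Since CREATE INDEX CONCURRENTLY can’t run inside a transaction block, each statement has to be issued on its own. Rather than typing them all out by hand, the batch can be generated, something along these lines (the range count here is made up):

-- Emit one partial-unique-index statement per 1,000,000-wide
-- slice of source_id; run the output one statement at a time.
SELECT format(
    'CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key_%s'
    ' ON links (source_id, target_id)'
    ' WHERE source_id >= %s AND source_id < %s;',
    (n + 1) * 1000000, n * 1000000, (n + 1) * 1000000)
FROM generate_series(0, 99) AS n;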

There were a few times a duplicate got inserted in a portion that I had deduped but where the unique index hadn’t yet completed. This was easily fixed by removing the (invalid) index and repeating the deduping and index creation.
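
Invalid indexes are easy to spot: pg_index flags them, so a quick catalog query shows anything that needs to be dropped and rebuilt (the index name in the DROP is just an example):

-- List any indexes left INVALID by a failed concurrent build.
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

DROP INDEX CONCURRENTLY links_source_id_target_id_key_2000000;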

To finish up, I created unique indexes across the entire table. With every source_id range already protected by a partial unique index, these final builds could no longer fail on newly inserted duplicates.

CREATE UNIQUE INDEX CONCURRENTLY links_source_id_target_id_key
  ON links (source_id, target_id);
CREATE UNIQUE INDEX CONCURRENTLY links_target_id_source_id_key
  ON links (target_id, source_id);

Then I removed the partial indexes and the original non-unique indexes, which the new unique indexes had made redundant. Voilà! The table was deduped and uniqueness enforced: a great way to take advantage of the partial index feature in PostgreSQL to resolve a thorny issue on a very large and active table. And the entire process was performed while the database was online!
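
For completeness, that cleanup amounts to a series of concurrent drops, one per partial index plus the two originals:

DROP INDEX CONCURRENTLY links_source_id_target_id_key_1000000;
DROP INDEX CONCURRENTLY links_source_id_target_id_key_2000000;
-- ...one drop per remaining partial index...
DROP INDEX CONCURRENTLY links_source_id_target_id_idx;
DROP INDEX CONCURRENTLY links_target_id_source_id_idx;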