I inherited a PostgreSQL database that suffered from a lack of rigorous schema design. Column types were used and abused: Unix time values stored in varchar columns, nullable columns for required data, obscure naming conventions, magic integer values mixed with empty strings and the occasional special text value (none of it documented), and multiple comma-separated integers packed into single text columns.
One of these legacy tables supported a central feature of the site the database backed and was quite large. It had no natural key, just a primary key on a sequence value. And yes, the table had duplicates. Many of them. The rate of inserts on the table prevented me from removing the duplicates and adding a unique index concurrently: by the time the concurrent index build attempted to complete, new duplicates had been inserted and the build would fail, leaving an invalid index behind. And taking the site offline for the multiple hours it would take to fix was out of the question.
Here’s an example of the schema (after I had removed the unnecessary sequence):
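The original schema isn't reproduced here, so this is a minimal stand-in; the table and column names are my own invention, not the real ones:

```sql
-- Hypothetical stand-in for the legacy table. After dropping the
-- unnecessary sequence, no key of any kind remains.
CREATE TABLE user_item (
    user_id    bigint NOT NULL,
    item_id    bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Logically, (user_id, item_id) should be unique -- but nothing
-- enforces it, and the table contains duplicate rows.
```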
This was an issue I had thought about over the years. Fortunately we didn’t observe any functionality issues due to the duplicates (though that didn’t mean they weren’t there). There were plenty of other tasks to work on, so this festered in the back of my brain. How to get rid of the duplicates and prevent them from happening again on this huge, active table? If only I could work on it a bit at a time, and prevent duplicates on each bit as I went.
If only I could just enforce uniqueness over a part of the table. Bam. Once I had phrased the problem correctly, the solution became immediately clear: remove duplicates for a portion of the table and use partial unique indexes to enforce uniqueness over the subset I had just cleaned up.
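One round of that process might look like the following, using the hypothetical `user_item` table from earlier. The range boundaries and index name are assumptions; the key point is that the slices range over a column that is part of the would-be unique key, so any pair of duplicate rows always lands in the same slice:

```sql
-- Remove duplicates within one slice of the key space, keeping a single
-- row per (user_id, item_id). With no key to compare on, ctid (the row's
-- physical location) distinguishes otherwise-identical rows.
DELETE FROM user_item a
USING user_item b
WHERE a.user_id = b.user_id
  AND a.item_id = b.item_id
  AND a.ctid < b.ctid
  AND a.user_id < 1000000;

-- Enforce uniqueness over just that slice, without blocking writes.
CREATE UNIQUE INDEX CONCURRENTLY user_item_uniq_p0
    ON user_item (user_id, item_id)
    WHERE user_id < 1000000;
```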
And on to the next portion:
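Continuing the sketch, the next slice is the same dance over the next range:

```sql
-- Dedupe the next slice of the key space...
DELETE FROM user_item a
USING user_item b
WHERE a.user_id = b.user_id
  AND a.item_id = b.item_id
  AND a.ctid < b.ctid
  AND a.user_id >= 1000000 AND a.user_id < 2000000;

-- ...and lock it down with another partial unique index.
CREATE UNIQUE INDEX CONCURRENTLY user_item_uniq_p1
    ON user_item (user_id, item_id)
    WHERE user_id >= 1000000 AND user_id < 2000000;
```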
I continued this process until the entire table was covered by partial unique indexes.
There were a few times a duplicate got inserted into a portion I had already deduped but whose unique index hadn’t yet finished building. This was easily fixed by dropping the invalid index, then repeating the dedupe and index creation for that portion.
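A failed `CREATE INDEX CONCURRENTLY` leaves the index behind marked `INVALID` rather than removing it, so the retry (sketched here with the hypothetical names from above) starts with a drop:

```sql
-- The failed concurrent build left an INVALID index; clear it out.
DROP INDEX CONCURRENTLY user_item_uniq_p1;

-- Then repeat the DELETE and CREATE UNIQUE INDEX CONCURRENTLY
-- for that slice until the build completes cleanly.
```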
To finish up, I created a unique index across the entire table.
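With every row already covered by a partial unique index, no new duplicate can sneak in mid-build, so this time the concurrent creation can complete. Again with the hypothetical names:

```sql
CREATE UNIQUE INDEX CONCURRENTLY user_item_uniq
    ON user_item (user_id, item_id);
```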
Then, I removed the partial indexes and the original, non-key indexes. Voilà! The table was deduped and uniqueness enforced going forward: a great way to take advantage of PostgreSQL's partial index feature to resolve a thorny issue on a very large and active table. And the entire process was performed while the database was online!
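For completeness, the cleanup sketched with the same hypothetical index names:

```sql
-- The full unique index now covers everything the partials did.
DROP INDEX CONCURRENTLY user_item_uniq_p0;
DROP INDEX CONCURRENTLY user_item_uniq_p1;
-- ...and so on for each remaining partial index,
-- plus the original non-key indexes.
```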