Bail out data duplicates with SQL

A good relational database model should avoid data duplicates. But sometimes you get working on a migration project. Maybe a project with a bad data basis.

The migration process seems to work, it looks good. Then it happens: the migration script aborts with a constraint error. You dig into the problem and discover that the suspicious table actually contains duplicate data, without a unique ID to distinguish the records, of course.

Here are two examples of how you can solve the problem in PostgreSQL or Oracle:

PostgreSQL solution

Firstly the PostgreSQL way. We create data with a duplicate.

playground=# create table mydata(name text);
CREATE TABLE
playground=# insert into mydata values ('Atari');
INSERT 0 1
playground=# insert into mydata values ('Commodore');
INSERT 0 1
playground=# insert into mydata values ('Sinclair ZX Spectrum');
INSERT 0 1
playground=# select * from mydata;
         name         
----------------------
 Atari
 Commodore
 Sinclair ZX Spectrum
(3 rows)

playground=# insert into mydata values ('Atari');
INSERT 0 1
playground=# select * from mydata;
         name         
----------------------
 Atari
 Commodore
 Sinclair ZX Spectrum
 Atari
(4 rows)

playground=# select name, count(*) from mydata group by name;
         name         | count 
----------------------+-------
 Commodore            |     1
 Sinclair ZX Spectrum |     1
 Atari                |     2
(3 rows)

And now remove the duplicates with this command:

playground=# delete from mydata a using  mydata b where a=b and a.ctid <  b.ctid;
DELETE 1
playground=# select * from mydata;
         name         
----------------------
 Commodore
 Sinclair ZX Spectrum
 Atari
(3 rows)

playground=# select name, count(*) from mydata group by name;
         name         | count 
----------------------+-------
 Commodore            |     1
 Sinclair ZX Spectrum |     1
 Atari                |     1
(3 rows)

Because both duplicate records were equal it doesn't matter which is really deleted internally.

Oracle solution

Actually the same logic but slightly different:

delete from mydata where rowid in (select a.rowid from mydata a, mydata b where a.name=b.name and a.rowid < b.rowid);

Same as mentioned above, it's irrelevant which duplicate is removed internally.

0
Subscribe to my newsletter

Read articles from Christoph Bilz (HamsterIdea.tech) directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Christoph Bilz (HamsterIdea.tech)
Christoph Bilz (HamsterIdea.tech)

HamsterIdea.tech stands for professional worldwide software development. I am experienced in software technologies since over 30 years and now it's time to share thoughts and spirits to the community. Let's start.