DISTINCT

(1) By Alex_Don on 2022-10-05 17:35:27 [link] [source]

Once a month a colleague sends me a file in CSV format, which I import into the same table as follows:

sqlite3 -separator ';' database.db ".import csv_file.txt abc"

The problem is that sometimes records in the file are repeated and of course it turns out that the table contains duplicates; at the moment I remove the duplicates in the table as follows:

SELECT DISTINCT * FROM abc

It works, I get output without duplicates, but can't figure out how I can update the existing abc table directly with the received data from the DISTINCT output. I can do something like the following:

CREATE TABLE abc2 AS SELECT DISTINCT * FROM abc
DROP TABLE abc
ALTER TABLE abc2 RENAME TO abc

But this will create a new table that will be missing type information.

Is there some way to run DISTINCT and directly update the already existing abc table? Thx!

P.S. Yes, I am a SQL newbie.

(2) By David Raymond (dvdraymond) on 2022-10-05 18:06:12 in reply to 1 [link] [source]

So if you're looking for duplicates of all the fields, maybe use the row_number window function, partitioned over all the fields, in a sub query to identify rowid's to delete?

sqlite> create table t (a, b, c);

sqlite> insert into t values (1, 2, 3), (2, 3, 4), (3, 4, 5), (1, 2, 3);

sqlite> select rowid, a, b, c, row_number() over (partition by a, b, c) from t;
rowid|a|b|c|row_number() over (partition by a, b, c)
1|1|2|3|1
4|1|2|3|2
2|2|3|4|1
3|3|4|5|1

sqlite> select id_to_delete from (select rowid as id_to_delete, row_number() over (partition by a, b, c) as row_number from t) where row_number > 1;
id_to_delete
4

sqlite> delete from t where rowid in (select id_to_delete from (select rowid as id_to_delete, row_number() over (partition by a, b, c) as row_number from t) where row_number > 1);

sqlite> select * from t;
a|b|c
1|2|3
2|3|4
3|4|5

sqlite>

(3) By David Raymond (dvdraymond) on 2022-10-05 18:37:59 in reply to 2 [link] [source]

My brain is wondering now. I believe that the window function means that the sub-query is going to scan through the entire table and finish before the outer main query starts deleting anything, and thus you don't have to worry about stuff getting deleted from the table while the sub-query is trying to access it.

But because I'm not positive that's the case, or if I just got lucky with my trivial example, I'll let more knowledgeable people chime in.

If you're paranoid you can add a step or two there.

sqlite> create table t (a, b, c);

sqlite> insert into t values (1, 2, 3), (2, 3, 4), (3, 4, 5), (1, 2, 3);

sqlite> create temp table ids_to_delete (id integer primary key);

sqlite> insert into ids_to_delete select id_to_delete from (select rowid as id_to_delete, row_number() over (partition by a, b, c) as row_number from t) where row_number > 1;

sqlite> select * from ids_to_delete;
id
4

sqlite> delete from t where rowid in ids_to_delete;

sqlite> select * from t;
a|b|c
1|2|3
2|3|4
3|4|5

sqlite> drop table ids_to_delete;

sqlite>

(4) By Alex_Don on 2022-10-05 19:06:18 in reply to 3 [link] [source]

David, thanks for the answer! but it looks very complicated and scary to me. I am very surprised that there is no easier way; well, what if we move away from the fact that it is necessary to update the already created table, and instead go by creating a clone of the table with the relevant information about types.

Something like that:

SELECT sql FROM sqlite_master WHERE type='table' AND name='abc'
Execute the result by changing the name of the table to the temporary name (abc2)
Copy the data received with DISTINCT into abc2
DROP TABLE abc
ALTER TABLE abc2 RENAME TO abc

Don't know how to do steps 2 and 3 yet, but I think it should be possible.

(5.1) By Ryan Smith (cuz) on 2022-10-05 20:33:11 edited from 5.0 in reply to 1 [source]

You are missing just one step.

This will do it easily:

CREATE TABLE abc2 AS SELECT DISTINCT * FROM abc;
DELETE FROM abc;
INSERT INTO abc SELECT * FROM abc2;
DROP TABLE abc2;

EDIT: Make sure you run this in a transaction!

Note: It does create 2 tables which can be a hefty file and may be unwanted if your dataset is big, such as 100GB+

Another way might be to simply remove duplicate rows after inserting, which may or may not also seem scary:

DELETE FROM abc WHERE rowid IN (
    SELECT B.rowid
      FROM abc AS A
      JOIN abc AS B ON B.rowid > A.rowid  
     WHERE B.a=A.a AND B.b = A.b AND B.c = A.c AND ..{all cols}..
)

Essentially it gets a list of all rows exactly equal but where the second row has a bigger rowid than its matching row, listing the rowid for each such second duplicate row, and then simply deletes all those specific rows.

The "rowid" is an invisible column in SQLite which is the unique identity to any row in the database.

(There are other things to say, such as you can define INTEGER column to alias the rowid, and that there exists a WITHOUT ROWID type of schema, etc, but I think that's outside the scope of this answer).

(6.1) By Alex_Don on 2022-10-05 21:16:47 edited from 6.0 in reply to 5.1 [link] [source]

Ryan, thanks for the answer.

CREATE TABLE abc2 AS SELECT DISTINCT * FROM abc;
DELETE FROM abc;
INSERT INTO abc SELECT * FROM abc2;
DROP TABLE abc2;

It works, thx.

My database is not large, about 2 GB at most. Just wondering which of the suggested ways is the most effective and fastest to deal with the task?

(9) By Ryan Smith (cuz) on 2022-10-06 09:56:34 in reply to 6.1 [link] [source]

Just wondering which of the suggested ways is the most effective and fastest to deal with the task?

There is no canonical answer to that. It depends on lots of things:

Is there any concurrency to care about?
Is there a Time and/or Size constraint to the operation?
Are there other database objects involved with the table (triggers, indexes, etc)?
What is the magnitude of duplications?
How daunting is the complex SQL to the programmer? etc.

The methods given are all fast in SQLite. Deleting duplicates should be much faster if there are only a few duplicates in a large table (<5% or so). If there are 25%+duplicates the rewrite is almost certainly faster.

Further to that, if speed is paramount, dropping and recreating any triggers and indexes on the table before and after inserting the data would help too.

At the end of the day, the only real answer is testing the methods in your specific environment and seeing which are better/faster for you.

(7.2) By Sunny Saini (SunnySaini_com) on 2022-10-06 08:29:38 edited from 7.1 in reply to 5.1 [link] [source]

Your idea is very nice however, duplicate rows won't be deleted if the duplicate rows contain null field(s).

Thus, I improve your query as:

DELETE FROM abc WHERE rowid IN ( SELECT B.rowid FROM abc AS A JOIN abc AS B ON B.rowid > A.rowid
WHERE B.a is A.a AND B.b is A.b AND B.c is A.c AND ..{all cols}.. );

(8) By Ryan Smith (cuz) on 2022-10-06 09:36:41 in reply to 7.2 [link] [source]

Indeed yes - nicely spotted, using the IS operator is better.

(10) By Holger J (holgerj) on 2022-10-06 12:17:22 in reply to 1 [link] [source]

If you always import the whole table (and not add new rows to the existing table), you might consider removing duplicate rows on the shell level before import.

The sort command can be of help when used with option -u (for unique).

The output of sort can be fed directly into sqlite3 by calling sort from .import

(11) By anonymous on 2022-10-09 13:54:52 in reply to 1 [link] [source]

Easiest solution I've come to use is simply creating a unique constraint on the table then import the data and ignore the errors.

(12) By Chris Locke (chrisjlocke1) on 2022-10-09 15:00:41 in reply to 11 [link] [source]

Depends if you can be sure what you allow is the latest record and what you're ignoring is the oldest data... ;)