how DISTINCT remove the duplicate rows?

(1) By anonymous on 2021-04-13 17:38:04 [link] [source]

Hello,

I am looking for information in the documentation about how the DISTINCT keyword removes the duplicate rows, but couldn't find it.

I want to know what the result shold be when I am using this keyword on a BLOB column.

For example,

CREATE TABLE tab(c);
INSERT INTO tab values (1), (1.0);
SELECT DISTINCT c FROM tab;

The output will be 1. But why it isn't 1.0?

I want to know if it's been defined.

(2) By David Raymond (dvdraymond) on 2021-04-13 17:58:10 in reply to 1 [source]

In 2.4 of the SELECT documentation there's this:

4 . Removal of duplicate rows (DISTINCT processing).

One of the ALL or DISTINCT keywords may follow the SELECT keyword in a simple SELECT statement. If the simple SELECT is a SELECT ALL, then the entire set of result rows are returned by the SELECT. If neither ALL or DISTINCT are present, then the behavior is as if ALL were specified. If the simple SELECT is a SELECT DISTINCT, then duplicate rows are removed from the set of result rows before it is returned. For the purposes of detecting duplicate rows, two NULL values are considered to be equal. The usual rules apply for selecting a collation sequence to compare text values.

So it looks like it uses =/IS to compare them. And since 1 = 1.0 they're considered the same. Which one you get back is probably which one came first as it was traversing the values.

(3) By anonymous on 2021-04-13 18:12:12 in reply to 2 [link] [source]

Thank you David.

I've read the passage you quoted, but I am still not sure how SQLite choose between the value-equal rows. If it's like what you said, will it be some randomness in the results?

If it's possible, I want to hear from developers.

(5) By Keith Medcalf (kmedcalf) on 2021-04-13 18:29:27 in reply to 3 [link] [source]

The first projection result that is distinct will be output. Subsequent duplicates will be suppressed.

(6) By anonymous on 2021-04-13 18:51:48 in reply to 5 [link] [source]

Thank you Keith.

Does this mean that the final output row depends on the order of scanning the projection candidates?

Forget this example. By changing the order of scanning origin table, just like adding and using an index, the order of projection candidates will change with it. In that case, will the result change from 1 to 1.0? Isn't it confusing?

(7) By Larry Brasfield (larrybr) on 2021-04-13 18:58:18 in reply to 6 [link] [source]

Does this mean that the final output row depends on the order of scanning the projection candidates?

Yes.

Forget this example. By changing the order of scanning origin table, just like adding and using an index, the order of projection candidates will change with it. In that case, will the result change from 1 to 1.0? Isn't it confusing?

Yes, the result may change as you suggest. Perhaps that will confuse some people whose notion of what "DISTINCT" means differs from what the implementation does. However, avoiding that confusion would likely confuse other people with yet another expectation. People are often confused by their query results. They are best served, IMHO, by an implementation with well documented and human-comprehensible rules. I do not see anything in this thread to suggest a different resolution would be less confusing.

(8) By Keith Medcalf (kmedcalf) on 2021-04-13 19:09:39 in reply to 6 [link] [source]

Yes. Example:

sqlite> CREATE TABLE tab(c);
sqlite> create index ix on tab(c, typeof(c) desc);
sqlite> insert into tab values (1),(1.0);
sqlite> .eqp full
sqlite> SELECT DISTINCT c FROM tab;
QUERY PLAN
`--SCAN tab USING COVERING INDEX ix (~1048576 rows)
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     16    0                    0   Start at 16
1     Null           1     2     0                    8   r[2]=NULL
2     OpenRead       2     3     0     k(3,,-,)       0   root=3 iDb=0; ix
3     ColumnsUsed    2     0     0     1              0
4     Explain        4     0     0     SCAN tab USING COVERING INDEX ix (~1048576 rows)  0
5     Noop           0     0     0                    0   Begin WHERE-loop0: tab
6     Rewind         2     14    1     0              0
7       Noop           0     0     0                    0   Begin WHERE-core
8       Column         2     0     1                    0   r[1]=tab.c
9       Eq             1     13    2     BINARY-8       128  if r[2]==r[1] goto 13
10      Copy           1     2     0                    0   r[2]=r[1]
11      ResultRow      1     1     0                    0   output=r[1]
12      Noop           0     0     0                    0   End WHERE-core
13    Next           2     7     0                    1
14    Noop           0     0     0                    0   End WHERE-loop0: tab
15    Halt           0     0     0                    0
16    Transaction    0     0     2     0              1   usesStmtJournal=0
17    Goto           0     1     0                    0
┌─────┐
│  c  │
├─────┤
│ 1.0 │
└─────┘

Note that in this case the solution is slightly different. The table is traversed using the index and output rows that are duplicates of the preceding row are suppressed (since it is known that the results will be in order). THe index causes the value 1.0 (type REAL) to be processed before value 1 (type integer) because the word "real" sorts before the word "integer" (in descending order).

(9) By anonymous on 2021-04-13 19:27:04 in reply to 8 [link] [source]

Thank you Keith, that is exactly what I wanted.

(4) By Keith Medcalf (kmedcalf) on 2021-04-13 18:27:37 in reply to 1 [link] [source]

It is pretty simple.

For each projection candidate generate a row/key structure.
If the row/key exists in the ephemeral filter table go to step 5.
Add the row/key to the ephemeral filter table.
Output the candidate row.
If we have not reached the end of the candidates, go to step 1 to process the next candidate.
Clean up and discard the ephemeral table and halt.

You will note that this is exactly the procedure that SQLite3 uses to process the statement:

sqlite> CREATE TABLE tab(c);
sqlite> INSERT INTO tab values (1), (1.0);
sqlite> .eqp full
sqlite> SELECT DISTINCT c FROM tab;
QUERY PLAN
|--SCAN tab (~1048576 rows)
`--USE TEMP B-TREE FOR DISTINCT
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     17    0                    0   Start at 17
1     OpenEphemeral  1     0     0     k(1,B)         8   nColumn=0
2     OpenRead       0     2     0     1              0   root=2 iDb=0; tab
3     ColumnsUsed    0     0     0     1              0
4     Explain        4     0     0     SCAN tab (~1048576 rows)  0
5     Noop           0     0     0                    0   Begin WHERE-loop0: tab
6     Rewind         0     15    0                    0
7       Noop           0     0     0                    0   Begin WHERE-core
8       Column         0     0     1                    0   r[1]=tab.c
9       Found          1     14    1     1              0   key=r[1]
10      MakeRecord     1     1     2                    0   r[2]=mkrec(r[1])
11      IdxInsert      1     2     1     1              16  key=r[2]
12      ResultRow      1     1     0                    0   output=r[1]
13      Noop           0     0     0                    0   End WHERE-core
14    Next           0     7     0                    1
15    Noop           0     0     0                    0   End WHERE-loop0: tab
16    Halt           0     0     0                    0
17    Transaction    0     0     1     0              1   usesStmtJournal=0
18    Goto           0     1     0                    0
┌───┐
│ c │
├───┤
│ 1 │
└───┘