Show duplicate rows sorted

(1) By anonymous on 2020-06-12 10:58:50 [link] [source]

Hello,

I am selecting duplicates in the "name" column of a table called "Records" using this query.

select * from Records tab_a join (select name from Records group by name having count(name) > 1) tab_b on tab_a.name = tab_b.name

The problem with this query is that it won't give me the rows sorted by "name" - which is what I want -, and I can't use a simple ORDER BY because the "name" column can contain unicode characters.

Is it possible to achieve this in pure SQL?

Thanks!

(2) By Ryan Smith (cuz) on 2020-06-12 11:55:52 in reply to 1 [link] [source]

Firstly, this would be the query that produces quicker results (as least for my test, considering I have no idea the size and quantity of columns, position of name in your Records table, or whether you also have an Index on name, etc.)

SELECT R.*
  FROM (SELECT name FROM Records GROUP BY name HAVING COUNT(*) > 1) AS D
  JOIN Records AS R ON R.name = D.name
 ORDER BY R.name COLLATE unicode

Secondly, the COLLATE ordering term above is custom for me, but there are standard loadable extensions for SQLite that's easy to obtain or compile yourself (if you are using an obscure OS/Processor) which will allow that ability with a typical COLLATE statement like the example above.

Best publicly available one I've used is the nunicode library, but there may be others. Google is your friend.

PS: I'm not sure if the collate is needed for the "GROUP BY name" also... I think the grouping works regardless, but may need testing.

(3) By anonymous on 2020-06-12 17:18:31 in reply to 2 [source]

Hello Ryan,

Thanks for your answer.

Do you think there is a method to get the duplicates sorted without using ORDER BY?

(4) By Larry Brasfield (LarryBrasfield) on 2020-06-12 18:11:25 in reply to 3 [link] [source]

There is no sorting which should be relied upon [a] without using ORDER BY or subjecting query results to an external processing step.

[a. There is plenty of ordering that happens without ORDER BY, as a consequence of insertion order, query optimization, and probably many other influences, but it cannot be relied upon outside of the "Try it and see if, By Golly, it works!" school of Programming for Bumblers. ]

(5) By Keith Medcalf (kmedcalf) on 2020-06-12 18:14:22 in reply to 2 [link] [source]

Why not just:

  select name
    from records
group by name collate unicode
  having count(*) > 1
order by name collate unicode
;

I would seem to me that the "group by" needs a collation as well since there is the possibility that different "binary" representations may equate to "equal" unicode representations. That is to say that there is more than one "binary" way to express the same "unicode" presentation.

(6) By anonymous on 2020-06-12 18:52:17 in reply to 5 [link] [source]

Thanks for your answer.

I want to output the duplicates, hence the JOIN.

(7) By anonymous on 2020-06-12 18:58:13 in reply to 5 [link] [source]

As for using collate unicode, I'm on Python3/sqlite3 - the module seems not to include "unicode".

I think it should be possible to register a user-defind collation function, though.

(8) By Keith Medcalf (kmedcalf) on 2020-06-12 19:23:45 in reply to 7 [link] [source]

Python. Simply create a collation in Python. It will be slow, but it will work. Assuming you are using the sqlite3 wrapper you do something like this:

def UnicodeCollate(test1, test2):
    return 1 if test1 > test2 else -1 if test1 < test2 else 0

conn = sqlite3.connect(...)
conn.create_collation('unicode', UnicodeCollate)

...

Mutatis Mutandis for any other python wrapper that lets you define collation sequences. You might have to "fiddle faddle" with your UnicodeCollation function to make sure the two parameters are actually unicode strings before doing the comparison or you might not get the results you want.