Delete duplicate rows with an extra character

(1) By Adam (astranberg) on 2021-11-07 15:56:30 [link]

Hello all,

I would love your help on this. I have a table where:

CREATE TABLE players(

        player_tag TEXT,

        update_date TEXT,

        max_trophies TEXT,

        UNIQUE(player_tag))


The problem is I have some player_tag's which have an extra character I'd like to delete.

      example: "#Y123456" and "Y123456" are both player_tags.


I would like to remove the # from all player tags (i.e. delete #Y654321 if Y654321 exists), but that of course will error with non-unique rows.

      UPDATE players SET player_tag = replace(player_tag,"#","")

So I'm trying to write a command to delete these duplicates, but I'm not sure how to differentiate where player_tag is coming from in this phase:

      DELETE FROM players WHERE EXISTS (SELECT player_tag FROM players WHERE player_tag = replace(original.player_tag,"#",""))


Appreciate anybody's help!
~Adam

(2) By Larry Brasfield (larrybr) on 2021-11-07 18:58:52 in reply to 1 [link]

Here is how to get the rowid for delete candidates:<code>
  SELECT x.rowid FROM players x, players y
  WHERE substr(x.player_tag,2)=y.player_tag
   AND substr(x.player_tag,1,1) IN ('#');
</code>. Calling that query "Extras", the delete is:<code>
  DELETE FROM players WHERE rowid IN (Extras);
</code>. It helps here that each row has a unique [rowid](https://sqlite.org/rowidtable.html).

(3) By Keith Medcalf (kmedcalf) on 2021-11-07 19:00:03 in reply to 1 [link]

One presumes you mean something like this:

```
DELETE FROM players o
      WHERE player_tag like '#%'
        AND EXISTS (
                    SELECT 1
                      FROM players
                     WHERE player_tag == substr(o.player_tag, 2)
                    )
;
```

The original, although it can be made to work, is flawed in that it processes all rows in the table, rather than just the rows that might need to be deleted.  If you limit the examiation to only those rows implies that there is always a '#' as the first character so there is no need to search for it.

However, the following statements might be faster:

```
UPDATE OR IGNORE players
             SET player_tag = substr(player_tag, 2)
           WHERE player_tag like '#%'
;
DELETE FROM players
      WHERE player_tag like '#%'
;
```

(4) By Keith Medcalf (kmedcalf) on 2021-11-07 19:05:52 in reply to 3

Note that you will need to do `pragma case_sensitive_like=1;` to use the unique index on player_tag (unless, of course, player_tag is case insensitive) ...

(5) By Adam (astranberg) on 2021-11-07 19:22:34 in reply to 3 [link]

I really like your last two ideas, except that I don't want to delete ALL player tags like '#%', I only want to delete those like '#%' if it already exists without the #.

Here's my current solution. Does somebody have a more efficient one?

SELECT  * FROM players
WHERE   '#' || player_tag
IN      (SELECT player_tag FROM players WHERE player_tag like '#%')

(6) By Keith Medcalf (kmedcalf) on 2021-11-07 20:54:50 in reply to 5 [link]

There are two statements.  

The first is an UPDATE statement which, for all rows where the player_tag startswith '#', removes the '#' character prefix.  For any row in which this change causes a conflict (duplicate player_tag after the update) the update is ignored.  This means that the only rows remaining where player_tag commences with '#' are duplicates to be deleted.

The second statement deletes all those duplicates.

(7) By Keith Medcalf (kmedcalf) on 2021-11-07 20:57:23 in reply to 6 [link]

`DELETE` is not a valid conflict resolution method, hence two statements are required.  (ie, you cannot do an `UPDATE OR DELETE`)

(8) By Adam (astranberg) on 2021-11-07 23:05:27 in reply to 6 [link]

Oh I understand. Very helpful - thanks!

may mark as solved (if that's a thing here)