SQLite Forum

A name disambiguation challenge
Login

A name disambiguation challenge

(1.1) By Larry Brasfield (larrybr) on 2022-02-10 03:40:29 edited from 1.0 [link] [source]

(Edited to add one more requirement, at end.)

I've got a little puzzle to solve, with a few ways to do it that are not fully satisfactory. After seeing so many clever queries offered for folks who come here with similar requests, I am hopeful that somebody else can do better than I have managed, so far, for this problem.

Suppose I have a set of ordered names, RankedNames, an example of which is this CSV: rank,name 1,hippopotamus 2,pig 3,aligator 4,cow 5,cat 6,turkey 7,duck 8,cat 9,hippopotamus 10,horse 11,cat_08 12,sheep 13,goat 14,pig 15,cow

I want to treat the names as a set, without an incremental or looped algorithm that favors names early in the ordering over later ones. (In other words, the result should be logically the same if they are reordered.) A requirement of the treatment is that the treated names are unique. This query, SELECT count(DISTINCT treated_name)=count(treated_name) FROM TreatedRankedNames should return 1 (or true).

An objective, somewhat soft but not entirely disposable, is that the untreated names be retained and not be visually lost in whatever their treated form is. It would be good if the treated names were identical to the untreated names where they are distinct to begin with.

Another objective, also slightly soft, is that by looking at disambiguated duplicates, separated from their rank, one can tell where in the rank they were, if not absolutely then relatively. Use of the rank in the treatment is fine for this purpose.

A requirement (maybe too obvious to state) is that the query terminate, on its own. If some sort of recursive query reaches a limit, without reaching the uniqueness requirement, a fallback is alright where the original names are altered a lot (by character count) but remain intact.

It is preferrable, but not absolutely required, that the treatment is effected by a single query. Multiple applications of some DML in response to the uniqueness test failing are acceptable, provided the test can be known to pass after a smallish number of applications.

At least one solution suffers a shortcoming which ranks 5 8 and 11 help illustrate. That solution is to apply the rank as a suffix just to non-unique names. When this was done for ranks 5 and 8, it created a collision with the formerly unique name at rank 11. Any kind of systematic transformation which is designed, ad-hoc, to avoid aspects of a particular dataset, seems prone to this problem. If there is a transformation which systematically avoids this problem, and which is better than what I have1, I will be most appreciative.

Another requirement, which mainly goes to testability, is that if any kind of PRNG is used to create the alterations, it must be one which can be re-seeded. (Unfortunately, this rules out SQLite's random() function.)

This is not a toy puzzle. I have a legitimate use for a good solution.


  1. A crude solution is to extend all names in the set with sufficient '_' characters such that they are all the same length, then append the rank. This is kind of ugly and trashes a couple of the objectives.

(2) By Ryan Smith (cuz) on 2022-02-10 10:22:49 in reply to 1.1 [link] [source]

Hi Larry,

I'm not sure the problem statement is complete, or perhaps my comprehension is lacking, but if I put all of the above together in a set of logic rules, I find it exceedingly easy to achieve, meaning I am probably missing something.

As an example, I've simply added a calculated column to the suggested table which satisfies all the rules with zero iterative requirement, and looks rather pretty in a list:

CREATE TABLE ranked_names(
  id INTEGER PRIMARY KEY,
  rnk INTEGER NOT NULL UNIQUE,
  name TEXT NOT NULL,
  treated_name AS ('[' || substr('        ',length(rnk)) || rnk || ' ] ' || name)
);

  -- Adding data - I've mixed up the order here to be sure it doesn't matter.
INSERT INTO ranked_names(rnk, name) VALUES
 ( 1,'hippopotamus')
,(11,'cat_08')
,(14,'pig')
,( 4,'cow')
,( 8,'cat')
,(13,'goat')
,( 3,'alligator')
,(12,'sheep')
,( 5,'cat')
,(15,'cow')
,( 6,'turkey')
,( 7,'duck')
,(10,'horse')
,( 9,'hippopotamus')
,( 2,'pig')
;


SELECT treated_name FROM ranked_names

  -- treated_name               
  -- ---------------------------
  -- [        1 ] hippopotamus  
  -- [       11 ] cat_08        
  -- [       14 ] pig           
  -- [        4 ] cow           
  -- [        8 ] cat           
  -- [       13 ] goat          
  -- [        3 ] alligator      
  -- [       12 ] sheep         
  -- [        5 ] cat           
  -- [       15 ] cow           
  -- [        6 ] turkey        
  -- [        7 ] duck          
  -- [       10 ] horse         
  -- [        9 ] hippopotamus  
  -- [        2 ] pig           


SELECT count(DISTINCT treated_name)=count(treated_name) AS is_unique FROM ranked_names

  --   is_unique 
  -- ------------
  --       1     


  -- And here I ordered the query, just in case that is needed.
SELECT treated_name FROM ranked_names ORDER BY treated_name

  -- treated_name               
  -- ---------------------------
  -- [        1 ] hippopotamus  
  -- [        2 ] pig           
  -- [        3 ] alligator      
  -- [        4 ] cow           
  -- [        5 ] cat           
  -- [        6 ] turkey        
  -- [        7 ] duck          
  -- [        8 ] cat           
  -- [        9 ] hippopotamus  
  -- [       10 ] horse         
  -- [       11 ] cat_08        
  -- [       12 ] sheep         
  -- [       13 ] goat          
  -- [       14 ] pig           
  -- [       15 ] cow           

The rules I could glean from your writing are:

  • Treated like a set - Check
  • Should be unique - Check
  • Untreated names retained and distinguishable - Check
  • Be able to glean the Rank easily - Check
  • Treatment effected by single Query - Check++ (No query even needed)
  • Query must terminate - Check (not a problem)
  • Items like 5, 8 and 11 should be distinguishable and not collide - Check
  • No PRNG used - Check

Seems good to me. Did you have some other personal rules not stated? Like

  • Absolute size efficiency perhaps? I can see this solution able to improve in that arena.
  • Easy to type on a keyboard? It can do better there too, but it's not a stated requirement.
  • Ranks may not be unique? This is not stated explicitly, and can break this format if true.

Some notes on why I chose the format:

  • It's visually very clear,
  • fully reversible by simple programmatic means
  • Looks good and formatted (only important if it will be viewed like above)
  • Do not clutter or obscure the numbers or words, such as with underscores
  • Cannot have leading spaces or Zeroes if ever it will be copied to Excel or the like
  • You cannot control the characters that will appear in names, but you can for numbers, so starting with the number and some formatting characters around it (rather than the names) makes things much safer, easier and parse-able.
  • The spacing around the number allows for numbers up to Billions (without changing the bracket width), it can be shortened if you are sure the ranks would remain lower, or increased if ranks will go higher.

If I have missed some rules/preferences, please elaborate on them.

Here is another version in case Size efficiency does matter - all other rules are still adhered to, though it's slightly less easily readable and doesn't sort numerically.

  -- treated_name     
  -- -----------------
  -- [1]hippopotamus  
  -- [11]cat_08       
  -- [14]pig          
  -- [4]cow           
  -- [8]cat           
  -- [13]goat         
  -- [3]alligator      
  -- [12]sheep        
  -- [5]cat           
  -- [15]cow          
  -- [6]turkey        
  -- [7]duck          
  -- [10]horse        
  -- [9]hippopotamus  
  -- [2]pig           

Cheers!

PS: The format can be more easily achieved with "printf()" but I suppose performance should be tested before deciding.

(4) By Larry Brasfield (larrybr) on 2022-02-10 11:47:40 in reply to 2 [link] [source]

Thanks, Ryan, for stepping up on this. Responding to your points, reordered:

The rank values are unique and sequential starting with 1, always.

I did not want to make the problem overly hard, so did not set out as a requirement this highly valued objective: "It would be good if the treated names were identical to the untreated names where they are distinct to begin with." I should have stated this less ambiguously. The treated set shoulda leave the untreated individual names as-is unless it is necessary to alter them for uniqueness. And, if the set had no duplicates to begin with, it should not be altered at all.

The example set, of animal kinds, is an example of the output from a query meeting that criterion but with one of the duplicated/altered names maliciously added. ('cat_08' came from a set that had 'cat' duplicated. It illustrates difficulties with leaving the names alone when not presently duplicated -- they can then collide with the treated result of presently duplicated names.)

The application is this: A set of column names is provided by a user. But the user is not available to be told, "You've got duplicates. Try again!" Yet we want the user's heirs to have little trouble and minimal distraction when trying to recognize the fixed-up names when familiar with the originals. The fix-up is "minimal", in some vague (but real) sense.


a. Treat should as a retractable must. We really want it, but are willing to bend in the face of reality as the reality becomes clear.

(11) By Ryan Smith (cuz) on 2022-02-10 16:19:50 in reply to 4 [link] [source]

That makes sense.

My next suggestion then, which abides by all the rules and wishes[1] then becomes:


CREATE TABLE ranked_names(
  id INTEGER PRIMARY KEY,
  rnk INTEGER NOT NULL UNIQUE,
  name TEXT NOT NULL
);

INSERT INTO ranked_names(rnk, name) VALUES
 ( 1,'hippopotamus')
,(11,'cat_8')
,(14,'pig')
,( 4,'cow')
,( 8,'cat')
,(13,'goat')
,( 3,'alligator')
,(12,'sheep')
,( 5,'cat')
,(16,'cat')
,(15,'cow')
,( 6,'turkey')
,( 7,'duck')
,(10,'horse')
,( 9,'hippopotamus')
,( 2,'pig')
;

WITH FDups(id, name) AS (
    SELECT DISTINCT B.id, B.name FROM ranked_names AS A JOIN ranked_names AS B ON B.id > A.id AND B.name = A.name
), FDec(idx, decor) AS (
    SELECT 1, '_' UNION ALL SELECT idx+1, decor || '_' FROM FDec WHERE idx < 50
), FSet(id, name, di) AS (
    SELECT DISTINCT id, name, 1
      FROM ranked_names
      WHERE id NOT IN (SELECT id FROM FDups)
    UNION ALL
    SELECT FDups.id, FDups.name || FDec.decor || R.rnk, di + 1
      FROM FSet
      JOIN FDec ON FDec.idx = di
      JOIN FDups ON FDups.id > FSet.id AND FDups.name == FSet.name
      JOIN ranked_names AS R ON R.id = FDups.id
)
SELECT R.rnk, R.name, FSet.name
  FROM FSet
  JOIN ranked_names AS R ON R.id = FSet.id
 ORDER BY R.name, R.rnk
;


  --  rnk|name          |name            
  -- ----|--------------|----------------
  --   3 |alligator     |alligator       
  --   5 |cat           |cat_5           
  --   8 |cat           |cat             
  --  16 |cat           |cat_16          
  --  11 |cat_8         |cat_8           
  --   4 |cow           |cow             
  --  15 |cow           |cow_15          
  --   7 |duck          |duck            
  --  13 |goat          |goat            
  --   1 |hippopotamus  |hippopotamus    
  --   9 |hippopotamus  |hippopotamus_9  
  --  10 |horse         |horse           
  --   2 |pig           |pig_2           
  --  14 |pig           |pig             
  --  12 |sheep         |sheep           
  --   6 |turkey        |turkey          


I believe this matches the minimalist change view, and works across the board.

[1] There is one caveat I need to mention - One requirement was that the first duplicate of a set of duplicates be left unchanged, and another wish was that the changes do not in any way depend on the original order... well those two requirements are mutually exclusive. If the set is truly "unordered" then there is no concept of a "first" duplicate. You can have the first duplicate unmolested, OR you can have complete order independence, but not both.

I've run the above with adding a lot of duplicates and crafting them as pernicious as possible to try trip it up, so far it works without hitch. It will get a bit much for this forum, but I will paste one such output below to see how duplication-upon-duplication is mitigated or avoided. I've marked which ones are treated. Of course disassembly of the CTE's will be more informative.


CREATE TABLE ranked_names(
  id INTEGER PRIMARY KEY,
  rnk INTEGER NOT NULL UNIQUE,
  name TEXT NOT NULL
);

  -- Very convoluted names
INSERT INTO ranked_names(rnk, name) VALUES
 ( 1,'hippopotamus')
,(11,'cat_8')
,(14,'pig')
,( 4,'cow')
,( 8,'cat')
,(13,'goat')
,( 3,'alligator')
,(12,'sheep')
,( 5,'cat')
,(16,'cat')
,(15,'cow')
,( 6,'turkey')
,( 7,'duck')
,(10,'horse')
,( 9,'hippopotamus')
,( 2,'pig')
,(17,'cat_8')
,(18,'cat__8')
,(19,'cat_8')
,(20,'cat_08')
,(21,'cat8')
,(22,'cat___8')
,(23,'cow_')
,(24,'cow__')
,(25,'cow___')
;

WITH FDups(id, name) AS (
    SELECT DISTINCT B.id, B.name FROM ranked_names AS A JOIN ranked_names AS B ON B.id > A.id AND B.name = A.name
), FDec(idx, decor) AS (
    SELECT 1, '_' UNION ALL SELECT idx+1, decor || '_' FROM FDec WHERE idx < 50
), FSet(id, name, di) AS (
    SELECT DISTINCT id, name, 1
      FROM ranked_names
      WHERE id NOT IN (SELECT id FROM FDups)
    UNION ALL
    SELECT FDups.id, FDups.name || FDec.decor || R.rnk, di + 1
      FROM FSet
      JOIN FDec ON FDec.idx = di
      JOIN FDups ON FDups.id > FSet.id AND FDups.name == FSet.name
      JOIN ranked_names AS R ON R.id = FDups.id
)
SELECT R.rnk, R.name, FSet.name, IFNULL(NULLIF((R.name <> FSet.name),0),'') AS is_treated
  FROM FSet
  JOIN ranked_names AS R ON R.id = FSet.id
 ORDER BY R.name, R.rnk
;

  --  rnk|name          |name            |is_treated   
  -- ----|--------------|----------------|---------
  --   3 |alligator     |alligator       |         
  --   5 |cat           |cat_5           |    1    
  --   8 |cat           |cat             |         
  --  16 |cat           |cat_16          |    1    
  --  21 |cat8          |cat8            |         
  --  20 |cat_08        |cat_08          |         
  --  11 |cat_8         |cat_8           |         
  --  17 |cat_8         |cat_8_17        |    1    
  --  19 |cat_8         |cat_8_19        |    1    
  --  18 |cat__8        |cat__8          |         
  --  22 |cat___8       |cat___8         |         
  --   4 |cow           |cow             |         
  --  15 |cow           |cow_15          |    1    
  --  23 |cow_          |cow_            |         
  --  24 |cow__         |cow__           |         
  --  25 |cow___        |cow___          |         
  --   7 |duck          |duck            |         
  --  13 |goat          |goat            |         
  --   1 |hippopotamus  |hippopotamus    |         
  --   9 |hippopotamus  |hippopotamus_9  |    1    
  --  10 |horse         |horse           |         
  --   2 |pig           |pig_2           |    1    
  --  14 |pig           |pig             |         
  --  12 |sheep         |sheep           |         
  --   6 |turkey        |turkey          |         


(12) By Larry Brasfield (larrybr) on 2022-02-10 17:17:23 in reply to 11 [link] [source]

Please see my posts 7 and 10. I have concluded that it is an artifical and counter-productive constraint to insist that names change only when necessary to avoid duplicates. If some names are to be doctored, it is just as well to doctor others if need be.

On the "leave first member of replicated set alone": Which member would that be? They are ordered only in the sense that, ultimately, they will appear in their original order. But I think it is positively misleading to disguise the fact that a name was duplicated in the result. Better that all dupes get doctored. After a .import, somebody needs to decide what the pig_06 and pig_08 columns should become after the data/schema is further munged. It's better, IMO, that the de-duplication be obvious at that stage for all columns that needed it (or looked that way.)

(14.1) By Ryan Smith (cuz) on 2022-02-10 23:26:28 edited from 14.0 in reply to 12 [link] [source]

Very well, that simplifies things somewhat and will make the query much faster.

This delivers on the refined requirements I believe: (Whether you use it or not, this was fun in an otherwise bland day - thanks!)

CREATE TABLE ranked_names(
  id INTEGER PRIMARY KEY,
  rnk INTEGER NOT NULL UNIQUE,
  name TEXT NOT NULL
);

INSERT INTO ranked_names(rnk, name) VALUES
 ( 1,'hippopotamus')
,(11,'cat_8')
,(14,'pig')
,( 4,'cow')
,( 8,'cat')
,(13,'goat')
,( 3,'alligator')
,(12,'sheep')
,( 5,'cat')
,(16,'cat')
,(15,'cow')
,( 6,'turkey')
,( 7,'duck')
,(10,'horse')
,( 9,'hippopotamus')
,( 2,'pig')
;

WITH FDec(idx, decor) AS (
    SELECT 1, '_' UNION ALL SELECT idx+1, decor || '_' FROM FDec WHERE idx < 10
), FDups(id, name) AS (
    SELECT DISTINCT B.id, B.name
      FROM ranked_names AS A
      JOIN ranked_names AS B ON B.id <> A.id AND B.name = A.name
), FOpt(id, name, di) AS (
    SELECT R.id, R.name || decor || R.rnk, 1
      FROM FDups
      JOIN FDec ON idx = 1
      JOIN ranked_names AS R ON R.id = FDups.id
    UNION
    SELECT R.id, R.name || decor || R.rnk, di + 1
      FROM FOpt
      JOIN FDec ON idx = di + 1
      JOIN ranked_names AS R ON R.id = FOpt.id
), FSet(id, name) AS (
    SELECT R.id, R.name
      FROM ranked_names AS R
     WHERE R.id NOT IN (SELECT id FROM FDups)
    UNION ALL
    SELECT FDups.id, (SELECT FOpt.name FROM FOpt
                       WHERE FOpt.id=FDups.id AND FOpt.name NOT IN (SELECT name FROM ranked_names)
		       ORDER BY FOpt.di LIMIT 1)
      FROM FDups
)
SELECT R.name, FSet.name, IFNULL(NULLIF((FSet.name<>R.name),0),'') AS treated
  FROM FSet
  JOIN ranked_names AS R ON R.id = FSet.id
 ORDER BY R.name, FSet.name
;


  -- name          |name            |treated
  -- --------------|----------------|-------
  -- alligator     |alligator       |       
  -- cat           |cat_16          |   1   
  -- cat           |cat_5           |   1   
  -- cat           |cat__8          |   1   
  -- cat_8         |cat_8           |       
  -- cow           |cow_15          |   1   
  -- cow           |cow_4           |   1   
  -- duck          |duck            |       
  -- goat          |goat            |       
  -- hippopotamus  |hippopotamus_1  |   1   
  -- hippopotamus  |hippopotamus_9  |   1   
  -- horse         |horse           |       
  -- pig           |pig_14          |   1   
  -- pig           |pig_2           |   1   
  -- sheep         |sheep           |       
  -- turkey        |turkey          |       

and my version of added malicious test values to try break it:

CREATE TABLE ranked_names(
  id INTEGER PRIMARY KEY,
  rnk INTEGER NOT NULL UNIQUE,
  name TEXT NOT NULL
);

INSERT INTO ranked_names(rnk, name) VALUES
 ( 1,'hippopotamus')
,(11,'cat_8')
,(14,'pig')
,( 4,'cow')
,( 8,'cat')
,(13,'goat')
,( 3,'alligator')
,(12,'sheep')
,( 5,'cat')
,(16,'cat')
,(15,'cow')
,( 6,'turkey')
,( 7,'duck')
,(10,'horse')
,( 9,'hippopotamus')
,( 2,'pig')
,(17,'cat_8')
,(18,'cat__8')
,(19,'cat_8')
,(20,'cat_08')
,(21,'cat8')
,(22,'cat___8')
,(23,'cow_')
,(24,'cow__')
,(25,'cow___')
;

WITH FDec(idx, decor) AS (
    SELECT 1, '_' UNION ALL SELECT idx+1, decor || '_' FROM FDec WHERE idx < 10
), FDups(id, name) AS (
    SELECT DISTINCT B.id, B.name
      FROM ranked_names AS A
      JOIN ranked_names AS B ON B.id <> A.id AND B.name = A.name
), FOpt(id, name, di) AS (
    SELECT R.id, R.name || decor || R.rnk, 1
      FROM FDups
      JOIN FDec ON idx = 1
      JOIN ranked_names AS R ON R.id = FDups.id
    UNION
    SELECT R.id, R.name || decor || R.rnk, di + 1
      FROM FOpt
      JOIN FDec ON idx = di + 1
      JOIN ranked_names AS R ON R.id = FOpt.id
), FSet(id, name) AS (
    SELECT R.id, R.name
      FROM ranked_names AS R
     WHERE R.id NOT IN (SELECT id FROM FDups)
    UNION ALL
    SELECT FDups.id, (SELECT FOpt.name FROM FOpt
                       WHERE FOpt.id=FDups.id AND FOpt.name NOT IN (SELECT name FROM ranked_names)
		       ORDER BY FOpt.di LIMIT 1)
      FROM FDups
)
SELECT R.name, FSet.name, IFNULL(NULLIF((FSet.name<>R.name),0),'') AS treated
  FROM FSet
  JOIN ranked_names AS R ON R.id = FSet.id
 ORDER BY R.name, FSet.name
;


  -- name          |name            |treated
  -- --------------|----------------|-------
  -- alligator     |alligator       |       
  -- cat           |cat_16          |   1   
  -- cat           |cat_5           |   1   
  -- cat           |cat____8        |   1   
  -- cat8          |cat8            |       
  -- cat_08        |cat_08          |       
  -- cat_8         |cat_8_11        |   1   
  -- cat_8         |cat_8_17        |   1   
  -- cat_8         |cat_8_19        |   1   
  -- cat__8        |cat__8          |       
  -- cat___8       |cat___8         |       
  -- cow           |cow_15          |   1   
  -- cow           |cow_4           |   1   
  -- cow_          |cow_            |       
  -- cow__         |cow__           |       
  -- cow___        |cow___          |       
  -- duck          |duck            |       
  -- goat          |goat            |       
  -- hippopotamus  |hippopotamus_1  |   1   
  -- hippopotamus  |hippopotamus_9  |   1   
  -- horse         |horse           |       
  -- pig           |pig_14          |   1   
  -- pig           |pig_2           |   1   
  -- sheep         |sheep           |       
  -- turkey        |turkey          |       


Edit: Fixed a tested omission in the query.

(15) By Larry Brasfield (larrybr) on 2022-02-11 01:51:35 in reply to 14.1 [link] [source]

Ryan and Mark: Thanks much for your thoughts and SQL examples on this problem. I benefited from both, even though the SQL now checked in was not lifted from posts here.a I've been busy getting something to work in the way I pseudo-coded earlier today. Thanks also to others for helping to get my thinking past some artificial obstacles.

Here are some pastes taken from a session which demonstrate operation
showing the treated (or "doctored") results, then the input:

$ ./sqlite3 :memory: -cmd '.import -csv stuff.csv ImpTrial' ... sqlite> select name from pragma_table_xinfo('ImpTrial'); hippopotamus_01 pig_02 aligator cow_04 cat_05 turkey duck cat_08 hippopotamus_09 horse cat_11 sheep goat pig_14 cow_15 cow_16   sqlite> .shell tr , "'\n'" '<stuff.csv' "hippopotamus" "pig_2" "aligator" "cow_04" "cat" "turkey" "duck" "cat_04_99_1" "hippopotamus_1" "horse" "cat_2" "sheep" "goat" "pig" "cow" "cow" sqlite\>

The revision has effected the pseudo-code in post 7.

The SQL experts here may find the queries and D{D,M}L embedded in zAutoColumn() to be amusing. I'm sure that those who are adept with CTEs could compress it considerably and get the same job done, perhaps with no more than a single query on the ColNames table that is created and used within zAutoColumn. Suggestions along those lines are welcomeb, particularly ones accompanied by some explanation of what is going on in the query. Within the limits of my agreement with the SQLite organization, I will happily adopt suggestions when I grok them well enough to write something similar. (I cannot paste code into the sources unless it comes from somebody bound by the Contributor's Agreement.)

If anybody is interested, I would not mind having a good argument about why some doctoring beyond the minimum possible is a good approach, with respect to both user experience and implementation/test issues.


a. While I did not "lift" code from this thread, I studied it to refresh my memory about CTEs and associated subqueries.

b. If anybody becomes inspired to improve upon my SQL, the regexp shortcut should not be among the means of volume reduction.

(16.1) By Mark Lawrence (mark) on 2022-02-11 09:25:47 edited from 16.0 in reply to 15 [link] [source]

Your original set of requirements included this (in my opinion hard) one:

... somewhat soft but not entirely disposable, is that the untreated names be retained and not be visually lost in whatever their treated form is. It would be good if the treated names were identical to the untreated names where they are distinct to begin with

Which you seem to have later changed your view on:

Please see my posts 7 and 10. I have concluded that it is an artifical and counter-productive constraint to insist that names change only when necessary to avoid duplicates. If some names are to be doctored, it is just as well to doctor others if need be.

Based on this rationale:

This avoids the unsightly and potentially confusing appearance of doctoring upon doctoring. The rationale is: Such doctoring was never meaningful to humans anyway, (independent of its context in that duplicate-containing set), and is likely extraneous at any rate, so it is dispensable under the circumstances (where nobody cared enough to do their own name-doctoring.)

I don't personally find this argument strong enough to override the original requirement. You can't know the structure/importance/meaning of the original column names, and throwing good information away could be more painful to the user than not allowing duplicates in the first place.

With your current checked in code I consider this scenario worse than the status quo:

$ (tr , "\n" <x.csv && ./sqlite3 :memory: -cmd '.import --csv x.csv x' .dump) | sed -e 's/^/    /'

item
price
price_2021
price_2022
..ManyColumnsLater...
price

-- Loading resources from /home/mark/.sqliterc PRAGMA foreign_keys=OFF; BEGIN TRANSACTION;

CREATE TABLE IF NOT EXISTS "x"("item" TEXT, "price_2" TEXT, "price_3" TEXT, "price_4" TEXT,
 "..ManyColumnsLater..." TEXT, "price_6" TEXT);    COMMIT;

Now none of those price columns are useable, particularly as they cannot be easily found back in the original file. (Excel uses alphabetic column naming for example, not that I'm suggest that should be used here).

I think we all got a bit focused on the mechanisms before we all agreed on the principles of the outcome.

Edited: By the way, I would prefer this auto-naming to remain behind an --option, so that I know I can't import duplicate columns by default.

(17.1) By Larry Brasfield (larrybr) on 2022-02-11 13:07:23 edited from 17.0 in reply to 16.1 [link] [source]

(Post amended at end to reflect resolution.)

You can't know the structure/importance/meaning of the original column names, and throwing good information away could be more painful to the user than not allowing duplicates in the first place.

You make a compelling point here.

[good example] ... Now none of those price columns are useable, ...

I might argue that point. Most of why I wanted to get the column ordinals into the modified column names, and the ones that perhaps did not absolutely require a rename, is so that there is an easily predicted and understood mapping back to the source of the data, which is the CSV input.

However, despite that slight overstatement, I see that auto-rename has the potential to become part of a troublesome, cascading error sequence that many of us have seen, too often. I have often sneered at "It just works!" as a pernicious, aggravating lie. When it is false, it can be badly false and hard to remedy.

I think we all got a bit focused on the mechanisms before we all agreed on the principles of the outcome.

I do take exception to that, except in its hyper-technical sense. I gave much thought to what is to be achieved at the feature specification level, and why, before settling on the de-doctoring approach.

... prefer this auto-naming to remain behind an --option ... [not] by default

I am convinced this is the right answer. In an interactive session, the .import (without the --rename option) can fail, with an error message which mentions the option. It is easily redone with the option. But in an automated or scripted context, where the shell is being used as part of a larger process, the default should be that garbage in produces error message(s) and return status out (when requested.a) This is needed to avoid the all-too-common situation where something has gone wrong and an investigation must be conducted to discover where because some error(s) got swallowed or (not really) "fixed".

(Amendments follow.)

After discussion, we have settled upon a compromise between some competing objectives. If any renaming is done, it will cause an informative message to be issued. This will go to stderr unless the input device is the console.b There will be no option to prevent renaming. The whole point of having .import auto-create a table is its simplicity.

The feature will not be documented. Auto-create-table has been intended for convenient, ad-hoc, interactive use. (Hence its laxity on incoming column count discrepancies.) Its behavior may change, and should not be relied upon for use in scripting. The way to get predictable .import with errors on bad input data is to create the table prior to .import, and use appropriate checks.

At some future time, I will be getting shell extensibility ready for prime time. It will include the ability to replace or augment existing importers with ones of the user's choice. Those can enforce whatever strictures are desired (and worth the effort.)


a. This is why there is a -bail option and why we are careful to return error process exit status when that option takes effect.

b. Or, if not the console, a character device. So, when driving the CLI from your serial port, be prepared for prompts and chatter to go to stdout.

(21) By Larry Brasfield (larrybr) on 2022-02-14 01:35:00 in reply to 16.1 [link] [source]

The way this will work alleviates your concerns, I think. Here is the algorithm now on the auto-column branch tip: With each name that is not unique in the incoming names: append, as a suffix, '_' and the column ordinal rendered with Nz leading 0's choose Nz as the minimum value which: makes all suffixes the same width and suffices to make all result columns unique.

This leaves incoming column names untouched except for appended suffixes. The suffixes are only as ugly/in-ones-face as they need to be to get the job done. Only non-unique column names are altered in any way. Finally, the suffixes themselves help identify which incoming column was renamed.a

Finally, if any renaming is done, the .import completes with a message saying that renames were done and what they were.


a. This identification is not worth much for reasonable incoming data. But for incoming data with an absurd number of columns, or lots of duplication, it may help a human to correlate their new table columns with what came in.

(18) By anonymous on 2022-02-12 05:16:27 in reply to 15 [source]

So how well does your improved .import handle the CSV below?

COW,Cow,cow
1,2,3

(19) By Larry Brasfield (larrybr) on 2022-02-12 10:51:52 in reply to 18 [link] [source]

Thanks for pointing out this (formerly) problematic case.

Fixed a few minutes ago.

(20) By anonymous on 2022-02-12 18:24:54 in reply to 15 [link] [source]

The de-doctoring transformation boils down to this:

    if the name ends with one or more decimal digits preceded by an underscore then
        strip all trailing decimal digits and underscores from the name
    else
        don't touch the name

This can be expressed more succinctly.

    case when glob('*[0-9]',name) and glob('*_',rtrim(name,'0123456789')) then
        rtrim(name,'0123456789_')
    else
        name
    end

(22) By Larry Brasfield (larrybr) on 2022-02-14 01:49:54 in reply to 20 [link] [source]

I had semi-forgotten glob's character-set matching skill. It has proven useful to eliminate a recursive CTE, although I have not utilized the all-at-once stripping you suggest. (I had wanted, and still do want, the option of enabling a build-time option to replace the dedoctoring/alteration logic with something not in the controlled source.)

At any rate, thanks for the tip.

(3) By Mark Lawrence (mark) on 2022-02-10 11:13:00 in reply to 1.1 [link] [source]

I'm not sure I understood the requirements, but given Ryan's example I would propose either of the following:

rank  name          treated             treated2
----  ------------  ------------------  -------------------
1     hippopotamus        hippopotamus  hippopotamus  (1/2)
2     pig                 pig           pig           (1/2)
3     aligator            aligator      aligator
4     cow                 cow           cow           (1/2)
5     cat                 cat           cat           (1/2)
6     turkey              turkey        turkey
7     duck                duck          duck
8     cat           (2/2) cat           cat           (2/2)
9     hippopotamus  (2/2) hippopotamus  hippopotamus  (2/2)
10    horse               horse         horse
11    cat_08              cat_08        cat_08
12    sheep               sheep         sheep
13    goat                goat          goat
14    pig           (2/2) pig           pig           (2/2)
15    cow           (2/2) cow           cow           (2/2)

Generated as follows (please excuse the quick and dirty formatting):

create table items(
    rank integer,
    name text
);
insert into items values
    (1,'hippopotamus'),
    (2,'pig'),
    (3,'aligator'),
    (4,'cow'),
    (5,'cat'),
    (6,'turkey'),
    (7,'duck'),
    (8,'cat'),
    (9,'hippopotamus'),
    (10,'horse'),
    (11,'cat_08'),
    (12,'sheep'),
    (13,'goat'),
    (14,'pig'),
    (15,'cow');

with counted as (
    select
        rank,
        name,
        row_number() OVER (partition by name) as r
        ,sum(1) OVER(partition by name) as s
    from items
    order by name,rank
), limits as (
    select max(length(c.r) + length(c.s)) as m,
    max(length(c.name)) as m2
    from counted c
)
select
    c.rank,
    c.name,
    printf(
    '%'|| (l.m+3) || 's %s',
        case when c.s > 1 and c.r > 1
        then '(' || c.r || '/' || c.s || ')'
        else ''
        end,
        c.name) as treated,
    printf(
    '%-'|| (l.m2+1) || 's %s', c.name,
        case when c.s > 1
        then '(' || c.r || '/' || c.s || ')'
        else ''
        end
        ) as treated2
from counted c, limits l
order by 1
;

(5) By Larry Brasfield (larrybr) on 2022-02-10 12:37:06 in reply to 3 [link] [source]

That's nice for the altered names. Do you think it can be made to work by not altering the original names in any way, provided that it is not necessary to do so?

What I'm after is a minimal column renaming for scenarios where the originals are generally valued. It is only when some of them are not distinct that any renaming at all should be done. This is for an enhancement to the shell's .import meta-command, to happen auto-magically. (No .import retry; it just works, with zero impact upon proper incoming column names.)

(13.1) By Mark Lawrence (mark) on 2022-02-10 21:14:25 edited from 13.0 in reply to 5 [link] [source]

If (recursive) triggers are an option then the following should guarantee unique column names, gracefully:

  1. original name on first seen
  2. original name + '_$COUNT' on second seen, but
  3. add additional '_x' (+ '_x, ...) when original or generated name matches generated name 2.

Code:

pragma recursive_triggers=1;

create table items(
    rank integer,
    name text
);
create table treated(
    rank integer,
    name text unique
);

create trigger bi_treated before insert on treated
for each row when exists
    (select 1 from treated where name=NEW.name)
begin
    insert into treated(rank,name)
    values( NEW.rank, NEW.name || '_x');
    select raise(ignore);
end;

insert into items values
    (1,'hippopotamus'),
    (2,'pig'),
    (3,'aligator'),
    (4,'cow'),
    (5,'cat'),
    (6,'turkey'),
    (7,'duck'),
    (8,'cat'),
    (9,'hippopotamus'),
    (10,'horse'),
    (11,'cat_2'),
    (12,'sheep'),
    (13,'goat'),
    (14,'pig'),
    (15,'cow');

with a as (
    select
        rank,
        name,
        row_number() OVER (partition by name) as c
    from items
    order by name,rank
)
insert into treated(rank,name)
select
    a.rank,
    a.name ||
        case when a.c > 1
        then '_' || a.c
        else ''
        end AS name
    from a
;

select * from treated order by rank;

-- rank  name
-- ----  --------------
-- 1     hippopotamus
-- 2     pig
-- 3     aligator
-- 4     cow
-- 5     cat
-- 6     turkey
-- 7     duck
-- 8     cat_2
-- 9     hippopotamus_2
-- 10    horse
-- 11    cat_2_x
-- 12    sheep
-- 13    goat
-- 14    pig_2
-- 15    cow_2

(6) By anonymous on 2022-02-10 13:10:38 in reply to 1.1 [link] [source]

To avoid treated names clashing with untreated ones, select a name for treatment if it's non-unique or if it looks like a treated name.

create table RankedNames (
    rank integer primary key,
    name text not null);

insert into RankedNames(rank,name) values
    (1,'hippopotamus'),
    (2,'pig'),
    (3,'aligator'),
    (4,'cow'),
    (5,'cat'),
    (6,'turkey'),
    (7,'duck'),
    (8,'cat'),
    (9,'hippopotamus'),
    (10,'horse'),
    (11,'cat_08'),
    (12,'sheep'),
    (13,'goat'),
    (14,'pig'),
    (15,'cow');

create table TreatedRankedNames (
    rank integer primary key,
    name text not null,
    treated_name text not null unique);

with needs_treatment(name) as
    (select name from RankedNames
        group by name
        having count(*)>1
            or name regexp '_\d+$')
    insert into TreatedRankedNames (rank,name,treated_name)
        select rank,name,
                case when name in needs_treatment then
                    name || '_' || cast(rank as text)
                else
                    name
                end
            from RankedNames;

This gives you (at least on my computer):

+------+--------------+----------------+
| rank |     name     |  treated_name  |
+------+--------------+----------------+
| 1    | hippopotamus | hippopotamus_1 |
| 2    | pig          | pig_2          |
| 3    | aligator     | aligator       |
| 4    | cow          | cow_4          |
| 5    | cat          | cat_5          |
| 6    | turkey       | turkey         |
| 7    | duck         | duck           |
| 8    | cat          | cat_8          |
| 9    | hippopotamus | hippopotamus_9 |
| 10   | horse        | horse          |
| 11   | cat_08       | cat_08_11      |
| 12   | sheep        | sheep          |
| 13   | goat         | goat           |
| 14   | pig          | pig_14         |
| 15   | cow          | cow_15         |
+------+--------------+----------------+

(7) By Larry Brasfield (larrybr) on 2022-02-10 14:38:18 in reply to 6 [link] [source]

Thanks for that idea. I think I'll adapt it as follows (pseudo-code): if {any name-doctoring is needed} for each {name that looks doctored} undoctor name for each {name that is not unique in the set} doctor name .

This avoids the unsightly and potentially confusing appearance of doctoring upon doctoring. The rationale is: Such doctoring was never meaningful to humans anyway, (independent of its context in that duplicate-containing set), and is likely extraneous at any rate, so it is dispensable under the circumstances (where nobody cared enough to do their own name-doctoring.)

As to what "doctoring" means exactly, I favor something which does not require quoted identifiers. That's still a'pondering.

(9) By Bill Wade (billwade) on 2022-02-10 14:58:32 in reply to 7 [link] [source]

If you apply your pseudo-code, your cat_08 example may longer apply.

If looks doctored(x) means x has "_digits" suffix cat_08 looks doctored, so it gets undoctored to cat, meaning that 8,cat is allowed to be doctored to be cat_08.

Here I am assuming that "rank" is always a whole number (or at least never contains '_').

(10) By Larry Brasfield (larrybr) on 2022-02-10 15:53:11 in reply to 9 [link] [source]

If you apply your pseudo-code, your cat_08 example may longer apply.

Well, that's the point of the undoctoring. Maybe I misapprehend your point.1 The point of having cat_08 was to raise an issue, one which is dealt with by undoctoring first or zealous doctoring or doctoring upon doctoring (ad semi-infinitum.)

If looks doctored(x) means x has "_digits" suffix cat_08 looks doctored, so it gets undoctored to cat, meaning that 8,cat is allowed to be doctored to be cat_08.

Yes, exactly. If a doctored-like name set is presented, but its values are distinct, fine. But otherwise, they all get fixed and look similar afterward but with the "right" doctoring. User-supplied names are sacrosanct, until proven to be unsuitable as table column names. Then, only their mnemonic or un-algorithmic portion is sacrosanct. Those who are really picky about their random-garbage names, all thrown together somehow with duplication have a simple option: Do a better job, or create their own table with their preferred doctoring.

Here I am assuming that "rank" is always a whole number (or at least never contains '_').

Yes, although I need to change the name. It's a keyword in some other SQL-munchers.


  1. I sometimes have difficulty with the obvious. When somebody says something so perfectly obvious that it need not be said, I spend time wondering if they were trying to somehow qualify it. Especially if it was said with language having the slightest ambiguity. It's a wonder I'm still married.

(8) By Bill Wade (billwade) on 2022-02-10 14:41:34 in reply to 1.1 [link] [source]

I'll add some goals in the same spirit

0) I am assuming that rank is DISTINCT and not null (or at most one null).

1) If one or more names were cat, exactly one treated_name will be cat. This rule does not apply to names that were null.

2) The original name is always a prefix of the treated name.

3) If the name was treated, the treated name will have a suffix of _rank.

4) More specifically, if 8,cat gets treated, the treated_name will be cat_8, unless that was already a name, in which case the treated_name will be LIKE cat_%_8, with a reasonably short match for the %.

From my rules, your resulting rank,treated_name will include either 2,pig_2 and 14,pig, or they will include 2,pig and 14,pig_14. From your "I want to treat ..." clause, it sounds like you want those two outcomes to be equally likely. All I will do is an UPDATE OR IGNORE, so that one of those keeps pig, but which one is up to sqlite.

Process:

1) Compute a random seed (one seed for the entire table).

2) For each row, compute column hash = cryptographic_hash(seed||rank||name). If you are really paranoid, check for duplicates in the hash, and append a small integer to make them unique.

3) Put a UNIQUE constraint on treated name. Initialize all treated names to null.

4.1) To the extent possible, reuse existing names, but no name more than once. UNIQUE constraint will prevent duplicates (except for null, which will get replaced later)
    UPDATE OR IGNORE tablename SET treated_name = name;

4.2) otherwise if name_rank is not already a treated_name use it for treated_name. If name or rank can be null, the SET expression would be a bit more complicated.
    UPDATE OR IGNORE tablename SET treated_name = name||'_'||rank WHERE treated_name IS NULL;

4.3) otherwise find the shortest prefix of hash such that name_prefix_rank is not already a treated_name, and use that result.

4.4) otherwise find the smallest whole number, n, such that name_hash_n_rank is not already a treated name, and use that for the treated_name.

DML for 4.3 and 4.4 are left as an exercise for the reader.