SQLite User Forum

CLI .import command ignore/enumerate duplicate headers?

(1) By anonymous on 2022-02-05 15:20:05

Hello all,

Is there a way to have the CLI .import command handle duplicate column headers so it doesn't just fail?

Sometimes I get not-so-perfect files from customers or vendors that contain duplicate header names, and it would be nice to not have to manually specify table schemas, especially since sometimes the files have many hundreds of columns.

Thanks for any help/insights in advance.

(2) By Larry Brasfield (larrybr) on 2022-02-05 15:24:39 in reply to 1

I doubt that automatic recognition of "line is a column header" and "column header is duplicated" can happen. (I certainly would not advocate such special-casing.)

However, the .import dot-command has had a "--skip N" option for a while. If you are up to getting the duplication recognized and counted, that option would do the skipping for you (with an appropriate value of N).

(3) By Mark Lawrence (mark) on 2022-02-05 18:22:34 in reply to 2

The way I understand the reported issue, the desire is being expressed to handle the following scenario better:

#!/bin/sh

cat > input.csv <<EOF
one,two,three,two
1,2,3,2
10,20,30,20
EOF

cat <<EOF | sqlite3
.import --csv input.csv input
select * from input;
EOF

# CREATE TABLE input(...) failed: duplicate column name: two
# Error: near line 2: no such table: input

I think it would be a little more convenient if SQLite added a _2 (_3, ...) suffix to duplicate column names. It is a bit of a pain having to fix up this type of thing manually.
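For the example above, the auto-generated schema under such a scheme might then look like this (a sketch of the suggestion, not current behavior):

CREATE TABLE input("one" TEXT, "two" TEXT, "three" TEXT, "two_2" TEXT);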

(4) By Larry Brasfield (larrybr) on 2022-02-05 19:04:40 in reply to 3

Ok, I read the OP's statement of difficulty another way. (An example such as yours makes all the difference.)

I think the answer needs to be similar though. The option to auto-generate a table definition, given the presumed presence of legitimate column names in the first CSV record, is a convenience for those who have the quite common CSV format with appropriate column names. There is a simple solution today, which is just to create the table before the .import and then use its --skip option.
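For the CSV from the previous post, that workaround looks roughly like this (a sketch; the duplicate "two" is renamed by hand while writing the schema):

cat <<EOF | sqlite3
CREATE TABLE input(one, two, three, two_2);
.import --csv --skip 1 input.csv input
select * from input;
EOF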

I shudder to think of getting into the GITO business. (That's Garbage-In, Treasure-Out) There is no end to ad-hoc methods for recognizing and transforming Garbage that is not completely random junk. Even if the count of Garbage types was reasonably small, it would be hard to anticipate it. (I must grant that the OP's assumed instance of it and your example are simple, but I think the result is far from Treasure anyway; somebody has to figure out what to call the near-duplicate columns and/or what to do with them.)

What I think would be a good solution, for the OP and others who have CSV with rubbish as the first line but do not wish to count its columns and define a table prior to .import, is to add an option, --autocolumns, to the .import meta-command. It would create a new table, with column count conformed to the input, but with columns named "col_#" where '#' is an ascending integer starting with 1. Then, either the first CSV row can be skipped on .import, or read in as data and skipped when further processing the new table via SELECT ... OFFSET.
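Since --autocolumns does not exist, here is a hand-rolled sketch of the second variant for the 4-column input.csv above: generic col_# names, the header row read in as data and then skipped with OFFSET (LIMIT -1 means "no limit" in SQLite):

cat <<EOF | sqlite3
CREATE TABLE staging(col_1 TEXT, col_2 TEXT, col_3 TEXT, col_4 TEXT);
.import --csv input.csv staging
select * from staging limit -1 offset 1;
EOF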

This feature would be catering to quick-and-dirty data munging tasks. I don't think that has been a priority for the CLI's design.

(5) By Mark Lawrence (mark) on 2022-02-06 16:08:16 in reply to 4

I'm not feeling much improvement with your --autocolumns option in this case. It may be useful for those without header rows, but with headers in the data that would be throwing out the baby with the bathwater.

The OP mentioned up to hundreds of columns. Pre-creating schemas for that means reaching for a scripting language, or else painful scrolling in Excel or an $EDITOR to fix the dupes. Obviously such wide data has not been generated specifically for relational work, so magically generated "treasure from garbage" schemas are a real timesaver.

I often use SQLite for an initial investigation of CSV file structure/contents, with a view to creating narrower and cleaner views later. I guess my point is that the CLI can be great for lots of quick-and-dirty tasks, regardless of the developers' priorities :-)

How about a --dupcols option that lets SQLite remain strict by default?

(7) By Larry Brasfield (larrybr) on 2022-02-06 19:46:11 in reply to 5

I hope you are willing to discuss this. I also have used the SQLite CLI for plenty of quick-and-dirty tasks, along with perl, head, sed, etc. So, I'm sympathetic to the plight of those who want to get such things done without a lot of futzing, and would like to understand those scenarios and how possible new features [a] would play out in them.

> I'm not feeling much improvement with your --autocolumns option in this case. It may be useful for those without header rows, but with headers in the data that would be throwing out the baby with the bathwater.

I'm not so sure of that. Consider this sequence:

.import --csv --autocolumns path/to/my.csv StagingTable
DELETE FROM StagingTable;
.import --csv --skip 1 path/to/my.csv StagingTable

With minimal line editing and typing, that gets the non-header data in.

> The OP mentioned up to hundreds of columns. Pre-creating schemas for that means reaching for a scripting language, or else painful scrolling in Excel or an $EDITOR to fix the dupes.

I would not put "reaching for a scripting language" in the litany of horrors, but I agree that those things would be nice to minimize.

> Obviously such wide data has not been generated specifically for relational work, so magically generated "treasure from garbage" schemas are a real timesaver.

I have a working demo, a screen-scrape from which is:

sqlite> .import --csv --colname '%.11s_%02d' /b/Tmp/xact.csv Junk
sqlite> .import --csv --colname '%.0sCol_%02d' /b/Tmp/xact.csv Rubbish
sqlite> .schema
CREATE TABLE IF NOT EXISTS "Junk"(
  "Date_01" TEXT,
  "Transaction_02" TEXT,
  "Name_03" TEXT,
  "Memo_04" TEXT,
  "Amount_05" TEXT
);
CREATE TABLE IF NOT EXISTS "Rubbish"(
  "Col_01" TEXT,
  "Col_02" TEXT,
  "Col_03" TEXT,
  "Col_04" TEXT,
  "Col_05" TEXT
);

To save a discussion round-trip: I realize that the numeric suffix will often not be needed. (After all, not every column name will suffer duplication.) I would argue that names which have been transformed in some regular manner are either good enough, or if not, more readily subject to some touch-up with a little scripting help. The "ALTER TABLE ... RENAME COLUMN ..." feature that SQLite gained in release 3.25.0 can be a big help with this.
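For instance, a touch-up of one decorated name from the demo above (a sketch):

ALTER TABLE Junk RENAME COLUMN "Transaction_02" TO "Transaction";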

An additional reason to decorate every name, rather than only the duplicates, is that suppressing the suffix for already-unique names would require building and consulting a lookup table and much more logic than the above feature required. (Most of the needed code is for validating the format string to avoid nasty runtime faults.)

I welcome anybody to explain the shortcomings of this scheme.

> I guess my point is that the CLI can be great for lots of quick-and-dirty tasks, regardless of the developers' priorities :-)

Your point is well taken.

> How about a --dupcols option that lets SQLite remain strict by default?

I do not understand what this would do. Please explain.


[a] No new .import features are promised by anything I say here.

(8) By Mark Lawrence (mark) on 2022-02-08 08:30:32 in reply to 7

> I hope you are willing to discuss this. I also have used the SQLite ...

Yup, just got delayed a bit. Thanks for the interest and efforts so far. Your second example first:

CREATE TABLE IF NOT EXISTS "Rubbish"(
  "Col_01" TEXT,
  "Col_02" TEXT,
  "Col_03" TEXT,
  "Col_04" TEXT,
  "Col_05" TEXT
);

This outcome is what I was referring to with the bathwater. I would not use this for CSV input containing a header row, as the information loss is worse than fixing up the duplicates.

sqlite> .import --csv --colname '%.0sCol_%02d' /b/Tmp/xact.csv Rubbish
sqlite> .schema
CREATE TABLE IF NOT EXISTS "Junk"(
  "Date_01" TEXT,
  "Transaction_02" TEXT,
  "Name_03" TEXT,
  "Memo_04" TEXT,
  "Amount_05" TEXT
);

Your format string implies column names would actually be like "DateCol_01", "TransactionCol_02", ... right? Just a hack/execute/paste error I assume. Same with the Rubbish/Junk naming.

> I welcome anybody to explain the shortcomings of this scheme.

This as proposed does the job for me, simply because header information is retained and duplicates don't block the import. I won't call it pretty but I understand the constraints, and for quick and dirty it functions just fine.

> How about a --dupcols option that lets SQLite remain strict by default?

> I do not understand what this would do. Please explain.

A poorly named suggestion for handling duplicate column names elegantly, by suffixing only the duplicates: "Date", "Transaction", "Name", "Date_02", ...

(9) By anonymous on 2022-02-23 03:13:20 in reply to 7

OP here. Just wanted to say thanks a million for adding this functionality to the latest release of SQLite! It's going to be so much more convenient to not only have the system not fail every time, but the output that explains which columns are renamed is a fantastic way to know ahead of time what issues there were.
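For example, re-running the earlier failing import now goes through (a sketch; the exact wording of the rename notice depends on the CLI version):

cat <<EOF | sqlite3
.import --csv input.csv input
.schema input
select * from input;
EOF
# The duplicate "two" header no longer aborts the import; it is renamed
# (e.g. to "two_2") and the CLI prints which columns were renamed.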

Again, this is fantastic. Thanks so much, Larry. It's truly, truly appreciated.

(6) By rbucker on 2022-02-06 17:40:29 in reply to 1

Personally, I like to validate my data before blindly importing it. I have yet to experience any sort of trojan in the incoming data, but I have experienced CSVs with tons of errors because they were exported from XLSX:

  • columns with unicode data
  • columns with embedded quotes or commas
  • multiple columns to the right of the last column
  • .import sets all columns as text
  • since SQLite stores everything imported as text, the type-casting rules (which I'm not an expert in) mean there can be issues between = and LIKE... one could make the point that the preprocessing should try to identify the datatype

Anyway, you have several choices...

  • make your changes to the example CSV extension
  • preprocess your input files with some snappy tools like awk (see the sketch below)
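For example, a naive awk pass that suffixes duplicate header names before import (a sketch; it assumes the header row contains no quoted commas):

awk 'NR == 1 { n = split($0, h, ",");
               for (i = 1; i <= n; i++) {
                 c = ++seen[h[i]];    # occurrence count for this name
                 printf "%s%s", (c > 1 ? h[i] "_" c : h[i]), (i < n ? "," : "\n");
               }
               next }
     { print }' input.csv > deduped.csv

followed by a plain .import --csv deduped.csv input.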