Issue importing csv file to db

(1) By AxioUK on 2020-07-23 20:09:38 [link] [source]

Dear, I have a "small" problem doing the import of a txt file with csv mode, the file is just over 13 million records, but the import does not bring all of them to the db, from record 11million approx onwards one that another record is left out, thus until completing almost 300,000 records, the text file is in this format: 123456|names|surnames|addresses|etc|etc

prior to import I perform an ANSI->UTF8 conversion from the text file.

the uncompressed txt file size is 1.3gb http://www2.sunat.gob.pe/padron_reducido_ruc.zip

The importation was done through:

c:sqlite3.exe -csv -init config.cfg CONSULT.db ""

where config.cfg contains:

CREATE TABLE IF NOT EXISTS SUNAT("RUC" INTEGER PRIMARY KEY, "NOMBRE O RAZÓN SOCIAL" TEXT, "ESTADO DEL CONTRIBUYENTE" TEXT, "CONDICIÓN DE DOMICILIO" TEXT, "UBIGEO" TEXT, "TIPO DE VÍA" TEXT, "NOMBRE DE VÍA" TEXT, "CÓDIGO DE ZONA" TEXT, "TIPO DE ZONA" TEXT, "NÚMERO" TEXT, "INTERIOR" TEXT, "LOTE" TEXT, "DEPARTAMENTO" TEXT, "MANZANA" TEXT, "KILÓMETRO" TEXT); .import padron_reducido_UTF8.txt SUNAT

any ideas...

Thanx in advance

(2) By Larry Brasfield (LarryBrasfield) on 2020-07-23 20:35:45 in reply to 1 [link] [source]

You should be aware that lines which have too few columns on CSV import have the "missing" fields set to NULL. If that result is what you mean by "one that another record is left out", you should investigate whether the file is legitimate CSV. (For example, using "|" to separate fields may confound the separator-quoting logic.)

I have used the SQLite shell's .import command to bring very large files into a DB, without any issues. I can also attest, from inspection of the code, that it does not do anything on the Nth line that is not done for any Mth line, where N and M are different numbers greater than 1, whether large or not.

(3) By Larry Brasfield (LarryBrasfield) on 2020-07-23 23:04:51 in reply to 1 [link] [source]

With a few minutes to spare, I downloaded your large text file. It is indeed using '|' as field separators, so it is very strange to be treating it as CSV.

Another issue is that the lines have 15 separators, which almost matches the 15 columns you define for table SUNAT. But the match is poor because 15 separators are needed for 16 fields. The import is very slow and noisy because of all the yapping about ignoring an extra field per record.

Another issue, perhaps explaining your disappointment, is that a great many lines, during the .import, produce an error to stderr such as: "adron_reducido_ruc.txt:13718893: unescaped " character". I think these are not accepted because they are invalid CSV.

I would recommend that you get these problems sorted out while testing with a few hundred lines of that file, perhaps including line 13718893. Do that in an interactive sqlite3 shell session so that you have a chance to catch errors rather than having them zip by in a flash or get swallowed. Once you have that working, your humongous .import may act more as you hoped.

(4) By Larry Brasfield (LarryBrasfield) on 2020-07-24 00:45:27 in reply to 1 [source]

I intended to mention one other problem.

When you .import into an existing table, the SQLite shell treats the first line of the file like all the others; it is not specially interpreted as naming the columns. Your large text file's first line has the same text values you use as column names in the CREATE TABLE statement you showed. Hence, if the .import had no other problems, the first line of the file is treated as mere data. That fails to be imported due to the constraint you have on the first column. This is kind of a happy accident, since you likely did not want it treated as data. But if you intended it to be treated as column headers, that is not happening either.

When you .import into a new table, (one not yet defined), the first line is treated as column headers, and the table is created with the names given. That may lead to an easier solution for you. It is a simple matter to .import into a temporary table to be created, then copy it (using a SELECT clause as the source) to a table which has all the types and constraints you need.

(5) By AxioUK on 2020-07-24 14:05:08 in reply to 4 [link] [source]

I use CREATE TABLE because I need the first column to be an INTEGER PRIMARY KEY, I don't know how to indicate this condition in the import, I added the missing column and considerably better the import time, but the import errors (missing records) continue.

Thanks for your answer Larry...

(6) By Larry Brasfield (LarryBrasfield) on 2020-07-24 17:29:31 in reply to 5 [link] [source]

Your missing records will be eliminated, partially if not wholly, by fixing the input where it is causing the shell to emit this:adron_reducido_ruc.txt:13718893: unescaped " character, where the '13718893' stands for quite a few numbers.

(8) By luuk on 2020-07-25 07:59:40 in reply to 6 [link] [source]

nl padron_reducido_ruc.txt  | grep '"' | awk -F'"' 'NF%2!=1{ print $0 }'

shows indeed some (483) lines which have an odd number of " characters.

I do not think import this file should take into account doing anything special with these (") characters.

I also (that's why i have the nl) do not see anything special about line 13718893

Because the OP (AxioUK) shared a file which he is not actually using, this will be a difficult problem, because we can only 'guess' that he is doing the conversion, from | delimited to , delimited, and from ISO-8859 to UTF-8, the correct way.

(7) By Keith Medcalf (kmedcalf) on 2020-07-24 19:03:53 in reply to 1 [link] [source]

Do your import
Identify a missing record
See why that record is not being imported

It is 100% certain that the problem exists in your input data.

(9) By Keith Medcalf (kmedcalf) on 2020-07-25 17:51:12 in reply to 7 [link] [source]

The first missing record is the one at line 11901523 in your input file.

The previous record 11901522 contains a "quoted field" which is not terminated for some number of lines.

There also appear to be many lines containing improperly escaped quotes, and quoted fields that are not terminated.

If you fix your quotes I am quite sure that your "problem" will go away.

(Note that I found this first erroneous record by using the above procedure).

(10) By Keith Medcalf (kmedcalf) on 2020-07-25 18:42:10 in reply to 9 [link] [source]

The following wee python3 script will "fix" your file for you.

fh = open('padron_reducido_ruc.txt', 'r', encoding='8859')
fo = open('padron_reducido_ruc.csv', 'w', encoding='utf8')
c = 0
for line in fh:
    row = line.strip(' \r\n').split('|')
    while len(row) < 16:
        row.append(None)
    for idx, item in enumerate(row):
        if item.startswith('"'):
            row[idx] = '"' + item.strip('"').replace('""','"').replace('"','""') + '"'
        elif '"' in item:
            row[idx] = item.replace('""','"').replace('"','""')
    print('|'.join(row), file=fo)
    c += 1
    if c % 1000000 == 0:
        print(c)
print(c)
fh.close()
fo.close()

And yes, fixing the quotes fixes the problem.