slow import of CSV files
(1) By Roman (moskvich412) on 2020-09-26 22:23:13 [link] [source]
Dear SQLiters,
I have large CSV files (30GB, ~100M records ). These are dumps from existing database (unfortunately). I therefore have to re-create schema, clean up (NULL versus empty string, convert to real, etc). I do this in two steps (all within sqlite3 shell):
Step 1. sqlite3's .import of CSV file into a table QQQ that will be deleted once all finished. This makes all columns TEXT since no type is available. This step is reasonably fast, 20min
Step 2. Create table according to schema with PRIMAMRY KEY and other constraints and INSERT from QQQ into the final destination.
The second step is slow and it gets slower as database size grows. I suspect but not sure the slowdown is caused by enforcing uniqueness of the PRIMARY KEY. Since I know the data source is a database, I do not need to enforce the check on the fly. Plus, I know there is PRAGMA integrity_check; that can be used to verify at the end after everything is inserted. If I am correct, is there a way to disable or otherwise speed up the second step? I know FOREIGN KEY constrain can be turned off (and I keep it off). I know PRAGMA ignore_check_constraints could turn off CHECK. But nothing on primary key enforcement. Maybe my two-step process is wrong and there is a better way?
In my case PRIMARY KEY is a string (used to be 38 digit integer, too large for SQLite as I understand).
Thank you,
Roman
(2) By Keith Medcalf (kmedcalf) on 2020-09-27 00:12:01 in reply to 1 [link] [source]
Is the insert "in order"?
(3) By Keith Medcalf (kmedcalf) on 2020-09-27 00:32:12 in reply to 2 [link] [source]
This is extremely faked, but notice the difference between in-order insertion and out-of-order insertion:
>sqlite sample.db
SQLite version 3.34.0 2020-09-26 18:58:45
Enter ".help" for usage hints.
sqlite> create table x(x text primary key not null);
sqlite> .timer on
sqlite> pragma cache_size=1000;
Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite> insert into x select random() from wholenumber where value between 1 and 10000000;
Run Time: real 197.973 user 36.953125 sys 160.046875
sqlite> .exit
>dir sample.db
2020-09-26 18:22 601,960,448 sample.db
>del sample.db
>sqlite sample.db
SQLite version 3.34.0 2020-09-26 18:58:45
Enter ".help" for usage hints.
sqlite> pragma cache_size=1000;
sqlite> create table x(x text primary key not null);
sqlite> .timer on
sqlite> insert into x select random() from wholenumber where value between 1 and 10000000 order by 1;
Run Time: real 14.937 user 17.078125 sys 2.843750
In the first case the insertion is in "random" order, meaning that the index structures must be continually updated and the B-Tree rebalanced as the insert progresses.
In the second case the insertion is "in-order" so the B-Tree is always built in-order with a minimum of rebalancing as the insert progresses.
Note that I have reduced the cache to itty-bitty (which is the default) since this will have an devastatingly huge impact on the amount of I/O performed (I have a cruise-ship-load of RAM for caching, so without this the difference is less noticeable).
Even with the sort (order by clause) the in-order insertion is more than 10 times faster than the out-of-order insertion.
(4) By Keith Medcalf (kmedcalf) on 2020-09-27 00:37:56 in reply to 3 [link] [source]
This is the same thing with "Queen Elizabeth II" cruise ship sized cache.
>del sample.db
>sqlite sample.db
SQLite version 3.34.0 2020-09-26 18:58:45
Enter ".help" for usage hints.
sqlite> create table x(x text not null primary key);
sqlite> .timer on
sqlite> insert into x select random() from wholenumber where value between 1 and 10000000;
Run Time: real 21.598 user 20.265625 sys 0.984375
sqlite> ^Z
>del sample.db
>sqlite sample.db
SQLite version 3.34.0 2020-09-26 18:58:45
Enter ".help" for usage hints.
sqlite> create table x(x text not null primary key);
sqlite> .timer on
sqlite> insert into x select random() from wholenumber where value between 1 and 10000000 order by 1;
Run Time: real 20.600 user 19.046875 sys 1.171875
sqlite>
(5) By rayB on 2020-09-27 02:29:31 in reply to 1 [link] [source]
Are you using transactions? If not, can I suggest you do.
(6) By Roman (moskvich412) on 2020-09-27 03:39:20 in reply to 4 [link] [source]
Dear Keith,
I do not exactly understand what "in order" means, In your example, I do not know what "ORDER BY 1" does. I suspect, you mean that records are faster to insert if the insert happens in the order of the PRIMARY KEY. Perhaps, this helps indexing algorithm. I am not sure if my CSV files are ordered already, but I can certainly try.
You mention Queen Marry II cache, but I do not know what this means. Large? The numbers from your test indicate that ordering for such cache is irrelevant. Default in sqlite3 shell (ver 3.33) is "-2000", what does minus sign mean? I use the default.
Do I interpret correctly, despite you do not explicitly write, that indeed the slowness is due to enforcement or indexing of the PRIMARY KEY column which can not be temporarily turned off. Should I not declare column as PRIMARY KEY and then, after INSERT finishes, should I create unique index on the column?
Thank you,
Roman
(7) By Keith Medcalf (kmedcalf) on 2020-09-27 07:34:23 in reply to 6 [link] [source]
I do not exactly understand what "in order" means, In your example, I do not know what "ORDER BY 1" does. I suspect, you mean that records are faster to insert if the insert happens in the order of the PRIMARY KEY. Perhaps, this helps indexing algorithm. I am not sure if my CSV files are ordered already, but I can certainly try.
"ORDER BY 1" means to order by the first result column in the select. "ORDER BY 7, 3, 1, 8" would mean order by column 7 and within that by column 3 and within that by column 1 and within that by column 8. You can also use the column-name in place of the column-ordinal (usually).
You mention Queen Marry II cache, but I do not know what this means. Large? The numbers from your test indicate that ordering for such cache is irrelevant. Default in sqlite3 shell (ver 3.33) is "-2000", what does minus sign mean? I use the default.
The default -2000 means about 2 Megabytes. This means that if there are more than 2 MB of dirty pages in the cache they get flushed to disk. The expression means a bloody big cache -- in that case it was about 4 GB which is large enough to hold all the pages in the cache so they were never written until the end. If the cache is only small (as in the default 2 MB (-2000)) then if the same page is modified again after having been written out from the cache, then it will be re-read and re-written. This makes a huge difference if the index inserts are not in-order because eventually there is a high likelyhood that inserting a record will require multiple pages to be read and written to the physical I/O device as the B-Tree is re-balanced. This is called "thrashing" and occurs when the cache is too small to contain the working-set of modified pages. Eventually the overhead to manage the cache will exceed the cost of I/O but when this occurs depends on the size of the cache and the speed of the underlying I/O device. Technically you want to adjust the cache size to be just large enough to contain the working-set.
Do I interpret correctly, despite you do not explicitly write, that indeed the slowness is due to enforcement or indexing of the PRIMARY KEY column which can not be temporarily turned off. Should I not declare column as PRIMARY KEY and then, after INSERT finishes, should I create unique index on the column?
Yes. Not declaring the key as PRIMARY KEY and creating a UNIQUE index afterwards will be quicker because building the index via CREATE UNIQUE INDEX will scan the table to collect all the data, sort it, and then build the index in-order to minimize disk I/O and tree rebalancing. Sorting the data before insertion in the table into primary key order achieves the same result.
You will have a problem if you have multiple indexes because, presumably, one of them will be in least favourable (random) order and you can only optimize the insertion order for one at a time.
Except for the rowid (INTEGER PRIMARY KEY) the PRIMARY KEY is merely a unique index so you can get rid of the definition and create the unique index separately (unless you have a references clause that refers to the table by its bare name and not the name+columns, in which case you need to have a PRIMARY KEY declared so the short-form REFERENCE can figure out what column(s) are being referenced.
(8) By Keith Medcalf (kmedcalf) on 2020-09-27 07:45:50 in reply to 6 [link] [source]
In the first example that was unsorted (inserted in random order) with a small cache the I/O rate was sustained at several hundred megabytes/second.
When inserting "in-order" I/O rate was hugely reduced since there were almost no pages that needed to be re-read and re-written.
I/O was much lower with a larger cache (actually, where the cache was large enough to hold all the data the entire file was written in one big 3 GB/sec I/O at the end and all the operations occurring in the application/SQLite3 cache which is why there was almost no difference between the "in-order" and "random-order" insert speed.
(9) By Roman (moskvich412) on 2020-09-27 12:07:18 in reply to 7 [link] [source]
Thank you, Keith.
Now I understand and loaded the table (still the old way, it was running for 10 hours). I will test larger cache, I understand SQLite needs to hold index in RAM, otherwise it re-reads to ensure uniqueness.
I now wanted to count rows with SELECT count() FROM tableName; without any filtering criteria. And it is very slow, much slower than Linux' line count wc -l CSVFile. What causes this? Count does not need to hold anything in RAM. I would expect the two methods (sqlite and wc -l) to have similar speed if not SQLite being faster since disk I/O is the same.
Roman
(10.1) By little-brother on 2020-09-27 12:56:13 edited from 10.0 in reply to 1 [link] [source]
Just turn off the disk synchronization on import operation and turn it on after. * pragma synchronous = off * import csv * move data to target * clear QQQ * pragma synchronous = off Also you can try to process data by parts using limit and offset begin; insert into trg select * from QQQ limit 100000 offset 100000 * N; commit; P.S. Use SSD as SQLite storage if it possible.
(11) By Ryan Smith (cuz) on 2020-09-27 13:08:51 in reply to 9 [link] [source]
Firstly, try
SELECT (*) FROM tablename;
i.e: Make sure to include the asterisk - it allows special index optimization and should run in the order of milliseconds for a few million rows. [This /may/ work sans asterisk, but I'm not sure, hence suggesting it] Side-note: wc -l isn't quite the same. Lines-to-records relation isn't always 1-to-1, the CSV format allows single fields spanning multiple lines.
10 hours seems extremely excessive, even for 100GBs of CSV imports - perhaps if you are importing it with primary key (and/or other indexes) predefined on the table and out of order data, to spinning rust and with meager cache settings. Even then though, 10 hours makes me feel nauseous.
Are the CSV table data very wide perhaps? Contain ludicrously large strings and or blobs perhaps?
If I had a CSV file that was so fat it legitimately needed 10hours of import time, here are the first things I would do:
- Declare the table without ANY indexes, no primary key either,
- If the table is pre-existing, then perhaps drop the indexes.
- Avoid touching the rowid by not declaring an "... INTEGER PRIMARY KEY" field,
- Import to fastest drive hardware available,
- Set FSync off (pragma synchronous)
- Set the allowed Cache size to be several Gigs, and
- Instate/Re-instate FSync and any Indexes after the import.
This should drink in the CSV file at best possible speed, which I would be surprised if it is even more than 1 hour for a mere 30GB.
Notes:
- I realize not all these options are available to you if this should run on a Client's hardware. Do what you can.
- If there is a formatting step, like making sure some values are Integer or such, that is best done with the import step - Any slowdown from it typically outweighs having to import to a temp table and then format-importing to the real table.
All items I mentioned above are available in the docs from more info. If you need assistance with, or more information on, any specific item above, please ask. If you have implemented all of it and still get runs of 10 hours for a 30GB import, something must be wrong. Perhaps then share the CSV file so we can try and see (if the data isn't sacred, and if it is, do an obfuscation step first).
(12) By Richard Hipp (drh) on 2020-09-27 13:23:47 in reply to 1 [source]
Things will go much faster if you omit all PRIMARY KEY and UNIQUE constraints from your initial import. You can go back afterwards and do
CREATE UNIQUE INDEX ...
To emulate each of your PRIMARY KEYs and UNIQUEs after the import has finished.
Afterwards, if you want to create a new copy of the table that actually contains a PRIMARY KEY definition, then do so, and populate the new official table from the data in the import table using:
INSERT INTO official SELECT * FROM imported ORDER BY pk1, pk2, ...;
Where pk1, pk2, ... are the PRIMARY KEY columns.
All this boils down to the "in order" thing that Keith was talking about.
(13) By Roman (moskvich412) on 2020-09-27 23:01:37 in reply to 12 [link] [source]
Dear Richard, Dear SQLiters, Let me see if I understood you correctly. In order to re-create table with all constraints from a 30GB CSV file with 100M records using sqlite3 shell I have to follow 3 steps : I schematically describe the SQL commands keeping important things. If I do not list the important, then I missed it, please correct. 1. .import 'large.csv' QQQ_1. In table QQQ_1 all types are TEXT. 2. CREATE table QQQ_2( columns with proper type, but not unique or primary constraints). PRAGMA schema.cache_size = -1000000; -- set cache to 1G INSERT INTO QQQ_2 SELECT FROM QQQ_1; CREATE UNIQUE INDEX ON QQQ_2 to emulate PRIMARY KEY/unique constraints; 3. CREATE table official( columns with proper type, AND proper unique/primary constraints). INSERT INTO official SELECT FROM QQQ_2 ORDER BY list of columns in pk; Drop tables QQQ_1 and QQQ_2. Thank you, as always, for your help, Roman
(14) By Roman (moskvich412) on 2020-09-29 22:38:47 in reply to 13 [link] [source]
Dear SQLiters,
Evaluating steps to import CSV files (thank you again for explaining the process), I question if step 3 is needed. That is:
What is the difference between PRIMARY KEY and a separate UNIQUE index?
In general, I prefer constraints to be obvious in the table declaration. This argues for PRIMARY KEY. But I think that a separate UNIQUE index is just as efficient (and maybe internally is created for PRIMARY key). Advantage of the separate UNIQUE INDEX is ability to turn on and off its enforcement by dropping and recreating it. With this, maybe it is better to stop at step 2?
What is your recommendation? (I know it really is general and has little to do with the original question of importing CSV files)
Thank you,
Roman
(15) By Keith Medcalf (kmedcalf) on 2020-09-30 00:35:22 in reply to 14 [link] [source]
What is the difference between PRIMARY KEY and a separate UNIQUE index?
There is almost no difference between a PRIMARY KEY and a UNIQUE index (whether that UNIQUE index is specified via the UNIQUE constraint in a table definition or by a separate CREATE UNIQUE INDEX statement) with the following exceptions:
- INTEGER PRIMARY KEY in a RowID table (that is, a table that is not created WITHOUT ROWID) declares an explicit name for the rowid.
- if the table is the target of a REFERENCES clause that does not specify the column(s) referenced in the target, then the PRIMARY KEY column(s) are assumed.
- expressions can be used in the index expression list of a CREATE UNIQUE INDEX statement but cannot be used in a PRIMARY KEY or UNIQUE declaration in a table specification.
Other than these extremely specific points of functionality, PRIMARY KEY is merely syntactic sugar (an alternate spelling) for UNIQUE in the table definition, and merely an alternate declaration method (syntactic sugar) for a CREATE UNIQUE INDEX statement.
(16) By Ryan Smith (cuz) on 2020-09-30 10:46:37 in reply to 15 [link] [source]
To add - Normally the most obvious difference is that a UNIQUE Index can contain NULL values (where every NULL is regarded as distinct from every other NULL) whereas a PRIMARY KEY is not allowed to have NULL values in normal SQL.
This is enforced in most all SQL engines, but a peculiar bug in SQLite, now maintained for backward compatibility, has allowed SQLite-specific PK's to contain NULLs. This was possible by virtue of normal SQLite tables having a separate rowid indexer, a fact which informs the two exceptions to this rule:
SQLite PK's cannot contain NULL values in the cases of:
- the rowid-alias PK declared as
" ... INTEGER PRIMARY KEY"
- and tables where the rowid is omitted by being declared as:
"CREATE TABLE t(...) WITHOUT ROWID;".