SQLite Forum

Repeatedly import a 10 GB CSV file into SQLite quickly without blocking reads

(1) By anonymous on 2021-08-26 19:46:06

Hello,

I have been trying to import a CSV file that has 9 columns. There are about 20 million entries in the file.

I have explored the different journal_mode settings to allow reads and writes to proceed simultaneously without blocking; the best fit was WAL mode.

Loading a 2 GB CSV file into SQLite takes 4 minutes the first time; however, the second time the operation becomes tremendously slow (almost 20 to 30 minutes).

These are the commands that are executed:

```
PRAGMA journal_mode=WAL;
BEGIN;
-- Create the table before the DELETE so the script also works on a fresh database.
CREATE TABLE IF NOT EXISTS foo(ip TEXT, field1 TEXT, field2 TEXT, field3 TEXT, field4 TEXT, field5 TEXT, num1 REAL, num2 REAL, field6 TEXT);
DROP INDEX IF EXISTS idx;
DELETE FROM foo;
.mode csv
.import sample.csv foo
-- Build the index after the bulk load; this is faster than inserting into an indexed table.
CREATE INDEX idx ON foo(ip);
COMMIT;
```

1st import: 4 minutes
2nd import: 30 minutes

Is there a better way to write/load this CSV file into SQLite, given that reads must happen in parallel?

(2) By Keith Medcalf (kmedcalf) on 2021-08-26 20:31:35 in reply to 1

Do the following:

```
PRAGMA journal_mode=WAL;
CREATE TABLE IF NOT EXISTS foo(ip TEXT, field1 TEXT, field2 TEXT, field3 TEXT, field4 TEXT, field5 TEXT, num1 REAL, num2 REAL, field6 TEXT);
DROP INDEX IF EXISTS idx;
DELETE FROM foo;
-- VACUUM cannot run inside an open transaction; running it here rebuilds
-- the file and reclaims the pages freed by the DELETE before re-importing.
VACUUM;
BEGIN;
.mode csv
.import sample.csv foo
CREATE INDEX idx ON foo(ip);
COMMIT;
```
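
Note that `.mode` and `.import` are sqlite3 shell dot-commands rather than SQL, so a script like this has to be fed through the CLI. A minimal invocation sketch, with hypothetical file names:

```
sqlite3 mydatabase.db < import_foo.sql
```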

Secondly, make sure you set a sufficiently large cache_size.
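
For illustration, a sketch of raising the cache; the figure here is an arbitrary assumption, not a value from the post. A negative `cache_size` is interpreted as a size in KiB rather than a page count:

```
-- Negative value = cache size in KiB; this requests roughly 1 GiB.
PRAGMA cache_size = -1048576;
```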

(3) By Simon Slavin (slavin) on 2021-08-26 21:01:42 in reply to 1

If blocking a database is your major problem, do the above CSV import to a different database file, then write your own code to copy rows from that database to the one being read. That way you can use whatever transaction sizes you want. Or perhaps the single optimised command `INSERT INTO table SELECT ...` will be good for you.

The `ATTACH` command is used to open two databases on the same connection.
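
A minimal sketch of that approach, assuming the CSV was imported into a staging file named `staging.db` whose `foo` table matches the live schema (both names are hypothetical):

```
-- On the connection serving the live database, attach the staging file.
ATTACH DATABASE 'staging.db' AS staging;

-- Either copy everything in one optimised statement...
INSERT INTO main.foo SELECT * FROM staging.foo;

-- ...or copy in bounded chunks, committing between them, to keep each
-- write transaction short. Repeat from application code until
-- staging.foo is empty.
BEGIN;
INSERT INTO main.foo
  SELECT * FROM staging.foo ORDER BY rowid LIMIT 100000;
DELETE FROM staging.foo WHERE rowid IN
  (SELECT rowid FROM staging.foo ORDER BY rowid LIMIT 100000);
COMMIT;

DETACH DATABASE staging;
```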