SQLite Forum: slow import of CSV files

> I do not exactly understand what "in order" means, In your example, I do not know what "ORDER BY 1" does. I suspect, you mean that records are faster to insert if the insert happens in the order of the PRIMARY KEY. Perhaps, this helps indexing algorithm. I am not sure if my CSV files are ordered already, but I can certainly try.

"ORDER BY 1" means to order by the first result column in the select.  "ORDER BY 7, 3, 1, 8" would mean order by column 7 and within that by column 3 and within that by column 1 and within that by column 8.  You can also use the column-name in place of the column-ordinal (usually).

 > You mention Queen Marry II cache, but I do not know what this means. Large? The numbers from your test indicate that ordering for such cache is irrelevant. Default in sqlite3 shell (ver 3.33) is "-2000", what does minus sign mean? I use the default.

The default -2000 means about 2 Megabytes.  This means that if there are more than 2 MB of dirty pages in the cache they get flushed to disk.  The expression means a bloody big cache -- in that case it was about 4 GB which is large enough to hold *all* the pages in the cache so they were never written until the end.  If the cache is only small (as in the default 2 MB (-2000)) then if the same page is modified again after having been written out from the cache, then it will be re-read and re-written.  This makes a huge difference if the index inserts are not in-order because eventually there is a high likelyhood that inserting a record will require multiple pages to be read and written to the physical I/O device as the B-Tree is re-balanced.  This is called "thrashing" and occurs when the cache is too small to contain the working-set of modified pages.  Eventually the overhead to manage the cache will exceed the cost of I/O but when this occurs depends on the size of the cache and the speed of the underlying I/O device.  Technically you want to adjust the cache size to be just large enough to contain the working-set.

 > Do I interpret correctly, despite you do not explicitly write, that indeed the slowness is due to enforcement or indexing of the PRIMARY KEY column which can not be temporarily turned off. Should I not declare column as PRIMARY KEY and then, after INSERT finishes, should I create unique index on the column?

Yes.  Not declaring the key as PRIMARY KEY and creating a UNIQUE index afterwards will be quicker because building the index via CREATE UNIQUE INDEX will scan the table to collect all the data, sort it, and then build the index in-order to minimize disk I/O and tree rebalancing.  Sorting the data before insertion in the table into primary key order achieves the same result.

You will have a problem if you have multiple indexes because, presumably, one of them will be in least favourable (random) order and you can only optimize the insertion order for one at a time.

Except for the rowid (INTEGER PRIMARY KEY) the PRIMARY KEY is merely a unique index so you can get rid of the definition and create the unique index separately (unless you have a references clause that refers to the table by its bare name and not the name+columns, in which case you need to have a PRIMARY KEY declared so the short-form REFERENCE can figure out what column(s) are being referenced.