sqlite write performance is slower than the filesystem
(1.1) Originally by bioinfornatics with edits by Richard Hipp (drh) on 2020-03-14 18:32:55 from 1.0 [link]
Dear, I wrote a little script (availlable [here](https://gist.github.com/bioinfornatics/2870d9e7dba192114f32ce8299796269)) to compare i/o perf sqlite vs raw tabulated file in python. The result it is a big win for the raw text writer as it is 10 times faster than sqlite writer. While sqlite claims to be [faster than the filesystem](https://www.sqlite.org/fasterthanfs.html). Is it possible to speedup sqlite ? As sqlite does not enforce the [length of a VARCHAR](https://sqlite.org/faq.html#q9) it seem that sqlite can not split rows to insert them by block by guessing the maxmimal size that can take a row. So If I use [SQLITE\_MAX\_LENGTH](https://sqlite.org/limits.html) does the sqlite engine will optimize prepared insert statement with the use of multithread ? Thanks for your insight Have a nice day best regards
(2) By Wout Mertens (wmertens) on 2020-03-14 19:33:33 in reply to 1.1 [link]
Try removing the VACUUM call, there's no empty pages to defragment anyway. Furthermore, you're doing a bunch of PRAGMA calls that may or may not help. Try not doing those.
(3.2) By bioinfornatics on 2020-03-14 20:34:56 edited from 3.1 in reply to 2 [link]
Thanks wmertens for your help originally I had: ``` database was writed in 1.3185472186693612 seconds. text was writed in 0.11124303232645616 seconds. ``` without `VACUUM` and pragma: ``` database was writed in 0.5217875576733301 seconds. text was writed in 0.11288163432618603 seconds. ``` It is 5 times slower than classical filesystem write
(4.1) By bioinfornatics on 2020-03-14 20:11:47 edited from 4.0 in reply to 2 [link]Deleted
(5) By Richard Hipp (drh) on 2020-03-14 20:53:01 in reply to 1.1 [link]
If I'm reading [your Python][prog] correctly, you seem to be trying to do all of the following on every write: 1. Open an SQLite database connection. 2. Configure that connection. 3. Do the write. 4. Vacuum. 5. Close the connection. I suspect you are spending most of the CPU time in steps 1, 2, 4, and 5. I suggest you revise your program so that it opens a single database connection, just once, configures it just once, then do lots of writes. Then at the very end, vacuum if you think you should (you probably don't need to) and then close the connection. I think that will go *a lot* faster. [prog]: https://gist.github.com/bioinfornatics/2870d9e7dba192114f32ce8299796269
(6.2) By bioinfornatics on 2020-03-14 21:41:49 edited from 6.1 in reply to 5 [link]
I repeat 3 times the writes process and compute the mean time. I try to do a fair comparison with the classical way wich is: 1. open file 2. do the write 3. close file I updated [the code](https://gist.github.com/bioinfornatics/2870d9e7dba192114f32ce8299796269) to remove time spent on SQLite database connection and configuration as we coud consider that these operations are done only a few time during an application life ``` ./write_perf_db_txt.py database was writed in 0.5052063270122744 seconds. text was writed in 0.10451052167142431 seconds. ```
(7) By Keith Medcalf (kmedcalf) on 2020-03-14 21:47:20 in reply to 1.1 [link]
Of course if you remove all the overhead of parsing the SQL, parsing the parameters, allocating cells to store the records, and storing the records, then a simple sequential file is faster. Now, compare the speed of retrieving the n'th record of data. The SQLite3 database can do that in a few MICROSECONDS while it will take you (on average), 50 milliseconds to do that since you will have to read and discard n-1 records before you get to the record you want. The SQLite3 database will just return the single record that you wanted directly. So your "sequential file" is 50,000 times slower than SQLite3 at retrieving data. Assuming that you are merely writing all your records once, you only need to retrieve more than 3 random records, on average, before the "speed" of using an SQLite3 database is "faster" that your sequential file. So yes, a sequential file accessed entirely sequentially is superior to a database accessed entirely sequentially. And how is this surprising?
(8) By bioinfornatics on 2020-03-14 22:01:31 in reply to 7 [link]
Thanks kmedcalf for your help and insight I though that, as a row size can be estimated (sum of max size for each fields), sqlite was able to write a batch of row in parrallel and take advantage of SSD. While a sequential write can not.
(9) By Keith Medcalf (kmedcalf) on 2020-03-14 22:35:06 in reply to 8 [link]
SQLite is an "in process" library -- it executes entirely "in line" within the callers thread. The only possible exception to this is that if-and-only-if all the following conditions are met: - the library code was *not* compiled with SQLITE_THREADSAFE=0 - the library code was compiled with SQLITE_MAX_WORKER_THREADS greater than 1 - the current configuration is not SQLITE_CONFIG_SINGLETHREAD - the value returned by PRAGMA THREADS is greater than 1 (initially set to min(SQLITE_MAX_WORKER_THREADS, SQLITE_DEFAULT_WORKER_THREADS) then SORT operations may utilize additional SQLite3 created threads up to the limit set by PRAGMA THREADS, such as for the creation of indexes or the operation of ORDER BY or GROUP BY sorting. Otherwise, all operations are carried out linearly in the single thread of execution provided by the caller.
(10) By Jim Dodgen (jimdod) on 2020-03-14 22:45:44 in reply to 1.1 [link]
I think the sqlite open->write->close, open->write->close,open->write->close logic is a little odd for a database databases are more like open-> Insert to a tableA, insert to tableB, insert to tableC, -> close
(11) By Simon Slavin (slavin) on 2020-03-15 00:33:44 in reply to 8 [link]
SQLite does not use the same amount of space for each row. Different integers take different numbers of bytes to store. Strings take the amount of space their length needs. NULL fields take just one byte, or sometimes no bytes at all.
(12.1) By bioinfornatics on 2020-03-15 01:20:52 edited from 12.0 in reply to 1.1 [link]
So to be complete I updated the code [rev3](https://gist.github.com/bioinfornatics/2870d9e7dba192114f32ce8299796269) in order to bench write rows into the database vs sequential write. And added a second part which bench to select row if the first field is into a list. I compare too with and without primary key (constraint). ``` text was writed in 0.10 seconds. database was writed in 0.58 seconds. database with constraint was writed in 1.25 seconds. text selected in 0.32 seconds. database selected in 0.00 seconds. database with constraint selected in 0.00 seconds. ```
(13) By Rowan Worth (sqweek) on 2020-03-16 05:29:19 in reply to 12.1 [link]
SQLite also does extra work on the write-path compared to your text implementation, to ensure durability. ie. it calls fsync() to ensure the data is actually on disk. You can try adding this for benchmarking purposes: `PRAGMA synchronous=OFF` But note that this is not considered a safe option in production - if a machine loses power mid-transaction then you'll end up with a corrupt database.
(14) By Warren Young (wyoung) on 2020-03-16 17:11:14 in reply to 13 [link]
Another way to get much the same effect is to do all of your inserts within a single explicit SQL transaction, calling `COMMIT` only once at the end. It's a common pattern for importing bulk data into any SQL DB for this very reason.
(15) By Keith Medcalf (kmedcalf) on 2020-03-16 18:00:37 in reply to 14 [link]
The sqlite3 wrapper in Python does this automagically unless you have turned the automagic setting to "manual". The wrapper calls this the isolation_level of the connection and is controlled by a connection property (and connect method parameter) of the same name. How well this magic works depends on the particular version of pysqlite2 that is being used (the sqlite3 wrapper), and thus the version and release date of the version of python being used. The sample uses the default setting so the first "insert" on the connection will begin a deferred transaction which is committed when the .commit method of the connection is called. The .executemany method prepares the SQL statement and then binds and steps the resulting statement for each member of the list of bind parameters.
(16) By Rowan Worth (sqweek) on 2020-03-17 01:55:28 in reply to 15
Yeah I figured all the inserts must have been magically batched into a single transaction, otherwise the SQLite write times would have been _much_ slower.