Speed of exporting a query result via command line shell
(1) By lordgrenville on 2023-07-17 15:19:37 [link] [source]
To export the result of a fairly simple but large query (~45 million records) to a CSV file I used the command line shell:
$ sqlite3 db.db3
> .headers on
> .mode csv
> .once result.csv
> select .....
This took about 9 hours to run. I then tried Python:
import sqlite3
import pandas as pd
conn = sqlite3.connect('db.db3')
df = pd.read_sql(query, conn)  # query holds the same SELECT as above
df.to_csv('output.csv')
This took about 20 minutes. I understand why Python might be a little bit faster but did not expect such a huge difference. Why is the command line shell so much slower? Is there a preferred way to do it?
I have a hunch that the answer might be that the shell tool writes each row of the output to the file as it is produced, whereas Python writes it out in larger batches, but I am interested to hear from someone who knows more.
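For reference, a streaming version in plain Python (a rough, untested sketch; the 10,000-row batch size is arbitrary) would be something like:
import csv
import sqlite3
conn = sqlite3.connect('db.db3')
cur = conn.cursor()
cur.execute(query)  # same SELECT as above
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([d[0] for d in cur.description])  # header row
    while True:
        rows = cur.fetchmany(10000)  # fetch and write in chunks
        if not rows:
            break
        writer.writerows(rows)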
P.S. This is basically a cross-post of my question on Stack Overflow.
(2) By jose isaias cabrera (jicman) on 2023-07-17 15:52:08 in reply to 1 [link] [source]
What is your SQLite3 command line version?
What is your SQLite3 python version?
(3) By lordgrenville on 2023-07-17 16:40:56 in reply to 2 [link] [source]
sqlite3 version: 3.37.2
The sqlite3 Python library is part of the standard library, I am using Python 3.8.12.
(4) By Larry Brasfield (larrybr) on 2023-07-17 16:53:06 in reply to 1 [link] [source]
It would be interesting to see how the execution times compare if you get the CLI (your "command line shell") to put the query results into memory first, then write that out as CSV. This would put the CLI and Pandas/Python I/O on a more equal footing.
Do your db.db3 and result.csv files reside on the same, spinning storage medium? Maybe your initial result interposes a lot of head seeking and rotation latency time penalties. If so, getting one to a different drive would help.
(5) By Larry Brasfield (larrybr) on 2023-07-17 16:55:12 in reply to 3 [link] [source]
What does "select sqlite_version();" say when given to your Python sqlite3?
(6) By jose isaias cabrera (jicman) on 2023-07-17 17:49:20 in reply to 3 [link] [source]
Just for the fun of it, I followed your steps using version 3.37.2 and exported 200K records; it took 5.42 seconds. I did the same thing with version 3.43.0 2023-07-11 19:54:17, and it took 7.68 seconds. I thought the newer version would be at least a little bit faster, so it looks like the newer version may take even longer. This was a quick one-time test, though.
(7) By lordgrenville on 2023-07-17 18:02:17 in reply to 5 [link] [source]
In [54]: conn.cursor().execute('select sqlite_version();').fetchone()
Out[54]: ('3.37.2',)
Full output of CLI version is 3.37.2 2022-01-06 13:25:41 872ba256cbf61d9290b571c0e6d82a20c224ca3ad82971edc46b29818d5d17a0
(8) By lordgrenville on 2023-07-17 18:22:38 in reply to 4 [link] [source]
to put the query results into memory first, then write that out as CSV
I agree, and I would like to try, but how do I do this in the CLI? I only know how to export a query with .output or .once, or via sqlite3 "query" > output.csv. Is there a way to first load the query and then write it?
Do your db.db3 and result.csv files reside on the same, spinning storage medium?
Yes, on an SSD NTFS partition mounted on a Linux machine, which has sometimes caused me issues in the past. Info in case this is relevant:
$ sudo parted -l
Model: SAMSUNG MZQLB7T6HMLA-00007 (nvme)
Disk /dev/nvme1n1: 7682GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 7682GB 7681GB ntfs Basic data partition msftdata
In my original post I was going to use the term CLI but saw "command line shell" in the docs so I thought it might be preferred :)
(9) By jchd (jchd18) on 2023-07-17 20:11:06 in reply to 1 [link] [source]
AFAIK the .mode csv option forces the CLI to examine every column of every row to decide whether it needs to be enclosed in double quotes. I don't know if it does other processing as well.
That might explain a difference in runtime but maybe not such a large gap.
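For illustration only (the shell's exact quoting rules may differ), Python's csv module makes the same kind of per-field decision with minimal quoting:
import csv, io
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
# Only fields containing a delimiter, quote character, or newline get quoted.
writer.writerow(['plain', 'has,comma', 'has "quote"'])
print(buf.getvalue())  # plain,"has,comma","has ""quote"""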
(10.1) By Larry Brasfield (larrybr) on 2023-07-17 22:07:43 edited from 10.0 in reply to 8 [link] [source]
[H]ow do I ... ?
This would get the DB from "disk" into memory so it can be queried without I/O delays [1]:
sqlite3 db.db3
# Persistent DB file now open as connection #0
.connection 1
# Specify a sharable, in-memory DB for connection #1
.open file:inmem?mode=memory&cache=shared
.connection 0
# Clone the connection #0 DB to that same in-memory DB
.clone file:inmem?mode=memory&cache=shared
.connection 1
# Output the in-memory DB to CSV without any fetching from "disk".
.mode csv
.once result.csv
SELECT * from SomeTable;
Yes, on an SSD NTFS partition
I suppose that is spinning, about 15 degrees per hour.
[1] This does not avoid main memory read delays, but is comparable to reading into a Pandas in-memory data object.
(11) By anonymous on 2023-07-17 23:02:31 in reply to 8 [link] [source]
how do I do this in the CLI?
attach database ':memory:' as in_memory;
create table in_memory.large_result as [insert original query here];
.mode csv
.headers on
.once result.csv
select * from in_memory.large_result;
(12) By lordgrenville on 2023-07-18 07:25:40 in reply to 10.1 [source]
Yes, this is great, thanks! I forgot to measure the exact times, but the write took about 8 minutes and loading the whole database into memory took around 10-15 minutes, so in total this is about the same as in Python.
Does this mean that when I did it originally most of the time it took was lots of separate disk reads? Does it matter that this is an SSD and not a spinning disk?
It seems strange that there is such a big difference between the two methods, as naively I would have guessed that when I query a database on disk the program first fetches the whole query result into memory and then writes it, so the running time would be about the same.
(13) By Stephan Beal (stephan) on 2023-07-18 07:35:02 in reply to 12 [link] [source]
It seems strange that there is a big difference in between the two methods, as naively I would have guessed that when I query a database on disk the program first fetches the whole query into memory
That generally only happens if the software involved is naive or knows for certain that it is working with relatively small data sets. SQL query results are meant to be iterated over one row at a time, with only a single row in memory at any given moment. The result is that the memory cost stays more or less flat regardless of how big the db is (i.e. 10 rows and 10 billion cost the same amount of peak memory). If the whole data set were read into memory in advance, the app might run out of memory, or hog enough of it that other apps run out. It would also mean that the time it takes to start processing the data would increase as the number of db rows increases.
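A rough Python sketch of that pattern (assuming a table named SomeTable, as in the example above), where peak memory stays flat no matter how many rows the query returns:
import sqlite3
conn = sqlite3.connect('db.db3')
count = 0
# The cursor is a lazy iterator: only the current row is held in memory,
# whether the result is 10 rows or 10 billion.
for row in conn.execute('SELECT * FROM SomeTable'):
    count += 1  # handle one row at a time here
print(count)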
(14) By lordgrenville on 2023-07-18 08:18:33 in reply to 13 [link] [source]
Ok, that makes a lot of sense. Thanks for explaining.
(15) By Larry Brasfield (larrybr) on 2023-07-18 14:40:20 in reply to 12 [link] [source]
(Referring to execution times becoming comparable during experiment:)
Does this mean that when I did it originally most of the time it took was lots of separate disk reads?
I would not put it quite that way. I would say the extra time was probably spent getting data read out of the NVRAM devices while page reads within the SSD were largely going to waste, because the SSD-level reads kept being interrupted by unrelated write operations.
I believe that SSDs are highly optimized for lots of reading and little writing [1]. Your initial results demonstrate (to me) that the buffering system in the SSD is doing very little good for transfer speed in your balanced read/write usage.
Does it matter that this is an SSD and not a spinning disk?
I expect that it matters. With a disk spinning (at 90+ revolutions/second), data fetching and data writing from/to separate tracks on the medium would likely be separated by movement of the flying head (aka "seeking") and waiting for the data to whirl around to underneath the head(s). That might be slower. However, given that spinning disks are often used in balanced read/write scenarios, a spinning disk buffering scheme might better support such a pattern than your SSD does.
A side-note in case it is not obvious: I believe you are badly beating up that poor SSD. I would avoid your usage pattern purely for miserly reasons.
[1] The write wear-out specs on SSDs and NVRAM devices generally are enough to heavily discourage patterns of usage that are not read-mostly.
(16) By lordgrenville on 2023-07-19 09:34:39 in reply to 15 [link] [source]
Ok, thanks for the explanation.
A side-note in case it is not obvious
It was not (to me). I'll bear this in mind - thanks!