SQLite Forum

Inserting One Billion Rows in SQLite Under A Minute
Login

Inserting One Billion Rows in SQLite Under A Minute

(1) By Simon Slavin (slavin) on 2021-07-18 13:40:31 [link] [source]

Inserting One Billion Rows in SQLite Under A Minute

https://avi.im/blag/2021/fast-sqlite-inserts/

" Recently, I ran into a situation where I needed a test database with lots of rows and needed it fast. So I did what any programmer would do: wrote a Python script to generate the DB. Unfortunately, it was slow. Really slow. So I did what any programmer would do: went down the rabbit hole of learning more about SQLite, Python, and eventually Rust… in my quest to get a 1B row database under a minute. This blog post is a summary of this fun and educational exercise. "

(2) By David Jones (vman59) on 2021-07-18 14:56:26 in reply to 1 [link] [source]

I've done a couple of those things and seen moderate results. My experience with making the page cache very large is it adds a big delay when you close the DB (presumably due to the I/O flushing the cache).

If the schema has several indices, is it better to defer adding them until after the tables are loaded? I don't know if an index build/rebuild can optimize page cache management while walking a large table for an index build/rebuild.

(3) By Keith Medcalf (kmedcalf) on 2021-07-18 20:49:26 in reply to 2 [source]

My experience with making the page cache very large is it adds a big delay when you close the DB (presumably due to the I/O flushing the cache).

It takes no longer to free(x) if x is 1 byte or if x is 1 TB.

What you are talking about is "flushing dirty pages" (a commit operation). It is related to the number of dirty pages in the cache. Writing 1000 pages takes pretty much the same total time actually writing whether you write them one by each spread over an hour or all at once.

If the schema has several indices, is it better to defer adding them until after the tables are loaded?

Maybe. If you can do that then yes, the number of pages to be updated decreases as the size of the batch increases.

I don't know if an index build/rebuild can optimize page cache management while walking a large table for an index build/rebuild.

Index Pages are also stored in the page cache. If the page cache is too small then "partial changes" must be spilled onto physical media more frequently and these I/O operations take time. The fewer I/O operations you can perform on the same page (by keeping it in the closest cache -- the application cache) the better.

(5) By ThanksRyan on 2021-07-19 17:46:29 in reply to 2 [link] [source]

(4) By ThanksRyan on 2021-07-19 17:44:22 in reply to 1 [link] [source]

Avi's SQLite thread from May when he asked about this:

https://sqlite.org/forum/forumpost/22f19295620387a1?t=h