Increasing insert performance of an in-memory db

(1) By srinarasi on 2021-08-10 08:29:29

# Some context
We have an in-memory database with a simple table(3 integers and a string, the first 3 numbers form the primary key, no other indexes at this point ). We need to create the DB from scratch on launch and populate 100 million rows in the table. The DB grows to ~9GB after it's completely populated. Currently it takes around 90 minutes which means the insertion rate is ~18000 rows/second. 

Our code is written in C++. I have tried everything in [this thread](https://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite). My machine is on Azure with 12 cores and has 128 GB of RAM. We compile sqlite with `SQLITE_WIN32_MALLOC` and `SQLITE_WIN32_HEAP_CREATE`.


# On profiling:
I see that around 75% of the time is spent in `memcpy_repmovs()`. 97% of the time is spent in `sqlite3BtreeMovetoUnpacked()`. 

# Questions
1) Why does `memcpy()` hog so much of the CPU when we're only storing around ~1.5MB/sec? Is this normal?
2) For storing around 100 million rows of around 90 byes each, what do you expect the best case insertion rate to be? I'm trying to set my own expectations of what we can achieve with our current approach.

Thanks,

Bhargava

(2) By anonymous on 2021-08-10 11:15:12 in reply to 1 [link]

A couple thoughts:


The number of cores in the machine is irrelevant, for the highest insert speed you need to be using a single thread/core for that. If you are not doing so then try to have a single inserter thread.

I am not sure about your transaction size, but you should be able to achieve much higher throughput with larger transactions on modern hardware.

Have you tried WAL with synchronous=0? This option leads to better insert speeds in some cases.

(4) By srinarasi on 2021-08-11 09:24:16 in reply to 2 [link]

I'm already using a single inserter thread. I played around with multiple inserters but that didn't really improve things much (as expected).

My transaction size is 50,000. Since the bottleneck seems to be memcpy(), do you think tweaking transaction size helps much?

My journal mode is `OFF` and the database is in-memory. Looking at the documentation, I think `synchronous` matters only for writing to the disk.

(7) By anonymous on 2021-08-11 12:50:17 in reply to 4 [link]

I would play around with the transaction size to see if there is a sweet spot to be found.

As for the journal mode, I *think* you should try the WAL mode, since this causes inserts to become simple appends vs changing a b-tree structure (even if you will have to pay later with a checkpoint). With a synchronous = 0 and large transactions your app will mostly be doing everything in memory. 

Also, in WAL mode and given your transaction size, you want to make sure that your page cache is big enough to hold your transactions.

Do you have any reads ongoing while you are inserting?

(13) By srinarasi on 2021-08-12 10:42:14 in reply to 7 [link]

No reads are happening during the test. Playing around with WAL mode and synchronous. It helped a lot. I can now insert 250,000 rows/sec. Thanks!

(3) By J.M. Aranda (JMAranda) on 2021-08-10 17:30:38 in reply to 1 [link]

From experience I would say that an Azure-Core does not necessarily imply an Intel-Core. And I have some 16Gb cards that hold it.

(5) By srinarasi on 2021-08-11 09:25:27 in reply to 3 [link]

That's a good point. I tried it on a real i7 core as well. The profile looks similar. `memcpy()` is the bottleneck and the insertion rate is ~18,000 rows/sec.

(6) By Bill Wade (billwade) on 2021-08-11 12:04:23 in reply to 1 [link]

Often, appends are faster than inserts.

Are your inserts in primary key order? If so, an insert of ~100 records touches a couple of leaf pages (depends on page size, and your string sizes), and about the same number of index pages.

If not, it touches about 100 leaf pages and about 100 index pages.

The thread you referenced was doing inserts in primary key order.

(9) By srinarasi on 2021-08-12 07:45:55 in reply to 6 [link]

Thanks for this! The inserts were not in primary key order. Changing my code to do inserts in primary key order improved the performance by around 80%. Now `memcpy()` takes around 15% of the time. `malloc()` takes around 75%.

I guess the next step is to play around with `page_size` to see if I can do fewer mallocs? The documentation says that `page_size` needs to be set before the database is created. But how do I do it in case of in-memory DBs? 

Do you have any other ideas about speeding up memory allocation?

(12) By srinarasi on 2021-08-12 10:41:29 in reply to 9 [link]

I used `WITHOUT ROWID` option, WAL journaling mode and `synchronous = 0` and played around with `SQLITE_DEFAULT_PCACHE_INITSZ` and `SQLITE_DEFAULT_PAGE_SIZE`. I can now insert ~250,000 rows/sec. I wonder if I can push it to 1,000,000 rows/sec :).

(16) By anonymous on 2021-08-12 11:23:27 in reply to 12 [link]

Do you need to have all the data in a single table? Because if you can live with the data split over a couple tables then here's a trick you can try.

Say you have the data over 4 tables, spawn 4 threads, create a SQLite connection for each. Sqlite should be compiled from the begin-concurrent branch. Now start the large transactions in each thread with BEGIN CONCURRENT rather than BEGIN, with each connection writing into a separate table. This way you ensure commits will not conflict, and you can use multiple cores for the transaction execution up until the commit (which remains serialized with respect to other commits). 

Depending on your transaction size, amount of cores/threads and the duration of serial commits you might very well see an appreciable uplift in throughput.

(17) By anonymous on 2021-08-12 13:27:03 in reply to 16 [link]

Yes unfortunately it has to be on one table. I am just beginner of SQL so I would appreciate some easy approach :)

(18) By srinarasi on 2021-08-12 13:43:07 in reply to 16 [link]

It has to be in a single table. So it's not going to work for us.

(14) By Gunter Hick (gunter_hick) on 2021-08-12 11:03:47 in reply to 9 [link]

It may be faster to create indices (apart from the one corresponding to insert order) after populating the tables.

(8) By anonymous on 2021-08-11 20:02:54 in reply to 1 [link]

1. Is it a <code>WITHOUT ROWID</code> table?
2. What sizes are the strings?

(10) By srinarasi on 2021-08-12 07:48:57 in reply to 8 [link]

1) It's not a `WITHOUT ROWID` table. Does it have an impact on insert performance?
2) The strings are between 0 and 10000 bytes long. On average they are around 70 bytes long. I use `SQLITE_STATIC` while binding them to my prepared statement.

(15) By Gunter Hick (gunter_hick) on 2021-08-12 11:08:38 in reply to 10 [link]

Be sure to keep TEXT and BLOB fields near the end of the row, in decreasing access frequency, and fields used in the indices near the begining of the row. This minimizes the amount of decoding, as accessing any fields located after a TEXT/BLOB with sizeable data will require access to overflow page(s).

(11) By Stephan (stephancb) on 2021-08-12 08:44:35 in reply to 1 [link]

Copying in memory might in this case be avoidable to some extend by using SQLite's [incremental BLOB I/O](https://www.sqlite.org/c3ref/blob_open.html).

Though there still seems to be room for further optimization: The SQLite API could provide

* a function `sqlite3_blob_write_from_io` (or other name) where a file descriptor is passed to SQLite. The function can then retrieve the data directly from the file into its memory, or,

* provide a pointer to its internal buffer and allow the user to fill it (using `read(fd, ...)` if it is directly from I/O represented by a file descriptor).

It seems to me that there is potential to boost insert performance in this case and perhaps many others.

(19) By anonymous on 2021-08-12 18:26:36 in reply to 1 [link]

How often does the data to be loaded change? If your program will run multiple times with the same data, you can save time with a two-phase approach:

1. When the data changes, run a preprocessing program that creates the database in memory and then saves a snapshot to disk using a <code>VACUUM INTO</code> statement.
2. In the main program, use the backup API to load the snapshot from disk into memory. (This is fast because it boils down to allocating a block of memory and reading the entire file into it.)

<https://www.sqlite.org/backup.html>