Increasing insert performance of an in-memory db
We have an in-memory database with a simple table (3 integers and a string; the 3 integers form the primary key; no other indexes at this point). We need to create the DB from scratch on launch and populate the table with 100 million rows. The DB grows to ~9GB once it is fully populated. Currently this takes around 90 minutes, which means the insertion rate is ~18,000 rows/second.
Our code is written in C++. I have tried everything in this thread. My machine is on Azure with 12 cores and has 128 GB of RAM. We compile sqlite with
I see that around 75% of the time is spent in memcpy_repmovs(). 97% of the time is spent in
- Why does memcpy() hog so much of the CPU when we're only storing ~1.5MB/sec? Is this normal?
- For storing around 100 million rows of around 90 bytes each, what do you expect the best-case insertion rate to be? I'm trying to set my own expectations of what we can achieve with our current approach.
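As a quick sanity check on the numbers quoted above, the rate and payload bandwidth can be recomputed from the stated row count, duration, and approximate row size. This is only arithmetic on the figures in the post; the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the post.
rows = 100_000_000          # rows to load
minutes = 90                # reported load time
avg_row_bytes = 90          # approximate row size given in the post

rows_per_sec = rows / (minutes * 60)
payload_mb_per_sec = rows_per_sec * avg_row_bytes / 1_000_000

print(f"{rows_per_sec:,.0f} rows/sec")
print(f"{payload_mb_per_sec:.2f} MB/sec of row payload")
```

The payload rate comes out in the low single-digit MB/sec range, which is why a profile dominated by memory copying looks suspicious.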
A couple thoughts:
The number of cores in the machine is irrelevant: for the highest insert speed you need to use a single thread/core. If you are not doing so, try to have a single inserter thread.
I am not sure about your transaction size, but you should be able to achieve much higher throughput with larger transactions on modern hardware.
Have you tried WAL with synchronous=0? This option leads to better insert speeds in some cases.
I'm already using a single inserter thread. I played around with multiple inserters but that didn't really improve things much (as expected).
My transaction size is 50,000. Since the bottleneck seems to be memcpy(), do you think tweaking transaction size helps much?
My journal mode is OFF and the database is in-memory. Looking at the documentation, I think synchronous matters only for writing to the disk.
I would play around with the transaction size to see if there is a sweet spot to be found.
As for the journal mode, I think you should try WAL mode, since this causes inserts to become simple appends instead of changes to a b-tree structure (even if you have to pay later with a checkpoint). With synchronous = 0 and large transactions your app will mostly be doing everything in memory.
Also, in WAL mode and given your transaction size, you want to make sure that your page cache is big enough to hold your transactions.
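A minimal sketch of the experiment suggested here, written in Python for brevity (the original poster's code is C++): load the same rows with several transaction sizes and compare timings. The table layout follows the description in the first post; the column names, row counts, and batch sizes are assumptions for illustration. The `journal_mode` pragma returns the mode actually in effect, which is worth checking because the SQLite documentation says a pure in-memory database only supports the MEMORY and OFF journal modes.

```python
import sqlite3
import time

def load(rows, batch_size):
    """Insert rows into a fresh in-memory table, committing every batch_size rows."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/COMMIT below are the only transaction boundaries.
    con = sqlite3.connect(":memory:", isolation_level=None)
    # The pragma returns the journal mode that is actually in effect.
    mode = con.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    con.execute("PRAGMA synchronous = 0")
    con.execute(
        "CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, s TEXT, "
        "PRIMARY KEY (a, b, c))"
    )
    start = time.perf_counter()
    for i in range(0, len(rows), batch_size):
        con.execute("BEGIN")
        con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows[i:i + batch_size])
        con.execute("COMMIT")
    elapsed = time.perf_counter() - start
    count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    con.close()
    return mode, count, elapsed

# Hypothetical test data: unique (a, b, c) keys and a 70-byte string.
rows = [(i // 10_000, (i // 100) % 100, i % 100, "x" * 70) for i in range(200_000)]

for batch_size in (1_000, 50_000, 200_000):
    mode, count, elapsed = load(rows, batch_size)
    print(f"batch={batch_size:>7} mode={mode} rows={count} "
          f"rate={count / elapsed:,.0f} rows/sec")
```

Absolute rates from Python will not match a C++ loader, but the relative effect of the batch size should be visible.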
Do you have any reads ongoing while you are inserting?
No reads are happening during the test. Playing around with WAL mode and synchronous. It helped a lot. I can now insert 250,000 rows/sec. Thanks!
From experience I would say that an Azure-Core does not necessarily imply an Intel-Core. And I have some 16Gb cards that hold it.
That's a good point. I tried it on a real i7 core as well. The profile looks similar. memcpy() is the bottleneck and the insertion rate is ~18,000 rows/sec.
Often, appends are faster than inserts.
Are your inserts in primary key order? If so, an insert of ~100 records touches a couple of leaf pages (depends on page size, and your string sizes), and about the same number of index pages.
If not, it touches about 100 leaf pages and about 100 index pages.
The thread you referenced was doing inserts in primary key order.
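To illustrate the point about insert order, here is a small sketch (in Python, with made-up data) that loads the same rows once in shuffled order and once sorted by the primary key. The table layout follows the first post; the column names and row count are assumptions. Sorting before the insert means each new row lands next to the previous one in the b-tree, so far fewer pages are touched.

```python
import random
import sqlite3
import time

def timed_insert(rows):
    """Insert all rows in one transaction and return (row count, seconds)."""
    con = sqlite3.connect(":memory:", isolation_level=None)
    con.execute(
        "CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, s TEXT, "
        "PRIMARY KEY (a, b, c))"
    )
    start = time.perf_counter()
    con.execute("BEGIN")
    con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    con.execute("COMMIT")
    elapsed = time.perf_counter() - start
    count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    con.close()
    return count, elapsed

# Hypothetical test data with unique (a, b, c) keys, in random order.
random.seed(0)
shuffled = [(i // 10_000, (i // 100) % 100, i % 100, "x" * 70) for i in range(200_000)]
random.shuffle(shuffled)

# Same rows, sorted by the primary key columns before inserting.
ordered = sorted(shuffled, key=lambda r: r[:3])

shuffled_count, shuffled_time = timed_insert(shuffled)
ordered_count, ordered_time = timed_insert(ordered)
print(f"shuffled: {shuffled_time:.3f}s  ordered: {ordered_time:.3f}s")
```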
Thanks for this! The inserts were not in primary key order. Changing my code to do inserts in primary key order improved the performance by around 80%. Now memcpy() takes around 15% of the time. malloc() takes around 75%.
I guess the next step is to play around with page_size to see if I can do fewer mallocs? The documentation says that page_size needs to be set before the database is created. But how do I do it in case of in-memory DBs?
Do you have any other ideas about speeding up memory allocation?
WITHOUT ROWID option, WAL journaling mode and synchronous = 0 and played around with SQLITE_DEFAULT_PAGE_SIZE. I can now insert ~250,000 rows/sec. I wonder if I can push it to 1,000,000 rows/sec :).
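On the earlier question of setting the page size for an in-memory database: besides the compile-time SQLITE_DEFAULT_PAGE_SIZE mentioned above, one option that appears to work is issuing `PRAGMA page_size` on the fresh connection before the first table is created, then reading the pragma back to confirm it took effect. The sketch below is in Python; the value 8192 and the column names are arbitrary assumptions, and the table is declared WITHOUT ROWID as in the post above.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Must run before anything is written to the database.
con.execute("PRAGMA page_size = 8192")

# WITHOUT ROWID stores the rows directly in the primary key b-tree.
con.execute(
    "CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, s TEXT, "
    "PRIMARY KEY (a, b, c)) WITHOUT ROWID"
)

# Read the value back to confirm which page size is actually in effect.
page_size = con.execute("PRAGMA page_size").fetchone()[0]
print("page size in effect:", page_size)
```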
Do you need to have all the data in a single table? Because if you can live with the data split over a couple tables then here's a trick you can try.
Say you have the data split over 4 tables: spawn 4 threads and create a SQLite connection for each. SQLite should be compiled from the begin-concurrent branch. Now start the large transactions in each thread with BEGIN CONCURRENT rather than BEGIN, with each connection writing into a separate table. This way you ensure commits will not conflict, and you can use multiple cores for the transaction execution up until the commit (which remains serialized with respect to other commits).
Depending on your transaction size, the number of cores/threads and the duration of the serial commits, you might very well see an appreciable uplift in throughput.
Yes, unfortunately it has to be in one table. I am just a beginner at SQL, so I would appreciate an easy approach :)
It has to be in a single table. So it's not going to work for us.
(14) By Gunter Hick (gunter_hick) on 2021-08-12 11:03:47 in reply to 9
It may be faster to create indices (apart from the one corresponding to insert order) after populating the tables.
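A small sketch of that ordering, in Python with made-up data: populate the table first, then build the secondary index in a single pass. The index name `t_by_s`, the indexed column, and the column names are hypothetical; the table in this thread has no secondary indexes, so this only matters if one is added later.

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)
con.execute(
    "CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, s TEXT, "
    "PRIMARY KEY (a, b, c))"
)

# Hypothetical test data, already in primary key order.
rows = [(i // 10_000, (i // 100) % 100, i % 100, f"value-{i}") for i in range(50_000)]

# 1. Bulk load with only the primary key in place.
con.execute("BEGIN")
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
con.execute("COMMIT")

# 2. Build the secondary index afterwards, once, over the full table.
con.execute("CREATE INDEX t_by_s ON t (s)")

count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
index_names = [
    row[0]
    for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 't'"
    )
]
print(count, index_names)
```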
- Is it a WITHOUT ROWID table?
- What sizes are the strings?
- It's not a WITHOUT ROWID table. Does it have an impact on insert performance?
- The strings are between 0 and 10000 bytes long. On average they are around 70 bytes long. I use SQLITE_STATIC while binding them to my prepared statement.
(15) By Gunter Hick (gunter_hick) on 2021-08-12 11:08:38 in reply to 10
Be sure to keep TEXT and BLOB fields near the end of the row, in decreasing access frequency, and fields used in the indices near the beginning of the row. This minimizes the amount of decoding, as accessing any fields located after a TEXT/BLOB with sizeable data will require access to overflow page(s).
Copying in memory might in this case be avoidable to some extent by using SQLite's incremental BLOB I/O.
Though there still seems to be room for further optimization: The SQLite API could provide sqlite3_blob_write_from_io (or other name) where a file descriptor is passed to SQLite. The function can then retrieve the data directly from the file into its memory, or provide a pointer to its internal buffer and allow the user to fill it (using read(fd, ...) if it is directly from I/O represented by a file descriptor).
It seems to me that there is potential to boost insert performance in this case and perhaps many others.
How often does the data to be loaded change? If your program will run multiple times with the same data, you can save time with a two-phase approach:
- When the data changes, run a preprocessing program that creates the database in memory and then saves a snapshot to disk using a
- In the main program, use the backup API to load the snapshot from disk into memory. (This is fast because it boils down to allocating a block of memory and reading the entire file into it.)
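The two phases can be sketched with Python's `sqlite3.Connection.backup`, which wraps SQLite's online backup API. This is an illustration under assumptions: the snapshot path, table layout, and row count are made up, and the backup API is used for both directions here, which may differ from what the post intended for the save step.

```python
import os
import sqlite3
import tempfile

# Hypothetical location for the on-disk snapshot.
snapshot_path = os.path.join(tempfile.mkdtemp(), "snapshot.db")

# Phase 1 (preprocessing program): build the database in memory, save it to disk.
src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, s TEXT, "
    "PRIMARY KEY (a, b, c))"
)
src.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(i // 100, i % 100, 0, "x" * 70) for i in range(10_000)],
)
src.commit()
disk = sqlite3.connect(snapshot_path)
src.backup(disk)          # copy every page of the in-memory DB into the file
disk.close()
src.close()

# Phase 2 (main program): load the snapshot from disk into a new in-memory DB.
disk = sqlite3.connect(snapshot_path)
mem = sqlite3.connect(":memory:")
disk.backup(mem)          # page-level copy, no per-row INSERT work
disk.close()

loaded = mem.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print("rows available in memory:", loaded)
```

The second phase avoids parsing, b-tree inserts, and per-row allocation, which is where the original load spends its time.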