Increasing insert performance of an in-memory db
(1) By srinarasi on 2021-08-10 08:29:29 [link] [source]
Some context
We have an in-memory database with a simple table(3 integers and a string, the first 3 numbers form the primary key, no other indexes at this point ). We need to create the DB from scratch on launch and populate 100 million rows in the table. The DB grows to ~9GB after it's completely populated. Currently it takes around 90 minutes which means the insertion rate is ~18000 rows/second.
Our code is written in C++. I have tried everything in this thread. My machine is on Azure with 12 cores and has 128 GB of RAM. We compile sqlite with SQLITE_WIN32_MALLOC
and SQLITE_WIN32_HEAP_CREATE
.
On profiling:
I see that around 75% of the time is spent in memcpy_repmovs()
. 97% of the time is spent in sqlite3BtreeMovetoUnpacked()
.
Questions
- Why does
memcpy()
hog so much of the CPU when we're only storing around ~1.5MB/sec? Is this normal? - For storing around 100 million rows of around 90 byes each, what do you expect the best case insertion rate to be? I'm trying to set my own expectations of what we can achieve with our current approach.
Thanks,
Bhargava
(2) By anonymous on 2021-08-10 11:15:12 in reply to 1 [link] [source]
A couple thoughts:
The number of cores in the machine is irrelevant, for the highest insert speed you need to be using a single thread/core for that. If you are not doing so then try to have a single inserter thread.
I am not sure about your transaction size, but you should be able to achieve much higher throughput with larger transactions on modern hardware.
Have you tried WAL with synchronous=0? This option leads to better insert speeds in some cases.
(3) By J.M. Aranda (JMAranda) on 2021-08-10 17:30:38 in reply to 1 [link] [source]
From experience I would say that an Azure-Core does not necessarily imply an Intel-Core. And I have some 16Gb cards that hold it.
(4) By srinarasi on 2021-08-11 09:24:16 in reply to 2 [link] [source]
I'm already using a single inserter thread. I played around with multiple inserters but that didn't really improve things much (as expected).
My transaction size is 50,000. Since the bottleneck seems to be memcpy(), do you think tweaking transaction size helps much?
My journal mode is OFF
and the database is in-memory. Looking at the documentation, I think synchronous
matters only for writing to the disk.
(5) By srinarasi on 2021-08-11 09:25:27 in reply to 3 [link] [source]
That's a good point. I tried it on a real i7 core as well. The profile looks similar. memcpy()
is the bottleneck and the insertion rate is ~18,000 rows/sec.
(6) By Bill Wade (billwade) on 2021-08-11 12:04:23 in reply to 1 [link] [source]
Often, appends are faster than inserts.
Are your inserts in primary key order? If so, an insert of ~100 records touches a couple of leaf pages (depends on page size, and your string sizes), and about the same number of index pages.
If not, it touches about 100 leaf pages and about 100 index pages.
The thread you referenced was doing inserts in primary key order.
(7) By anonymous on 2021-08-11 12:50:17 in reply to 4 [link] [source]
I would play around with the transaction size to see if there is a sweet spot to be found.
As for the journal mode, I think you should try the WAL mode, since this causes inserts to become simple appends vs changing a b-tree structure (even if you will have to pay later with a checkpoint). With a synchronous = 0 and large transactions your app will mostly be doing everything in memory.
Also, in WAL mode and given your transaction size, you want to make sure that your page cache is big enough to hold your transactions.
Do you have any reads ongoing while you are inserting?
(8) By anonymous on 2021-08-11 20:02:54 in reply to 1 [link] [source]
- Is it a
WITHOUT ROWID
table? - What sizes are the strings?
(9) By srinarasi on 2021-08-12 07:45:55 in reply to 6 [link] [source]
Thanks for this! The inserts were not in primary key order. Changing my code to do inserts in primary key order improved the performance by around 80%. Now memcpy()
takes around 15% of the time. malloc()
takes around 75%.
I guess the next step is to play around with page_size
to see if I can do fewer mallocs? The documentation says that page_size
needs to be set before the database is created. But how do I do it in case of in-memory DBs?
Do you have any other ideas about speeding up memory allocation?
(10) By srinarasi on 2021-08-12 07:48:57 in reply to 8 [link] [source]
- It's not a
WITHOUT ROWID
table. Does it have an impact on insert performance? - The strings are between 0 and 10000 bytes long. On average they are around 70 bytes long. I use
SQLITE_STATIC
while binding them to my prepared statement.
(11) By Stephan (stephancb) on 2021-08-12 08:44:35 in reply to 1 [link] [source]
Copying in memory might in this case be avoidable to some extend by using SQLite's incremental BLOB I/O.
Though there still seems to be room for further optimization: The SQLite API could provide
a function
sqlite3_blob_write_from_io
(or other name) where a file descriptor is passed to SQLite. The function can then retrieve the data directly from the file into its memory, or,provide a pointer to its internal buffer and allow the user to fill it (using
read(fd, ...)
if it is directly from I/O represented by a file descriptor).
It seems to me that there is potential to boost insert performance in this case and perhaps many others.
(12) By srinarasi on 2021-08-12 10:41:29 in reply to 9 [link] [source]
I used WITHOUT ROWID
option, WAL journaling mode and synchronous = 0
and played around with SQLITE_DEFAULT_PCACHE_INITSZ
and SQLITE_DEFAULT_PAGE_SIZE
. I can now insert ~250,000 rows/sec. I wonder if I can push it to 1,000,000 rows/sec :).
(13) By srinarasi on 2021-08-12 10:42:14 in reply to 7 [link] [source]
No reads are happening during the test. Playing around with WAL mode and synchronous. It helped a lot. I can now insert 250,000 rows/sec. Thanks!
(14) By Gunter Hick (gunter_hick) on 2021-08-12 11:03:47 in reply to 9 [link] [source]
It may be faster to create indices (apart from the one corresponding to insert order) after populating the tables.
(15) By Gunter Hick (gunter_hick) on 2021-08-12 11:08:38 in reply to 10 [link] [source]
Be sure to keep TEXT and BLOB fields near the end of the row, in decreasing access frequency, and fields used in the indices near the begining of the row. This minimizes the amount of decoding, as accessing any fields located after a TEXT/BLOB with sizeable data will require access to overflow page(s).
(16) By anonymous on 2021-08-12 11:23:27 in reply to 12 [link] [source]
Do you need to have all the data in a single table? Because if you can live with the data split over a couple tables then here's a trick you can try.
Say you have the data over 4 tables, spawn 4 threads, create a SQLite connection for each. Sqlite should be compiled from the begin-concurrent branch. Now start the large transactions in each thread with BEGIN CONCURRENT rather than BEGIN, with each connection writing into a separate table. This way you ensure commits will not conflict, and you can use multiple cores for the transaction execution up until the commit (which remains serialized with respect to other commits).
Depending on your transaction size, amount of cores/threads and the duration of serial commits you might very well see an appreciable uplift in throughput.
(17) By anonymous on 2021-08-12 13:27:03 in reply to 16 [link] [source]
Yes unfortunately it has to be on one table. I am just beginner of SQL so I would appreciate some easy approach :)
(18) By srinarasi on 2021-08-12 13:43:07 in reply to 16 [source]
It has to be in a single table. So it's not going to work for us.
(19) By anonymous on 2021-08-12 18:26:36 in reply to 1 [link] [source]
How often does the data to be loaded change? If your program will run multiple times with the same data, you can save time with a two-phase approach:
- When the data changes, run a preprocessing program that creates the database in memory and then saves a snapshot to disk using a
VACUUM INTO
statement. - In the main program, use the backup API to load the snapshot from disk into memory. (This is fast because it boils down to allocating a block of memory and reading the entire file into it.)