Faster way to insert into WITHOUT ROWID table?

(1) By Matthijs Tijink (mtijink) on 2021-11-25 10:02:46 [link] [source]

Hi all,

I have a table like this:

CREATE TABLE data (a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER, d INTEGER, PRIMARY KEY (a, b)) WITHOUT ROWID;

I want to fill this table as fast as possible with ~20 million rows, so I do some of the usual stuff (one big transaction, prepared statements). Inserting in order also helps a bit. This takes ~19-23 seconds on my machine.

However, this is still roughly 3-4x slower than inserting in a table without primary key and with rowid, which takes ~5-6 seconds. In both cases SQLite is bottlenecked on CPU, not on disk.

I tried a couple of things already, but they didn't help much:

Using an in-memory database. Helps a bit, but still slower than without the primary key on-disk.
Not using a journal. Didn't do a lot, presumably because the table was empty initially.
Inserting all data in an identical in-memory temporary "data2" table, and using INSERT INTO SELECT FROM. Ideally, SQLite should see that the data is already sorted and unique and skip some of the work. Intended to simulate what would happen if I'd create a virtual table. Roughly as fast as inserting directly into the "data" table.
Create and fill a normal table and create an index afterwards. Not ideal, because it creates a larger database and is slower when reading. And it turns out that it's slower than the original approach.

Are there approaches I'm missing? Things I can somehow tell to SQLite to make this faster?

Or is there a way to improve SQLite for this usecase? Ideally approach 3 (with a virtual table) can skip a lot of the insertion work.

(2) By ddevienne on 2021-11-25 12:04:56 in reply to 1 [link] [source]

You haven't said how that DB / table is used later. But If your use case is really as described,
i.e. only fixed-size columns, then you can have random access into your data, by direct indexing,
whether it on-disk or in-memory, if you implement your own virtual table (and indexing) on top of
that regular / tabular data.

Might end up bigger that the SQLite DB, which uses varints, if you intergers are small.
But the extra space is worth it, for direct random-access addressing, performance-wise.

A long time ago, I measured vtables over C++ std::unordered_map-like containers of structs,
being 5x faster than equivalent regular in-memory SQLite tables. And that was even w/o direct
addressing like in your use-case (node-based hash map, so each row is a separate alloc / address).

Short of the above, you've covered all the bases, from what you wrote. FWIW. --DD

(3) By Matthijs Tijink (mtijink) on 2021-11-25 15:51:29 in reply to 2 [link] [source]

Thank you for your reply, good to know.

It's intended for data storage and transfer to other systems, for which SQLite is excellent. The other side might read the data into some in-memory structure of course, but I do need to store the data in regular tables first.

A virtual table approach would work for generating the data (although it's compute-heavy, not completely random access). However, my experiment above seems to show that it's not faster with INSERT INTO SELECT FROM virtual-table than by just inserting directly.

(4) By Keith Medcalf (kmedcalf) on 2021-11-25 17:58:18 in reply to 1 [link] [source]

Havew you increased the cache size?

(5) By Matthijs Tijink (mtijink) on 2021-11-26 09:17:08 in reply to 4 [link] [source]

Yes, but that does not make a difference. I'm not suprised, since SQLite is CPU bound in this case.

(6) By Richard Hipp (drh) on 2021-11-26 11:26:40 in reply to 5 [link] [source]

How are you executing the SQL that does the inserting? Are you preparing a new SQL statement for each row? Or do you prepare a single INSERT statement using parameters, then bind data to the parameters and rerun that one statement for each row?

(7) By Matthijs Tijink (mtijink) on 2021-11-29 08:08:55 in reply to 6 [link] [source]

I'm doing something like this (error checking omitted), so preparing a single time and repeatedly run the statement:

sqlite3_exec(db, "CREATE TABLE data (a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER, d INTEGER, PRIMARY KEY (a, b)) WITHOUT ROWID;", nullptr, nullptr, nullptr);

sqlite3_stmt *stmt;
sqlite3_prepare_v2(db, "INSERT INTO data (a, b, c, d) VALUES (?, ?, ?, ?);", -1, &stmt, nullptr);

sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);

//Real computation loop is a bit more complicated
for (int i = 0; i < N; ++i) {
    int c, d;
    compute_result(a[i], b[i], &c, &d);

    sqlite3_bind_int(stmt, 1, a[i]);
    sqlite3_bind_int(stmt, 2, b[i]);
    sqlite3_bind_int(stmt, 3, c]);
    sqlite3_bind_int(stmt, 4, d);
    sqlite3_step(stmt);
    sqlite3_reset(stmt);
}

sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

(8) By Gunter Hick (gunter_hick) on 2021-11-29 08:32:46 in reply to 7 [link] [source]

Can you compute_result() outside the INSERT loop? Maybe you are actually CPU bound computing the values to insert. Or at least run with a tool that gives you percentages of runtime spent in SQLite code and your code (like gcov and gprof on linux).

(9) By Matthijs Tijink (mtijink) on 2021-11-29 08:45:46 in reply to 8 [link] [source]

Actually that's what I do: I compute the values on other threads and let the SQLite thread only pull from a queue. Profilers also confirm that >99% of the SQLite thread time is spent in SQLite code.

I also verified that the computation is not the bottleneck, by first computing all results. After that I measured only the insertion loop time, which was the same.

(10) By Gunter Hick (gunter_hick) on 2021-11-29 08:53:45 in reply to 9 [link] [source]

OK, from your example code it looked like c and d are computed using the a and b array entries, which potentially may be a "CPU cycle sink".

(11) By anonymous on 2021-11-29 10:05:44 in reply to 1 [link] [source]

What are the ranges of the a and b values? Can they be combined into one value that still fits in 64 bits?

(12) By Matthijs Tijink (mtijink) on 2021-11-29 14:15:58 in reply to 11 [link] [source]

Ah, interesting suggestion. Yes, they are nonnegative and less than 2^31, so you could indeed combine them into a single 64-bit value.

And for compatibility I could define a and b as generated columns. So you'd get:

CREATE TABLE data (ab INTEGER NOT NULL PRIMARY KEY, c INTEGER, d INTEGER, a INTEGER GENERATED ALWAYS AS (ab >> 32), b INTEGER GENERATED ALWAYS AS (ab & 0xffffffff));

I think I lose the ability to quickly filter on a and b tough, is that correct?

That is something to consider in general for me: is inserting faster worth giving up fast lookups on (a, b). Because I could also just make it a normal table and don't create an index on (a, b).

(13) By Bill Wade (billwade) on 2021-11-29 15:02:30 in reply to 12 [link] [source]

I suspect that even the original form doesn't have fast filtering on b, unless you specifically create a "b" index.

In principle, the query planner could do something smart for
   select ... where ab>>32 = :a;
but I'd be a bit surprised.

I would hope it would do something smart for
   select ... where ab >= :a<<32 and ab < (:a+1)<<32;
so you might be able to do something with that.

(14) By Matthijs Tijink (mtijink) on 2021-11-29 16:59:05 in reply to 13 [link] [source]

No indeed, but it does have fast filtering on (a, b) together.

I can't modify the consumers of the database, so for backwards compatibility I can't tell them to use ab directly, unfortunately.

(15) By Matthijs Tijink (mtijink) on 2021-11-29 17:05:39 in reply to 1 [link] [source]

I did some more experimenting, and I see the following instructions in the EXPLAIN output:

NoConflict
Halt
IdxInsert

I modified the implementation of the NoConflict to pass bias = 1 to sqlite3BtreeMovetoUnpacked, since I'm inserting in order. And that gives quite a speed boost! (I can't exactly reproduce last week's timings, but I saw a ~20-30% speedup today)

Would it makes sense to add an OPFLAG for NoConflict that it's most likely an append? And emit it too in the parser, possibly only when using INSERT INTO SELECT FROM with a matching ordering?

(16) By Richard Hipp (drh) on 2021-11-29 18:00:36 in reply to 15 [link] [source]

I think you must have made your modifications on an older version of SQLite. The btree "moveto" routines have now been split up into separate procedures for table-btrees and index-btrees (for better performance) and the index-btree version does not have an append-bias flag. See check-in 3b0d34e5e5f9a16c for the details of the change.

(17) By Matthijs Tijink (mtijink) on 2021-11-30 09:40:12 in reply to 16 [link] [source]

I did, sorry for that. I checked again with 3.37.0, and the results I get are similar (maybe a little bit faster). The modification with the append-bias flag on the earlier version (3.35.4) was faster than the newest version though.

Looking further at the code, I found that I can skip the NoConflict and directly do an insert if I use INSERT OR REPLACE INTO. That's not faster or slower, this does mean I can pass the BTREE_APPEND flag to sqlite3BtreeInsert. But that flag isn't used, since the insert is an "unpacked record" :-(

From profiling it seems that quite a large part of the time is being spent finding the place to insert the new row. This is because it apparently goes to the root page and descends the B-tree for every single row being inserted. I think most of that can be avoided if the seekResult parameter and/or the BTREE_SAVEPOSITION flag of sqlite3BtreeInsert can be used somehow. I'm not familiar enough with the SQLite code to do that though.

From the comments, it appears that inserts in rowid-tables do have such optimizations. I might have misunderstood these parts of the SQLite code, but given that the above is true-ish: is it worthwhile to add such optimizations for without-rowid tables too? Ideally the NoConflict opcode could use this optimization too.

(18) By Richard Hipp (drh) on 2021-11-30 14:46:59 in reply to 17 [link] [source]

is it worthwhile to add such optimizations for without-rowid tables too?

Such an "optimization" actually slows things down in the common case. If we had a way to determine when the optimization would benefit, then it would be worth adding. But I don't yet have an algorithm to do that.

(19) By Bill Wade (billwade) on 2021-11-30 16:57:14 in reply to 18 [source]

Perhaps a pragma ordered_bulk_insert could make the planner choose to attempt the optimization?

(20) By Matthijs Tijink (mtijink) on 2021-12-01 09:01:15 in reply to 18 [link] [source]

I would guess it's often useful in these conditions:

It is an INSERT INTO SELECT FROM (i.e. the target table is likely empty)
The target table is WITHOUT ROWID or it has a related index
The SELECT FROM is ordered in the same way as the table or index

Regardless of what you decide regarding implementation (or deciding not to), thank you for answering my questions on this topic.

(21) By Gunter Hick (gunter_hick) on 2021-12-01 09:52:06 in reply to 20 [link] [source]

ad 1) CREATE TABLE AS SELECT guarantees that the target table is empty. INSERT INTO SELECT FROM is a common pattern for "imported transactions" (e.g from a csv file in an external format) that need working on to integrate into the schema.

ad 2) Any benefit ist limited to target tables that have exactly one index, be it the PRIMARY KEY of a WITHOUT ROWID table, the sole UNIQUE constraint of a ROWID table or the single UNIQUE INDEX on a ROWID table; any additional index will be unordered load

ad 3) I don't think INSERT INTO SELECT FROM ORDER BY is a common figure because SQL is based on (unordered) sets as opposed to (ordered) permutations and set operations should be independent of the implementation dependent order of processing. Also, since the question is "initially populating from external source", the source table will have to be a virtual table that knows and reports (via the xBestIndex method) the pre-existing order of the records