SQLite Forum

Hctree is an experimental high-concurrency database backend for SQLite.
Login

Hctree is an experimental high-concurrency database backend for SQLite.

(1) By Dan Kennedy (dan) on 2023-01-18 15:57:08 [link] [source]

Hctree is an experimental high-concurrency database backend for SQLite. It uses optimistic row-level locking.

https://sqlite.org/hctree

The prototype is still missing features and contains many bugs. But it works well enough to test. Some preliminary performance test results are here:

https://sqlite.org/hctree/doc/hctree/doc/hctree/threadtest.wiki

If you have comments or suggestions, or know how a high-concurrency embedded database should be really done and want to tell us, or if you try testing hctree yourself, please post here!

Dan.

(2) By punkish on 2023-01-18 16:54:16 in reply to 1 [link] [source]

Hctree is an experimental high-concurrency database backend for SQLite. It uses optimistic row-level locking.

why not call it "HiLite"?

(3) By Donald Griggs (dfgriggs) on 2023-01-18 18:58:24 in reply to 1 [link] [source]

That's an exciting development.

Small possible typos when you're next editing https://sqlite.org/hctree/doc/hctree/doc/hctree/index.html

"CREATE TABLE t1(INTEGER PRIMARY KEY, b TEXT, c TEXT);"  Maybe exactly as desired, but I wondered if "a INTEGER" was intended)

"Then during transaction execution, hct records " Maybe "hctree"

"If, when the attempt to commit is made, the hctree clients finds..." I wondered if the "s" on "clients" was intended.

Hoping to help -- not at all complaining.

Congratulations on bringing hctree to this stage!

Donald

(4) By anonymous on 2023-01-18 20:22:44 in reply to 1 [link] [source]

That is very exciting work, a quick question, the current limitations mention that all concurrent connections need to be in the same os process as of now. Is multi-process support a design goal?

(5) By Dan Kennedy (dan) on 2023-01-18 21:04:24 in reply to 4 [source]

Is multi-process support a design goal?

Yes. Certainly is.

Dan.

(11) By anonymous on 2023-01-19 13:02:04 in reply to 5 [link] [source]

That's awesome! Can't wait to try it out.

I have been thinking about this problem (concurrency) for a while and I was wondering if there could be a simpler way? Would it make sense, at commit time, and when the pages are checked for conflicts to just redo the reads (all reads, not just selects) for the conflicting transaction and check if they match the previous results? This may or may not be keeping around more data during the transaction lifetime, but it wouldn't require touching the database and adding all these TID and CID data. The caveat though is that reads need to be deterministic, but that's orthogonal and can perhaps be discussed separately.

That approach has multiple potential benefits as it doesn't care about the amount of visited data, rather just the results. In a way this is field level optimistic locking, or even, consider a case where you calculated a sum of a specific field over a large table and wrote the result in another one. Then multiple rows where changed while you were running this transaction, but the effect on the sum was the same, you got the same result, in that case the read will get you the same value and there will be no conflict, which is true from the perspective of the long running transaction

Also, one other thought, even in the current model, if the transaction is not returning any data to the user (no selects and no return clauses, etc.), i.e. fully idempotent, wouldn't it make sense to silently redo it at commit time if it conflicts? Sparing the users the need to retry it themselves?

(14) By Dan Kennedy (dan) on 2023-01-19 16:49:07 in reply to 11 [link] [source]

I have been thinking about this problem (concurrency) for a while and I was wondering if there could be a simpler way? Would it make sense, at commit time, and when the pages are checked for conflicts to just redo the reads (all reads, not just selects) for the conflicting transaction and check if they match the previous results?

I don't think it doesn't make sense. An engineering tradeoff I suppose.

This may or may not be keeping around more data during the transaction lifetime, but it wouldn't require touching the database and adding all these TID and CID data. The caveat though is that reads need to be deterministic, but that's orthogonal and can perhaps be discussed separately.

I don't think this part of the idea works for hctree as it is, because we need the TID values in the database (CID values are never stored in the db, only in memory) for readers. A reader might load a page that contains modifications made since its transaction was started, and in that case it uses the TID value associated with each entry on the page to tell which it should ignore and which are actually part of its database snapshot.

You're quite right that the TID values are taking up too much space at present though. I think that needs to be addressed by (a) removing the TID fields once it is certain that there are no readers so old that they are required, and (b) storing them in fewer bytes when they are first written by making all TID values on a page relative to some TID value in a page header field.

That approach has multiple potential benefits as it doesn't care about the amount of visited data, rather just the results. In a way this is field level optimistic locking, or even, consider a case where you calculated a sum of a specific field over a large table and wrote the result in another one.

A finer grained optimistic logging.

Right now, if your transaction contains something like:

SELECT a, b FROM tbl WHERE rowid BETWEEN 10 AND 20 AND (rowid%2)==0;

hctree records two pieces of information for validation while processing the statement:

  • that the rows of tbl with rowids between 10 and 20 were accessed, and
  • the set of leaf pages that those rows span.

then, at validation time, it first checks if any of the leaf pages identified have been modified. If not, transaction validation passes. We try this page-level locking step first because it's fast - checking if a page has been modified is just a matter of comparing the 8-byte page-map value. If any of those leaves have been modified then it falls back to actually checking rows 10 through 20 by inspecting TID values.

But this is still coarser than it could have been, as validation will fail if a row with a rowid that does not match the unindexed ((rowid%2)==0) constraint, or if some field not read by the SELECT (i.e. "c" or "d") is modified in a row.

So I guess you're saying why stop there? Why not store the entire SELECT. When it's run as part of the transaction record a fast cryptographic checksum of the results returned. Then check that this same checksum can still be obtained during validation.

multiple rows where changed while you were running this transaction, but the effect on the sum was the same, you got the same result, in that case the read will get you the same value and there will be no conflict

Yikes! I don't see any reason why that would not be valid though. I guess toggled fields is a similar case.

Also, one other thought, even in the current model, if the transaction is not returning any data to the user (no selects and no return clauses, etc.), i.e. fully idempotent, wouldn't it make sense to silently redo it at commit time if it conflicts?

Maybe, but I think it's better to let a layer above SQLite do that. You'd have to store the SQL program - statements and bindings - for the duration of the transaction, which we don't currently do. And you would have to figure out what to do if the transaction kept on failing.

Dan.

(6) By skywalk on 2023-01-18 21:44:55 in reply to 1 [link] [source]

Very interesting data.
I thought I read here that it is more beneficial to increase page size, not lower it?!
Even at 1 thread it is neck and neck.

(7) By anonymous on 2023-01-19 02:53:38 in reply to 6 [link] [source]

If my thinking is correct, I would speculate that the difference is likely due to hct dealing with finer pieces (optimistic row level locking).  So it makes sense to me that hct1024 would outperform larger blocks of hct4096 (default hct) because the row's payload are relatively small in the tests.  Once the memory bandwidth is saturated (i.e. with larger payloads) the hct4096 (default hct) might even things up a bit.

But I'm just speculating.  Definitely looking forward to running some tests with the hcttree backend myself.

It would be interesting to see that same test machine compare the other journal modes of DELETE (the traditional non WAL mode), and OFF (no journal as a 0 baseline).

2023 is looking interesting for SQLite.

(8) By ddevienne on 2023-01-19 09:48:19 in reply to 6 [link] [source]

I thought I read here that it is more beneficial to increase page size, not lower it?!

That's explained in the page. The IO unit is always page-sized. If you change anything, even a single byte, you must write at least a page.
So if you make small updates, having pages 4x larger involved 4x the IO. And Dan wrote that with larger pages his test saturates the memory bus
of the machine he's testing on, while with 4x smaller pages, it decreases the IO enough to no longer saturate it, so becomes CPU bound instead.

(9) By Eduardo on 2023-01-19 12:10:13 in reply to 8 [link] [source]

1KB is oriented then to r < w (write heavy) oriented databases. If you have r > w databases perhaps 4KB pages takes more sense. A stable db with low writes

Note also that the schema used for testing is :

CREATE TABLE tbl( a INTEGER PRIMARY KEY, b BLOB(200), c CHAR(64) ); CREATE INDEX tbl_i1 ON tbl(substr(c, 1, 16)); CREATE INDEX tbl_i2 ON tbl(substr(c, 2, 16));

Didn't test, but I suspect that changing the blob to 3000, so each row is bigger than 1KB, will make 4KB the winner.

(10) By fnavarro on 2023-01-19 12:36:41 in reply to 1 [link] [source]

Dan, very interesting. Thanks for your efforts on this. I would like to ask if it is on the design goals to be able to store the db file to on a shared filesystem for being accessed from multiple devices simoultaneously, being resilient to mutexes implementations and cache consistency problems at least as an ISAM file could be on the same environment or if a server component would be envisored.

(16) By Dan Kennedy (dan) on 2023-01-19 16:56:42 in reply to 10 [link] [source]

if it is on the design goals to be able to store the db file to on a shared filesystem for being accessed from multiple devices simoultaneously

A file-system shared over a network? I don't think so. For the same reason as SQLite in wal mode can't do it - all clients need access to common shared memory.

Dan.

(17) By fnavarro on 2023-01-19 23:55:13 in reply to 16 [link] [source]

Thanks Dan for the answer. Yes I meant over a network. I thought if it could be used in this specific segment, by small transaction processing systems or similar with traditional desktop software architecture recording its data simoultaneously from some machines on a local server, for example in some cases to migrate ISAM files to an embedded SQL database (or perhaps with a small server component) without installing a database server engine. As it seemed to be possible on Berkeley DB before abandoning support for client/server and SQLite Api. But I did understand that it is not the goal, and that the idea itself would be very unlikely in any case.

(18) By Donal Fellows (dkfellows) on 2023-01-20 16:48:03 in reply to 17 [link] [source]

The problem is that network filesystems have some subtly broken semantics, especially in relation to locking, and that these problems are essentially inherent to how networking functions. This is because of the sheer number of potential weird failure modes, including some idiot unplugging a router in the middle of a commit in such a way that neither side of the connection can spot the problem for a while, or another device deciding to flood the network with packets such that comms become essentially impossible until the offending system is forcefully disconnected. Typical OS handling of networked filesystems provides no way to properly report these failures, or indeed any way to even do non-blocking access to files, so you instead get some very bad behaviours. (I remember losing a week of work back in the 1990s to a severely misbehaving NFS server that spent about 95% of the time in a crashed state with every process on my workstation — including the Xserver — blocked waiting to page in from the downed service.)

By the point you get to needing distributed databases (with write access anywhere; totally read-only access is different) you probably need to switch to using a server-based database solution.

(12) By anonymous on 2023-01-19 14:26:22 in reply to 1 [link] [source]

Any thoughts on how to synchronize between different processes? Would something like NOTIFY / LISTEN be in scope of SQLite or are there other mechanisms already?

/jonasb

(13) By Gunter Hick (gunter_hick) on 2023-01-19 15:23:47 in reply to 12 [link] [source]

NOTIFY/LISTEN requires a client/server architecture. I don't think something advertised as a "backend" will change SQLite from a library to a client/server.

(15) By ddevienne on 2023-01-19 16:52:44 in reply to 13 [link] [source]

NOTIFY/LISTEN requires a client/server architecture

Not if you restrict it to same-host notifications, I think.
Then you can use shared-memory and regular IPCs to implement localhost NOTIFY/LISTEN.
Not as generic, sure, but Dan answered multi-process, which does not necessarily imply those processes can be on different hosts.

(19) By SeverKetor on 2023-01-21 03:43:11 in reply to 1 [link] [source]

Well this is a nice way to start the year; it already looks pretty interesting. I have a few questions about it:
1) Is this planned to be merged into the main SQLite trunk, or is this always (as far as you know) going to be a separate thing?
2) The docs say that transactions and rollback are not supported. However, in multiple places following, it talks about how transactions work and a bit about how it does rollbackd. Are those just single-statement transactions and forced-rollbacks (such as caused by errors in a statement) then?
3) Is there a rough estimate on when this will be fully ready? Months? Next year? Further out, or just not really possible to say?

(20) By Dan Kennedy (dan) on 2023-01-21 13:57:38 in reply to 19 [link] [source]

Is this planned to be merged into the main SQLite trunk, or is this always (as far as you know) going to be a separate thing?

No plans to merge this in. It would be quite a change of direction for SQLite, to say the least.

The docs say that transactions and rollback are not supported.

Hctree supports transactions in all the same ways as stock SQLite does.

This is a documentation problem I think. The design is split into two layers - a fast on-disk tree-like data structure and a layer to manage transactions above that. It's the tree structure that provides no support for transactions or rollback - as that is done by the layer above.

Is there a rough estimate on when this will be fully ready? Months? Next year? Further out, or just not really possible to say?

Really couldn't say at this point.

Dan.

(21) By SeverKetor on 2023-01-21 14:49:12 in reply to 20 [link] [source]

Alrighty, thanks for the answers.

I do have a follow-up question: the documentation lists 8 shortcomings. Since you mentioned it will stay on its own branch, does that mean some of these are inherent to Hctree, or is there some other reason to not merge it (like increased library size or some backwards compatibility issue(s))? Primarily looking at #8 (no support for recovering from power failure or OS crash).

I can see an eventual use-case, so I'd like to get my ducks in a row since I'd have some convincing to do before being able to use it.

(22) By Dan Kennedy (dan) on 2023-01-21 19:48:19 in reply to 21 [link] [source]

the documentation lists 8 shortcomings. Since you mentioned it will stay on its own branch, does that mean some of these are inherent to Hctree, or is there some other reason to not merge it (like increased library size or some backwards compatibility issue(s))?

The SQLite project has very high standards for backwards compatibility and interface stability. And once something is in it, it has to be maintained forever. I think more flexibility would be nice for quite a while yet.

Primarily looking at #8 (no support for recovering from power failure or OS crash).

I mean to fix them all at some point, but I think (8) is a fairly low priority. And it's difficult to say how much performance will be impacted by any solution for this.

Dan.

(23) By Eduardo on 2023-01-22 21:19:25 in reply to 1 [link] [source]

Dan, very nice work. I've tested it it and is FAST, yes, with capital letters. Didn't tried in crash and unplug scenarios so don't know how well does it recover

Very very good work Sir.

(24) By Gregory Petrosyan (pgregory) on 2023-01-23 11:25:14 in reply to 1 [link] [source]

Very exciting work!

What is the general relationship of hctree and "regular" BEGIN CONCURRENT + wal2 work, does hctree use WAL at all? Will hctree likely accelerate or slowdown the wal2 mode development?

(25) By Dan Kennedy (dan) on 2023-01-24 17:26:38 in reply to 24 [link] [source]

What is the general relationship of hctree and "regular" BEGIN CONCURRENT + wal2 work, does hctree use WAL at all?

It doesn't make use of wal mode. Hctree replaces the blue "backend" box in the diagram here with something different:

https://www.sqlite.org/arch.html

So they're quite different things in some ways. (BEGIN CONCURRENT + wal2) allows you to write concurrently prepared transactions to a regular SQLite database. Hctree uses a different database format designed to allow concurrent writers.

Dan.

(26) By ddevienne on 2023-01-24 17:48:04 in reply to 25 [link] [source]

It doesn't make use of wal mode. Hctree replaces the blue "backend" box

Could being able to replace the backend be something that becomes a new extension point?

Hctree obviously reuses a large part of SQLite, but if they are separate, the long term maintenance will be problematic, no?

Plus being able to replace the backend at a different level than the VFS (a large one in fact) would open up the SQLite ecosystem even more.

I'm sure it's not that simple. But surely this was already discussed. So can you share some thoughts in that regard?

(27) By Dan Kennedy (dan) on 2023-01-25 16:10:00 in reply to 26 [link] [source]

Could being able to replace the backend be something that becomes a new extension point?

Not impossible I would think. The biggest problem is that it's not a stable interface and is not likely to ever be deemed one.

Hctree obviously reuses a large part of SQLite, but if they are separate, the long term maintenance will be problematic, no?

It's not really a factor. We have good tools for this these days:

https://www.sqlite.org/privatebranch.html

Dan.