SQLite Forum

Native support for pluggable backends in SQLite

(1.1) By Ben Trask (btrask) on 2020-11-13 00:19:06 edited from 1.0 [link] [source]

Hi all,

I was talking with Dan Shearer of LumoSQL and came up with this interface/design document. (See below)

Note that none of the code behind this is written yet. It would require very deep changes to the SQLite btree code, and (probably) take months of work, especially to get it up to production grade.

I'm mainly just curious whether there is interest for such a thing.

Here is some of my past work in this direction:
https://github.com/btrask/sqlheavy - SQLite ported to LevelDB, LMDB, RocksDB, and anything else you can imagine
https://github.com/btrask/libkvstore - C wrapper around transactional key-value stores (used by SQLHeavy)

--Ben

The sample header:

// Proposed SQLite support for pluggable key-value store back-ends

// Design criteria
// 1. Non-invasive, "relatively" easy to interject within the btree.h API
// 2. Easy to implement by kv modules (wrappers for existing kv-stores)
// 3. Doesn't expose too many inner workings of SQLite
// 4. Support both SQLite internal configuration and arbitrary backend configuration
// 5. Allow SQLite users to configure arbitrary backend options (via PRAGMAs)
// 6. Work with existing custom VFSes (when possible)
// 7. Allow compiling without btree.c or related code at all ("SQLITE_OMIT_KV_BTREE")
// 8. Support ALL features/production-ready (integrity checking, etc)
// 9. Able to specify KV module per DB when opening, without adding sqlite3_open_v3()
// 10. Auto-detection during open with filename extensions (e.g. "db.sqlite_lmdb")
// 11. Able to be upstreamed into SQLite proper, negligible ongoing maintenance

// Non-goals/out of scope
// - Extracting btree.c as an independent key-value store (although it would be easier after this change)

// Note: Good performance is a baseline requirement.
// The overhead of this change will basically be one indirect v-table
// function call per btree access from SQLite main.
// Existing kv-stores will work at their full speed (however fast that is).
// The general performance of SQLite on other back-ends is already proven
// by SQLightning, SQLHeavy, LumoSQL and others.


// During DB open, SQLite would check e.g. "db.sqlite_lmdb", "db.sqlite_level",
// etc., based on the KV back-ends that are registered (in order).
// By using _, these names don't conflict with ".sqlite-wal", etc.
// The original btree back-end would still use plain ".sqlite"
const sqlite3_kv_methods *sqlite3_kv_find(const char *zExt);
int sqlite3_kv_register(sqlite3_kv_methods*, int makeDflt);
int sqlite3_kv_unregister(sqlite3_kv_methods*);
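
// For illustration only: a hedged sketch of how the open-time probe
// described above might work. kvProbeBackends and kvBackendList are
// invented names for this example, not actual or proposed SQLite code.
extern sqlite3_kv_methods *kvBackendList; // hypothetical list head kept by sqlite3_kv_register()

static const sqlite3_kv_methods *kvProbeBackends(
  sqlite3_vfs *pVfs,      /* VFS used to test for file existence */
  const char *zBase       /* e.g. "db.sqlite" as passed to sqlite3_open() */
){
  const sqlite3_kv_methods *p;
  char zName[512];
  int exists = 0;
  for(p = kvBackendList; p; p = p->pNext){
    sqlite3_snprintf((int)sizeof(zName), zName, "%s_%s", zBase, p->zExt);
    if( pVfs->xAccess(pVfs, zName, SQLITE_ACCESS_EXISTS, &exists)==SQLITE_OK
     && exists ){
      return p;           /* e.g. "db.sqlite_lmdb" exists: use the lmdb module */
    }
  }
  return 0;               /* fall back to the built-in btree on plain ".sqlite" */
}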


// Property IDs for sqlite3_kv_methods xGetConfig and xSetConfig
// These properties would need to be implemented by each back-end
// although for some back-ends it could be just a dummy (e.g. return
// free page count as zero for back-ends that don't expose that info)
#define SQLITE_KV_FREE_PAGE_COUNT     1
#define SQLITE_KV_DEFAULT_CACHE_SIZE  2
#define SQLITE_KV_LARGEST_ROOT_PAGE   3
#define SQLITE_KV_INCR_VACUUM         4 // TODO: This might need to be handled differently
#define SQLITE_KV_METADATA            5 /* int key, int val */
#define SQLITE_KV_CACHE_SIZE          6
#define SQLITE_KV_MMAP_LIMIT          7
#define SQLITE_KV_SAFETY_LEVEL        8
#define SQLITE_KV_SYNC_DISABLED       9
#define SQLITE_KV_PRAGMA             10 /* char *key, char *val */
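
// Hedged illustration of the "dummy" case mentioned above: a back-end
// that does not track free pages could answer the property with zero.
// lmdbGetConfig is an invented example name, and it assumes pArg points
// at an sqlite3_int64 for this property; neither is part of the proposal.
static int lmdbGetConfig(sqlite3_kv *pKv, int prop, void *pArg){
  switch( prop ){
    case SQLITE_KV_FREE_PAGE_COUNT:
      *(sqlite3_int64*)pArg = 0;   /* store doesn't expose this; report zero */
      return SQLITE_OK;
    default:
      return SQLITE_NOTFOUND;      /* property not supported by this back-end */
  }
}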


typedef struct sqlite3_kv_methods sqlite3_kv_methods;
typedef struct sqlite3_kv sqlite3_kv;
typedef struct sqlite3_kv_cursor sqlite3_kv_cursor;
typedef int (*sqlite3_kv_cmp_fn)(void*,int,const void*,int,const void*);

struct sqlite3_kv_methods {
	int iVersion;
	const char *zExt; // E.g. "lmdb" -> "db.sqlite_lmdb"
	sqlite3_kv_methods *pNext;

	int (*xInit)(void);
	void (*xShutdown)(void);

	// This function would be called by sqlite3_open(), etc., once it has
	// found the appropriate kv back-end to use. For newly created
	// databases, the back-end can optionally be specified by filename
	// extension, such as "db.sqlite_lmdb".
	int (*xOpen)(sqlite3*, sqlite3_vfs*, const char *zFilename, sqlite3_kv*);
	void (*xClose)(sqlite3_kv*);

	int (*xSetComparator)(sqlite3_kv*, sqlite3_kv_cmp_fn, void*);
	int (*xGetConfig)(sqlite3_kv*, int prop, void *pArg);
	int (*xSetConfig)(sqlite3_kv*, int prop, void *pArg);

	int (*xIntegrityCheck)(sqlite3_kv*);
	int (*xIncrementalVacuum)(sqlite3_kv*);
	// Note: These methods might allow a simplified API,
	// or conversely a few more methods like this might be necessary.

	int (*xBeginTxn)(sqlite3_kv*, int flags);
	int (*xSavepoint)(sqlite3_kv*); // nested transactions
	int (*xCommitPhaseOne)(sqlite3_kv*); // can be a no-op, depending on back-end
	int (*xCommitPhaseTwo)(sqlite3_kv*);
	void (*xRollback)(sqlite3_kv*);

	// Note: These functions refer to KV tables, not SQL tables.
	int (*xCreateTable)(sqlite3_kv*, int*, int flags);
	int (*xDropTable)(sqlite3_kv*, int iTable, int *piMoved);
	int (*xClearTable)(sqlite3_kv*, int iTable, int *pnChange);
	// Note: 'piMoved' and 'pnChange' arguments might be omitted.
	// These are things that you probably don't want enshrined
	// in a public API.

	int (*xCursorOpen)(sqlite3_kv*, int iTable, sqlite3_kv_cursor**);
	void (*xCursorClose)(sqlite3_kv_cursor*);
	int (*xCursorDelete)(sqlite3_kv_cursor*, int flags);
	int (*xCursorPut)(sqlite3_kv_cursor*, const void *key, i64 nkey, const void *data, i64 ndata, int flags);
	int (*xCursorFirst)(sqlite3_kv_cursor*, int dir);
	int (*xCursorNext)(sqlite3_kv_cursor*, int dir);
	int (*xCursorSeek)(sqlite3_kv_cursor*, const void *key, i64 nkey, const void *data, i64 ndata, int dir);
	int (*xCursorCurrent)(sqlite3_kv_cursor*, const void *key, i64 nkey, const void *data, i64 ndata);

};
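
// A minimal sketch of how a hypothetical LMDB wrapper might fill in and
// register this struct; every lmdbXxx function is assumed to be
// implemented elsewhere by the back-end, and the names are illustrative.
static sqlite3_kv_methods lmdb_kv_methods = {
  1,                    /* iVersion */
  "lmdb",               /* zExt: "db.sqlite" -> "db.sqlite_lmdb" */
  0,                    /* pNext: maintained by sqlite3_kv_register() */
  lmdbInit, lmdbShutdown,
  lmdbOpen, lmdbClose,
  lmdbSetComparator, lmdbGetConfig, lmdbSetConfig,
  lmdbIntegrityCheck, lmdbIncrementalVacuum,
  lmdbBeginTxn, lmdbSavepoint, lmdbCommitPhaseOne, lmdbCommitPhaseTwo, lmdbRollback,
  lmdbCreateTable, lmdbDropTable, lmdbClearTable,
  lmdbCursorOpen, lmdbCursorClose, lmdbCursorDelete, lmdbCursorPut,
  lmdbCursorFirst, lmdbCursorNext, lmdbCursorSeek, lmdbCursorCurrent
};

int sqlite3_lmdbkv_init(void){
  /* Register as the default back-end for newly created databases. */
  return sqlite3_kv_register(&lmdb_kv_methods, 1);
}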

(2) By Gunter Hick (gunter_hick) on 2020-11-13 07:13:50 in reply to 1.1 [link] [source]

We have found that the Virtual Table interface is sufficient to implement any type of backing store without changes to anything internal to SQLite.

Specifically we have operational virtual table implementations for

  • Faircom CTree files
  • Our own proprietary Data Dictionary (residing in shared memory sections)
  • diverse in-memory linked lists in C
  • FastBit compressed bitmap indexes
  • structured TLV decomposition

What, specifically, would the unique selling point of your proposal be that cannot be achieved using a virtual table?

(3) By Ben Trask (btrask) on 2020-11-13 12:33:15 in reply to 2 [link] [source]

Hi Gunther, thanks for the reply.

Virtual tables have many limitations listed here: https://sqlite.org/vtab.html

  • One cannot create a trigger on a virtual table.
  • One cannot create additional indices on a virtual table. (Virtual tables can have indices but that must be built into the virtual table implementation. Indices cannot be added separately using CREATE INDEX statements.)
  • One cannot run ALTER TABLE ... ADD COLUMN commands against a virtual table.

Also just the general syntax is different:

CREATE VIRTUAL TABLE tablename USING modulename;
-- vs
CREATE TABLE tablename (col1 INTEGER PRIMARY KEY, etc);

The docs say:

Each module-argument is passed as written (as text) into the constructor method of the virtual table implementation when the virtual table is created and that constructor is responsible for parsing and interpreting the arguments. The argument syntax is sufficiently general that a virtual table implementation can, if it wants to, interpret its arguments as column definitions in an ordinary CREATE TABLE statement.

But this is work that a true key-value store doesn't want to do, and would be a source of incompatibilities and bugs.

In short, the virtual table API is too high level for this purpose, and the virtual file system (VFS) API is too low level.

There are several projects which replace SQLite's native btree with other kv stores in an ad hoc way (SQLightning, SQLHeavy, LumoSQL, and others).

Unifying these under a single official API would make things more robust and flexible. It would also probably help SQLite's own code quality (and testing?), by enforcing a clean layering. And it might help SQLite get a built-in LSM-tree back-end in the future. :)

(I'm sorry if the title of the thread was too general and led to confusion. I was trying to avoid saying "key-value store" for people who might not be familiar.)

Ben

(5) By Gunter Hick (gunter_hick) on 2020-11-13 13:49:27 in reply to 3 [link] [source]

I can see your points. Dev is a large team effort here, so we usually don't need to do ad hoc changes to the schema. We have a Data Model stored as XML files, which are run through a "Data Model Compiler" to produce config files for the backing stores and the corresponding Virtual Table modules. Backing stores are rebuilt during install if necessary. So our DDL statements look like

CREATE VIRTUAL TABLE <table> USING <module> (<config file>,<entry>);

And the module already knows what the row looks like and which keys are defined.

Being that SQLite is used as a reporting/diagnosis tool here, we don't really miss triggers.

Thinking about adding indices to virtual tables just made me realise that we actually could add them to the backing store without telling SQLite anything about it. A process restart is all that would be required.

(4) By Ben Trask (btrask) on 2020-11-13 12:36:49 in reply to 2 [link] [source]

I'm sorry for getting your name wrong, Gunter!

(6) By Larry Brasfield (LarryBrasfield) on 2020-11-13 14:51:38 in reply to 4 [link] [source]

FYI: You may observe that there is an "Edit" button on your own posts. Using it for spelling and grammar fixups is good for reducing post clutter.

(7) By Dan Shearer (danshearer) on 2020-11-13 15:47:39 in reply to 1.1 [link] [source]

Ben, Gunter, others,

Ben's proposal is the sort of philosophy we're implementing in LumoSQL, although we are approaching it from the opposite direction. In LumoSQL we are trying to identify what is missing in SQLite, and then implement minimally-intrusive ways of addressing those problems. As of LumoSQL v0.3 we have working, benchmarked prototypes and are starting to close in on what a backend API might look like.

Gunter raised a point about the Virtual Table Interface which I respond to near the bottom.

I will leave a direct answer to Ben about header files and API details to another post, because I feel we need to know why we are making changes, then how we're going to do it, and finally what should be done. I have already discussed the general source architecture with Ben (and of course he was working in this area years ago), so I'll be glad to move the code architecture chat out into the open.

We have done a great deal of work on the "what's missing?" problem, including extensively documenting existing knowledge in the LumoSQL Knowledgebase. I recommend anyone interested in storage APIs for SQLite look there for inspiration in code and papers related to SQLite, K-V stores, VFS, benchmarking, transaction theory and more. A lot of SQLite work has taken place in isolated hard forks, and we are trying to make it easier for these forks to feel less isolated.

None of this overlooks the value of the existing binary on-disk format of SQLite as relied on by billions of people daily. That's another part of the discussion. This discussion is about what actions will address the pressure for SQLite storage to do more and different things.

What Is Missing in SQLite Storage?

Nobody Really Knows :-)

In terms of performance, reliability and tradeoffs, nobody knows what is missing. That is because there is no consistent way of comparing one version of SQLite with a modified version, and certainly not for repeating these comparisons across many different users, systems and build environments. That's why we built benchmarking, which also works for vanilla SQLite-on-SQLite.

Ability to Try New or Different K-V Options

There is a small number of K-V stores that provide the facilities SQLite needs: MVCC, a compatible transaction model, a Berkeley Database-style API, and implementation in C. The main candidates are LMDB, BDB (from when Oracle bought Sleepycat), and the BDB fork Bloomberg made when Oracle bought Sleepycat, which adds things like page-level locking. Karl Malbrain's database is a candidate that implements the Serial Safety Net concurrency protocol, which rethinks the way concurrency is handled in a K-V store. This is all about looking at alternative futures for SQLite.

A Modern Alternative to 1990s-era WAL files

Every mainstream SQL database including SQLite uses 1990s-era batch design to achieve reliable persistent page storage. This works, but it also ignores the last two decades of development in virtual memory management. Modern operating systems know how to save and retrieve pages from disk with very low corruption risk and little-to-no recovery requirements. SQLite has an option to try this out, because LMDB is ubiquitous in open source codebases and we know a lot about how it behaves. SQLite is the only mainstream database that has any chance of exploring 21st century page management. SQLite+LMDB has no WAL files, and greatly reduced pager code.

Better Corruption Detection

SQLite's main techniques for avoiding corruption seem to be to take extreme care in the WAL implementation, and to benefit from efficient code that limits the time window for errors - plus not needing to worry about consistency when doing network replication. The major databases (Pg, MS, Oracle, MySQL, MariaDB) all take the approach of having in-band and out-of-band corruption detection mechanisms. Being an embedded database, SQLite is often used in quite unforgiving hard-crash scenarios, and the question "who has regular difficulty with corrupted SQLite databases?" always results in a forest of raised hands at talks and conferences.

What's missing therefore is:

  • pre-emptive corruption detection (see next point)
  • option to use the operating system for this (see previous point)

Unlike the other main databases, SQLite has no pre-emptive corruption detection and only fairly basic on-demand detection.

Checksummed Corruption Detection

For reasons that are unclear, not one mainstream database has the facility to do fine-grained corruption detection that is directly visible to the user. Richard Hipp recently wrote the SQLite Checksum VFS which shows some of the possibilities, with many limitations.

LumoSQL has investigated this extensively and designed a robust system for corruption detection. This needs to be part of the backend API abstraction so that it applies to all K-V stores.

Privacy Support is Missing

Encryption is not electronic privacy, but it is a prerequisite for it. Since assorted new privacy laws mandate reliable encryption, and since non-crypto-capable SQLite is deployed pervasively, something needs to change. Exactly what or how is unclear because there are many existing options. LumoSQL has documented most of the crypto work that has been done relevant to SQLite, looked at the new requirements that either exist or are coming soon, and is developing a design that works with multiple backends.

Networking

This isn't further discussed here, however several of the SQLite hard forks have added networking backends. LumoSQL is trying to design the backend storage API so that it is not incompatible with networking, which is not the same thing as supporting it.

Why Not Use an Existing SQLite API?

An alternative K-V store can be implemented as a Virtual Table. That is what Gigimushroom's Database Backend Engine did in 2019. But that won't give access to the internal features of the VDBE-and-below layer that is the secret to performant storage in SQLite, which in turn is why so many SQLite forks have chosen to access these internals directly.

It is interesting to read SQLCipher's comments on not using existing APIs for their encryption product. A storage engine isn't the same as an encryption layer, but there are design issues in common. SQLCipher says:

SQLCipher is an extension to SQLite, but it does not function as a loadable plugin for many reasons. Instead, SQLCipher modifies SQLite itself, and is maintained as a separate version of the source tree. SQLCipher releases are baselined against a specific source version of SQLite. However, the project minimizes alterations to core SQLite code to reduce the risk of breaking changes during upstream SQLite merges.

The reasons that SQLCipher is packaged this way, as opposed to a "plugin" or extension to the SQLite amalgamation, follow:

  • Enabling an SQLite codec requires the compile-time definition of SQLITE_HAS_CODEC, which is not present on standard, unmodified SQLite builds.
  • Even when enabled, SQLite isn't set up to load codecs as plugins. While SQLite does have a plugin function for loadable extensions, it does not extend access to any system internals (it is mainly used to allow custom user functions).
  • SQLCipher makes calls to internal functions that are not part of the public SQLite API. Sometimes these APIs change, even in between minor SQLite versions. Thus, each update and merge requires inspection, testing and verification. Making SQLCipher portable across multiple versions of SQLite would not be feasible, nor could it use only the public API (for instance, even the first critical step of attaching the codec callback to the pager uses an internal API).
  • SQLCipher modifies supporting functions to introduce special pragmas, built in functions, etc (e.g. "PRAGMA cipher_*"). Injecting this functionality in a plugin architecture wouldn't be possible.
  • SQLCipher's test harness relies on support in testfixture to take advantage of the test API and various internal checks (memory reference counting, etc.)
  • Even if it were possible to use a loadable plugin, dynamic libraries aren't available on all supported platforms, for example iOS

For these reasons, and more, LumoSQL is implementing a backend storage API abstraction.

(8) By phpstatic on 2020-11-18 15:33:26 in reply to 7 [link] [source]

I checked your benchmark report from this link: https://lumosql.org/benchmarking/

The numbers don't seem good (100 TPS for SQLite, 300 TPS for SQLite+LMDB). Was the test run with MMAP + WAL enabled?

In my test of https://sqlite.org/src/doc/754ad35c/README-server-edition.html with WAL2 + MMAP:

synchronous=OFF, 26425 TPS.

synchronous=NORMAL, 13000 TPS.

synchronous=FULL, 9082 TPS.

With an in-memory database, 81920 TPS.

Each transaction includes 6 INSERTs.

The read rate is 210000 TPS over 65536 records in total.

I tested on an Intel CPU at 4.5 GHz with an NVMe SSD, on the WAL2 branch.

Enabling an SQLite codec requires the compile-time definition of SQLITE_HAS_CODEC, which is not present on standard, unmodified SQLite builds.

This is not the case any more; a VFS shim can handle the encryption and corruption detection. With a VFS there is no need to call internal functions.
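
A rough sketch of that shim pattern (the approach ext/misc/cksumvfs.c takes), with invented names like ShimFile, shimRead and shimWrite, and with most of the forwarding methods omitted:

#include "sqlite3.h"

typedef struct ShimFile ShimFile;
struct ShimFile {
  sqlite3_file base;        /* must come first: holds the methods SQLite calls */
  sqlite3_file *pReal;      /* underlying file opened by the wrapped VFS */
};

static int shimRead(sqlite3_file *pFile, void *zBuf, int iAmt, sqlite3_int64 iOfst){
  ShimFile *p = (ShimFile*)pFile;
  int rc = p->pReal->pMethods->xRead(p->pReal, zBuf, iAmt, iOfst);
  /* ...verify a per-page checksum or decrypt zBuf here, returning
  ** SQLITE_IOERR_DATA on a checksum mismatch... */
  return rc;
}

static int shimWrite(sqlite3_file *pFile, const void *zBuf, int iAmt, sqlite3_int64 iOfst){
  ShimFile *p = (ShimFile*)pFile;
  /* ...compute a checksum or encrypt a copy of zBuf here before writing... */
  return p->pReal->pMethods->xWrite(p->pReal, zBuf, iAmt, iOfst);
}

/* The shim VFS's xOpen opens the file with the real VFS, stores it in pReal,
** and installs an sqlite3_io_methods table pointing at wrappers like the
** ones above; every other file and VFS method simply forwards to the
** wrapped object. */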

SQLCipher modifies supporting functions to introduce special pragmas, built in functions, etc (e.g. "PRAGMA cipher_*"). Injecting this functionality in a plugin architecture wouldn't be possible.

I am not sure whether this could be used for that task: https://github.com/sqlite/sqlite/blob/master/ext/misc/cksumvfs.c#L725

(10) By Dan Shearer (danshearer) on 2020-11-18 17:58:15 in reply to 8 [link] [source]

phpstatic on 2020-11-18 15:33:26 wrote:

I checked your benchmark report from this link: https://lumosql.org/benchmarking/

Thanks, it's really helpful to have people look at what we've done so far.

The concept is that you can run the benchmarking tool for yourself. That is the tool that produced these example results on old hardware, so you will get much better numbers.

The reason for this design is that typically benchmarking is about one person running many tests and publishing their results, or, many people posting to a forum about the results they have. Results are very hard to replicate, and almost impossible to compare. Soon you will also be able to use this tool to upload your results database to create a public database of reproducible measurements for comparison.

In this case, while you have much better hardware and get much bigger numbers, the comparisons should be equally valid.

Please tell me if I have not been clear in this explanation so far :-)

The numbers don't seem good (100 TPS for SQLite, 300 TPS for SQLite+LMDB).

I recommend you look at the benchmark filter tool which illustrates how you can:

  1. Get your own numbers for 3.33.0 (or some other version you specify). I'm sure your numbers will be much higher than these example numbers.
  2. Compare on your exact machine 3.33.0 vs 3.31.0, or 3.31.0 vs 3.8.1+LMDB, etc.

After that, when your benchmarking database is uploaded and combined with the example benchmarking database you saw at lumosql.org, then we can answer some more questions:

  3. Does SQLite version X vs SQLite version Y behave proportionately the same on your modern hardware as on my old hardware? Perhaps it doesn't, because of memory pressures, or disk I/O bottlenecks. That's very useful information.
  4. Same question as (3), except for different backends. We do not necessarily expect uniform linear improvements when moving from old hardware to new hardware.

Was the test run with MMAP + WAL enabled?

That is an excellent point, because it is about the dimensions of benchmarking. Now that we have a working benchmarking tool that saves information to an SQLite database, we are improving the data it collects. The dimensions we have so far include:

  • SQLite version
  • Build options
  • Backend, and backend version
  • Environment major characteristics, eg Linux/Windows and their versions
  • Environment details - whatever we can get, in a portable way. Hardware details, OS detailed version numbers, disk media info.

While we are making the schema as future-proof as we can, we are also trying to make it comprehensive now. We are very aware that we haven't done any benchmarking on Windows, and have never run LumoSQL on Android even once... but these are both important deployment targets. So the benchmarking has to try very hard to run on these systems and give results that can be meaningfully compared.

Here is what is still to come in benchmarking, very soon:

  • Capturing system hardware
  • Results upload system
  • Verification of results, by testing that benchmark results rows are internally consistent

This is not the case any more; a VFS shim can handle the encryption and corruption detection. With a VFS there is no need to call internal functions.

Unfortunately that is not quite the case. In the LumoSQL Corruption Design document the heading "Design of the SQLite Checksum VFS Loadable Extension" lists positives and negatives. The negatives listed include:

  • No information about the logical location of this error, eg what row(s) it affects. The application knows nothing about how rows map to pages. All the application knows is that SQLITE_IOERR_DATA was returned during a read operation. That's a lot better than silent corruption, but also not as helpful as it could be.
  • Brittle implementation due to requirements of the file format. The "bytes of reserved space on each page" value at offset 20 in the SQLite database header must be exactly 8. What if we want a better or different checksum?
  • No facility for isolation or recovery of data. If we know with certainty that only row number 1234 is corrupted, then the application can take some action that is a lot less drastic than suspecting the entire file. This comes down to the design goal: the checksum VFS is aimed at the "random storage bitflips" problem. This is a very real issue, but there are many other causes of corruption in a database, including application bugs, partial restore of a backup file, an SQLite crash, and more.

(11) By phpstatic on 2020-11-19 02:52:45 in reply to 10 [link] [source]

I tried to run the LumoSQL benchmarks, but got an error like this:

./tool/get-lumo-sources /opt/osx/lumosql/sources test 3.33.0
Cannot load VCS(git): Can't locate Git.pm in @INC (you may need to install the Git module) (@INC contains: /usr/local/bin/../lib /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18 /Network/Library/Perl/5.18/darwin-thread-multi-2level /Network/Library/Perl/5.18 /Library/Perl/Updates/5.18.4/darwin-thread-multi-2level /Library/Perl/Updates/5.18.4 /System/Library/Perl/5.18/darwin-thread-multi-2level /System/Library/Perl/5.18 /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level /System/Library/Perl/Extras/5.18 .) at /Library/Perl/5.18/NotFork/VCS/Git.pm line 13, <$fh> line 10.
BEGIN failed--compilation aborted at /Library/Perl/5.18/NotFork/VCS/Git.pm line 13, <$fh> line 10.
Compilation failed in require at (eval 6) line 2, <$fh> line 10.
make: *** [build/3.33.0] Error 1

I have already run this:

perl -MCPAN -e shell
install Git::Repository

No information about the logical location of this error, eg what row(s) it affects. The application knows nothing about how rows map to pages. All the application knows is that SQLITE_IOERR_DATA was returned during a read operation. That's a lot better than silent corruption, but also not as helpful as it could be.

This is very true. I am trying to get the page information for each transaction, to know whether they conflict, for use with optimistic Raft pipelining (still work at an early stage).

Brittle implementation due to requirements of the file format. The "bytes of reserved space on each page" value at offset 20 in the SQLite database header must be exactly 8. What if we want a better or different checksum?

I consider the checksum VFS a reference implementation; I am trying to build a VFS with 64 bytes reserved per page for encryption.

(14) By Dan Shearer (danshearer) on 2020-11-19 10:35:45 in reply to 11 [link] [source]

phpstatic wrote on 2020-11-19 02:52:45:

I already run this:

perl -MCPAN -e shell install Git::Repository

This suggests that any program that says "use Git;" at the top will fail, so there appears to be something wrong with your Perl setup.

Did you perhaps choose to use CPAN directly, where your operating system has its own Perl package management? For example...

  • On Debian/Ubuntu: apt install libgit-wrapper-perl
  • On Centos/Red Hat: yum install perl-Git

(coming to not-forking soon is better dependency checking for things like Git and the Git Perl modules at runtime not build/install time. But that isn't going to fix your problem, only perhaps detect it better. Although Fossil support is also coming, and that might fix all your Git-related problems :-)

Dan Shearer

(16) By Dan Shearer (danshearer) on 2020-11-19 17:03:56 in reply to 14 [link] [source]

Dan Shearer (danshearer) wrote on 2020-11-19 10:35:45:

(coming to not-forking soon is better dependency checking for things like Git and the Git Perl modules at runtime not build/install time. But that isn't going to fix your problem, only perhaps detect it better. Although Fossil support is also coming, and that might fix all your Git-related problems :-)

All done and committed to not-forking.

It's worth installing an updated not-forking and doing lumosql "make benchmark".

Dan Shearer

(12) By anonymous on 2020-11-19 07:35:28 in reply to 8 [link] [source]

This is not the case any more; a VFS shim can handle the encryption and corruption detection.

While this is true, if you need reserved bytes in a page, then you must do one of the following:

  • The application uses SQLITE_FCNTL_RESERVE_BYTES to set the reserved bytes in the page. The VFS will not be able to do so by itself.

  • When the VFS creates a new database file, it must prefill the header. However, it might then not be possible to set some fields that cannot be changed after the database is created (such as the text encoding).

Neither alternative seems entirely satisfactory. There may be other things that could be improved with the VFS implementation, although I don't know.
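
For the first alternative, a minimal hedged sketch of the call involved, assuming the application makes it immediately after opening a new, still-empty database:

#include "sqlite3.h"

static int reserve_bytes_for_vfs(sqlite3 *db, int nReserve){
  /* SQLITE_FCNTL_RESERVE_BYTES sets (or, if passed a negative value,
  ** merely queries) the "reserved bytes per page" header field. */
  return sqlite3_file_control(db, "main", SQLITE_FCNTL_RESERVE_BYTES, &nReserve);
}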

(9) By Dan Shearer (danshearer) on 2020-11-18 16:48:09 in reply to 7 [link] [source]

Dan Shearer (danshearer) on 2020-11-13 15:47:39:

I received a comment on the section I wrote:

... "What's Missing... Nobody Really Knows :-)"

In terms of performance, reliability and tradeoffs, nobody knows what is missing. That is because there is no consistent way of comparing one version of SQLite with a modified version, and certainly not for repeating these comparisons across many different users, systems and build environments. That's why we built benchmarking, which also works for vanilla SQLite-on-SQLite.

It would have been better to say "The LumoSQL project hasn't been able to find published benchmarking", because there may be unpublished benchmarking, and there are experts on this forum who have an excellent understanding of how to make SQLite run fast.

The LumoSQL Knowledgebase uses better words on this point under the heading "List of Relevant Benchmarking and Test Knowledge". The paper "Dangers and complexity of sqlite3 benchmarking" listed there says:

...changing just one parameter in SQLite can change the performance by 11.8X... up to 28X difference in performance

That doesn't mean the benchmarking itself is so hard, but I'm not aware of anyone publishing like-for-like comparisons.

Dan Shearer

(18) By Karl Malbrain (malbrain) on 2021-01-09 01:02:27 in reply to 7 [link] [source]

SSN makes a lower concurrency level like MVCC serializable. It is built on top of the MVCC KV data record and adds a few words of timestamp just after the KV data record to serialize the commits. Its use is optional in the "dbstack" for non-MVCC concurrency records.

Record locking: a simple KV store record would stack its values and return its recno, which is indexed by an extendible array of file offsets and one bit of recno latch.

(13) By anonymous on 2020-11-19 08:19:20 in reply to 1.1 [source]

I like this idea. SQLite4 has a similar idea, but SQLite3 doesn't have any so far. However, due to the way SQLite3 storage works (and due to other things), I have a few suggestions:

  • The comparison function will presumably need to be set for the cursor (when you are opening it), I would expect. (This is only applicable for cursors that do not use 64-bit integer keys, though.)

  • Each "KV table" has either 64-bit integer keys (which always uses numerical order, I think) and arbitrary sequences of bytes as data, or the arbitrary byte sequence used as data is also used as the key and no separate key is stored (in which case the comparison function is needed, I think). This could be specified perhaps when creating the table, and maybe it is also needed when opening the cursor. This can be specified as flags, either SQLITE_KV_TABLE (for ordindary tables) or SQLITE_KV_INDEX (for indexes and WITHOUT ROWID tables). This means that the types for xCursorPut, xCursorSeek, and xCursorCurrent will need to be corrected.

  • For accessing the data stored in the header of the standard SQLite file format, you could have a method that is given the number which is the offset in the standard SQLite format, for example 68 to read or set the application ID, or 56 for the text encoding. Some of these header fields will not be applicable to custom storage engines.

  • It shouldn't care about filename extensions (different programs might use different filenames, and some might not use any filename extension at all; not everything uses ".sqlite", and SQLite itself doesn't care), although it will need a name, like the VFS needs a name to refer to it. Which one to use can be specified as a URI parameter, or by trying different storage implementations until one works. (If multiples are loaded, and one finds that the file is not in its format, it can return SQLITE_NOTADB to indicate that the next one should be tried. If it does care about the filename, its xOpen method can check the filename and return SQLITE_NOTADB based on the filename, without trying to open the file.)

  • It might need to indicate savepoint numbers.

  • There are a few other problems with the API that you have specified. (For one thing, some of it is unclear. Some things might also be unnecessary, and some things might be missing.)
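
A hedged sketch of the flags suggestion above; the flag values and the two alternative cursor-put shapes are invented here for illustration and are not part of the draft header:

#define SQLITE_KV_TABLE  0x01   /* 64-bit integer keys, arbitrary byte data */
#define SQLITE_KV_INDEX  0x02   /* key and data are one byte sequence; needs a comparator */

/* For SQLITE_KV_TABLE cursors the key argument could collapse to an integer: */
int exampleCursorPutRowid(sqlite3_kv_cursor*, i64 rowid, const void *data, i64 ndata, int flags);

/* For SQLITE_KV_INDEX cursors only a single record buffer would be passed: */
int exampleCursorPutRecord(sqlite3_kv_cursor*, const void *record, i64 nrecord, int flags);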

Virtual tables are useful for different things than a storage implementation is useful for, and each has its advantages and disadvantages. (Even so, I do think there are some deficiencies in the virtual table mechanism which could be corrected, but some things they just won't or shouldn't do, and there are different things that the key/value mechanism just won't or shouldn't do.)

(15) By Ben Trask (btrask) on 2020-11-19 15:27:01 in reply to 13 [link] [source]

Thank you, anonymous, for your detailed analysis!

  • The comparison function will presumably need to be set for the cursor (when you are opening it), I would expect. (This is only applicable for cursors that do not use 64-bit integer keys, though.)
  • Each "KV table" has either 64-bit integer keys (which always uses numerical order, I think) and arbitrary sequences of bytes as data, or the arbitrary byte sequence used as data is also used as the key and no separate key is stored (in which case the comparison function is needed, I think). [...]

Key handling is one of the biggest pain points of this whole idea, since the way SQLite 3 handles btree keys internally is far from "clean". SQLite 4 was going to clean this up, but, what can you do? :)

From your comment I think you are saying the details of key comparison vary at runtime based on the schema. The API I was imagining would abstract over this from the point of view of the back-ends, but there are definitely pros and cons to each approach. One means potentially more redundant code per back-end, but also potentially more control. I'm not opposed to either way.

  • For accessing the data stored in the header of the standard SQLite file format, you could have a method that is given the number which is the offset in the standard SQLite format, for example 68 to read or set the application ID, or 56 for the text encoding. Some of these header fields will not be applicable to custom storage engines.

This could be better. The concerns are 1. flexibility (forward compatibility), 2. usefulness (allowing back-ends to actually use fields if they have reason to), and 3. minimizing back-end code for back-ends that don't care. My hunch is that using byte offsets is "less general", but at this point the SQLite format is well documented and unlikely to change, so... Yeah, either way.

If multiples are loaded, and one finds that the file is not in its format, it can return SQLITE_NOTADB to indicate that the next one should be tried. If it does care about the filename, its xOpen method can check the filename and return SQLITE_NOTADB based on the filename, without trying to open the file.

This is an interesting idea, but it could make detecting corruption non-trivial (or automatically "fixing" it destructive, if the back-end is a different format). SQLite makes incredible guarantees for reliability and long term support, so I don't want to cut any corners that might introduce problems down the line. Storing yet more info in the filename is bad, but in this case I think it's necessary given the entire file content is under control of the back-end.

I can also imagine back-ends that might wrap the existing btree (e.g. for transparent compression at the KV level), which you wouldn't want the btree back-end to try to open directly, even though the header would be compatible.

It might need to indicate savepoint numbers.

Yes, quite likely. Good catch.

There are a few other problems with the API that you have specified. (For one thing, some of it is unclear. Some things might also be unnecessary, and some things might be missing.)

Yes, indeed. There would definitely be at least a few minor changes discovered during implementation.

I'm mainly concerned with what the SQLite authors (Dr. Hipp et al.) think of the overall direction, including some of the "policy" questions like using filename extensions to identify back-ends and the handling of keys as opaque buffers, and whether there is any interest in such an invasive and "risky" change at all.

Thanks again for your comments!

--Ben

(17) By anonymous on 2021-01-07 21:10:05 in reply to 15 [link] [source]

In order to enable understanding of database internals in undergrad CS projects, I am writing the SQL level, the MongoDB level, and the KV level beneath that, using vanilla JavaScript as the language of demonstration and preference.

github.com/malbrain/database
github.com/malbrain/javascript-database

The approach to keys is simplified by keeping a bitmap deduplication array to filter in the KV level as keys are returned by the btree level. Each key needs a small value structure hierarchy used by the plugins.