SQLite Forum

26 events by user billwade

2021-11-30
16:57 Reply: Faster way to insert into WITHOUT ROWID table? artifact: 58657fb221 user: billwade

Perhaps a pragma ordered_bulk_insert could make the planner choose to attempt the optimization?

2021-11-29
15:02 Reply: Faster way to insert into WITHOUT ROWID table? artifact: 838c0d8577 user: billwade
I suspect that even the original form doesn't have fast filtering on b, unless you specifically create a "b" index.

In principle, the query planner could do something smart for
   select ... where ab>>32 = :a;
but I'd be a bit surprised.

I would hope it would do something smart for
   select ... where ab >= :a<<32 and ab < (:a+1)<<32;
so you might be able to do something with that.
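
A minimal sketch of the range form, assuming a WITHOUT ROWID table keyed on the packed 64-bit value (table and column names are illustrative):

   CREATE TABLE t(ab INTEGER PRIMARY KEY, payload) WITHOUT ROWID;

   -- The primary-key index can satisfy this range directly,
   -- without a separate index on the high 32 bits.
   SELECT payload FROM t
    WHERE ab >= (:a << 32)
      AND ab <  ((:a + 1) << 32);
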
2021-11-13
15:16 Reply: Usage of application_id and magic.txt artifact: 9ec939e080 user: billwade

The SQLite documentation reads:

"Applications that use SQLite as their application file-format should set the Application ID integer to a unique integer so that utilities such as file(1) can determine the specific file type rather than just reporting "SQLite3 Database". A list of assigned application IDs can be seen by consulting the magic.txt file in the SQLite source repository."

The documentation doesn't explicitly say how to 1) come up with a unique integer, and 2) ensure that it remains unique. However, it implies that some such mechanism might exist.

A quick reading suggests:

  1. Pick an integer not in magic.txt.
  2. Get it added to magic.txt.

Since magic.txt is in the SQLite repository, that would imply contacting someone who has permission to update that repository. That is what the OP attempted.

The sparsity and age of the entries in magic.txt strongly suggest that some other mechanism should be used. The clues file(1) and "magic" suggest searching through unix (or unix-like) documentation, which might tell you how to determine whether an integer you picked is currently registered, and how to register one for yourself.
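
For reference, setting the ID itself is a one-liner; the value below is an arbitrary illustration, not a registered one:

   PRAGMA application_id = 1096975192;  -- arbitrary example value
   PRAGMA application_id;               -- read it back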

2021-11-04
11:41 Reply: Is this inherent casting by design? artifact: 7cdefb118d user: billwade
The behavior is documented.

"Mathematical operators (+, -, *, /, %, <<, >>, &, and |) interpret both operands as if they were numbers. STRING or BLOB operands automatically convert into REAL or INTEGER values."

The rules for converting strings to numbers roughly amount to taking the longest prefix of the string that looks like a number.
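
A few examples of that prefix rule:

   SELECT '3abc' + 1;   -- 4: the numeric prefix '3' is used
   SELECT 'abc' + 1;    -- 1: no numeric prefix, so 'abc' converts to 0
   SELECT '2.5x' * 2;   -- 5.0: the prefix '2.5' converts to REAL
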
2021-10-26
13:20 Reply: No ".open" option in version 3.7.17 2013-05-20 artifact: e51d19f8e6 user: billwade
That test tells you it is not a typical sqlite3 database file. So you've eliminated one possibility out of many (albeit essentially the only one this forum discusses).

Since you have unix utilities, you might try the "file" command, which (according to its man page) determines file types.

If I rename a Microsoft Access .mdb file to abc.def I get

$ file abc.def
abc.def: Microsoft Access Database

so it sometimes works.
12:20 Reply: No ".open" option in version 3.7.17 2013-05-20 artifact: 79e6f32064 user: billwade
The first several bytes of an SQLite db are the ASCII string "SQLite format 3" (without the quotes). There are exceptions to that with some VFS implementations, but for those you'd have to do something special with sqlite3.exe anyway.

If you open the file with a binary editor, you can check for that string, or if you have unix utilities:

strings myfile.db | head

to see if that string shows up.
2021-08-11
12:04 Reply: Increasing insert performance of an in-memory db artifact: 1bb589490b user: billwade

Often, appends are faster than inserts.

Are your inserts in primary key order? If so, an insert of ~100 records touches a couple of leaf pages (depends on page size, and your string sizes), and about the same number of index pages.

If not, it touches about 100 leaf pages and about 100 index pages.

The thread you referenced was doing inserts in primary key order.
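
A sketch of the contrast (schema illustrative):

   CREATE TABLE t(id INTEGER PRIMARY KEY, val TEXT);
   -- Key order: consecutive rows land on the same few rightmost pages.
   INSERT INTO t VALUES (101, 'a'), (102, 'b'), (103, 'c');
   -- Random order: each row may land on a different leaf page.
   INSERT INTO t VALUES (90144, 'x'), (7, 'y'), (41233, 'z');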

2021-07-29
12:12 Edit reply: Why data is lost in SQLite database with WAL mode on when connection is not closed properly? artifact: 8322376593 user: billwade

By default, the WAL is not fsync'd after every transaction (transactions aren't guaranteed to be durable).

I suspect that the file is fsync'd when the connection is closed, but I didn't find that in a quick look at the documentation.

If

(8) Close application (without closing Connection 1)

also involves crashing the OS, I'd say the behavior you describe is documented, unless you set pragma synchronous=FULL.
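
If durability matters more than the cost of the extra fsyncs, the remedy is a per-connection setting; a minimal sketch:

   PRAGMA journal_mode = WAL;
   PRAGMA synchronous = FULL;  -- fsync the WAL on each commit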

"A transaction committed in WAL mode with synchronous=NORMAL might roll back following a power loss or system crash."

12:11 Reply: Why data is lost in SQLite database with WAL mode on when connection is not closed properly? artifact: d36af8770e user: billwade

By default, the WAL is not fsync'd after every transaction (transactions aren't guaranteed to be durable).

I suspect that the file is fsync'd when the connection is closed, but I didn't find that in a quick look at the documentation.

If

  1. Close application (without closing Connection 1)

also involves crashing the OS, I'd say the behavior you describe is documented, unless you set pragma synchronous=FULL.

"A transaction committed in WAL mode with synchronous=NORMAL might roll back following a power loss or system crash."

2021-07-13
14:49 Reply: How much would you trust a page-level checksum ? artifact: 963cca6d27 user: billwade

For Simon's original question:

"what's the risk that a change in a page's contents would leave the checksum unchanged ?"

I agree that the birthday paradox does not apply when you are considering only two pages (before and after). Accidental collisions would be exceedingly rare for some reasonable definition of a good 64-bit checksum. As others have pointed out, checksums can reliably tell you that two pages are different, and if "exceedingly rare" isn't good enough for you, do the full sequence comparison to determine whether they are indeed the same.

However, the subject had drifted a bit. For malicious attackers, I'd worry that it wasn't enough; I'd worry about any 64-bit "digital signature." The birthday paradox isn't directly applicable, but it, along with knowledge of the checksum internals, is a tool for the attacker, and the smaller N gets, the more useful such tools become.

The birthday paradox really kicks in when you start comparing the checksum of every page in the database (or across databases), perhaps to reduce duplication. There the numbers start to get large enough to "reasonably" expect 64-bit checksums to match when the contents don't, even in the absence of bugs and malicious actors.

I probably extrapolated Simon's original post a bit too far. My apologies.

13:11 Edit reply: How much would you trust a page-level checksum ? artifact: 411774f14a user: billwade

You calculated the total number of checksums, and divided that by the time it would take to generate that number of checksums.

It is certainly possible that you could generate N=2^64 checksums and not get a single duplicate, just as it is possible that given a random collection of month-day pairs, you could have 365 with no duplicates.

However, https://en.wikipedia.org/wiki/Birthday_problem says that the probability of a duplicate exceeds 50% once you have 23 random dates.

Above I used 23 as (roughly) sqrt(365). Admittedly that was very rough.

Edit (really bad math in the previous version):

That web page says that you have a 50% chance of at least one collision among random 64-bit numbers once you have generated 5.1e9 of them, which is before every person in the world has rolled the dice once.
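
For reference, the standard birthday approximation (assuming uniformly random values over N possibilities) puts the 50% collision point at

   n ≈ sqrt(2 * N * ln 2) ≈ 1.1774 * sqrt(N)

so N = 2^64 gives n ≈ 1.1774 * 2^32 ≈ 5.1e9, matching the figure above.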

12:02 Reply: How much would you trust a page-level checksum ? artifact: fb88f25e1f user: billwade

You calculated the total number of checksums, and divided that by the time it would take to generate that number of checksums.

It is certainly possible that you could generate N=2^64 checksums and not get a single duplicate, just as it is possible that given a random collection of month-day pairs, you could have 365 with no duplicates.

However, https://en.wikipedia.org/wiki/Birthday_problem says that the probability of a duplicate exceeds 50% once you have 23 random dates.

Above I used 23 as (roughly) sqrt(365). Admittedly that was very rough.

If I change your 2^64 to sqrt(2^64) (admittedly that is too pessimistic), your 74 years drops to about 6.3 microseconds.

2021-07-12
12:30 Reply: How much would you trust a page-level checksum ? artifact: ee13af799e user: billwade

Is computing one checksum, and comparing it to another faster than just comparing two sequences? That depends on how fast the checksum algorithm is, and the relative speeds of computation and sequential memory access. I'm sure people can find systems where either one wins.

Even if you get around that issue, I believe a 64-bit checksum is too small.

Birthday paradox says (roughly) that given random numbers in 0 ... N, you should expect duplicates to show up when you have roughly sqrt(N) numbers.

For the smallest SQLite page size, that "roughly" shows up at a couple of terabytes. That is too small for comfort, and one reason UUIDs aren't only 64 bits.
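
Concretely, with 64-bit checksums and the minimum 512-byte page size:

   sqrt(2^64) = 2^32 pages; 2^32 * 512 bytes = 2 TiB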

I think you'd want to use a good 128-bit checksum instead.

As to whether it is better to check for changes as you update a row or field, or to just wait until commit time and check the entire page, my guess is:

  1. Doing checks at the page level is simpler to code, if your only concern is to avoid writing unchanged pages. If you also want to avoid updating indexes that refer only to unchanged fields (I don't know if SQLite does that optimization), I'd guess that waiting to see if the page is dirty before updating indexes makes things trickier.

  2. I suspect that page-level checks (with or without a checksum) are a performance boost only if it is very common for a page to get transactions where a sequence of update/delete/insert results in no net change to the page.

I do see a value in checksums, just to catch accidents or bugs. I just don't believe they are likely to be much of a speedup for learning whether or not a bunch of small changes resulted in any net change.

2021-07-09
13:18 Reply: How to handle collation version changes artifact: 833cd2181f user: billwade

I suspect that REINDEX can fail. Because of a collation change, a table that was previously valid for a UNIQUE index may have records that violate that constraint.

For a WITHOUT ROWID table, changing the collation of the primary key might involve shuffling the location on disk of almost every record. That might be much more expensive (time and space) than other REINDEX operations.
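
A sketch of that failure mode, assuming a custom collation my_coll whose definition changes between application versions (all names illustrative):

   CREATE TABLE t(name TEXT COLLATE my_coll UNIQUE);
   INSERT INTO t VALUES ('abc'), ('ABC');  -- distinct under the old my_coll
   -- If a newer my_coll compares 'abc' and 'ABC' as equal:
   REINDEX t;  -- would hit a UNIQUE constraint violation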

2021-07-08
19:59 Reply: Deleted rows not in FreeList artifact: 53b6112c8e user: billwade

You may want to describe in more detail what you are doing (provide the SQL and/or other scripts that reproduce your actions), the results you are seeing, and the results you expect to see.

For instance, depending on your journal mode, even a committed transaction may not yet be in the main database file.

2021-07-07
20:40 Reply: transient indices vs. LIMIT artifact: 3b493dc1c0 user: billwade

I'd call it strong evidence that they aren't doing a full sort.

To do a full sort of 320 MB of data, a process that is only using 10 MB of memory would need more than 320 MB of reads + 240 MB of writes.

There weren't enough writes to just copy the input (unless individual writes were bigger than reads, which I suppose is possible).

It seems like there were way too many writes to maintain a small index, even if they did a write (or a few writes) when the top-10 contents change.

I'd expect that every time you double the number of records you've already processed, you normally get between 5 and 10 new entries in the top-10.

Log2(20e6) is about 24, so that would imply you can write every change to the top-10 index with just a few hundred writes.

If instead of keeping the top-10, they kept the top page of the index (a few hundred entries), that might explain the number of writes I saw.

13:29 Reply: transient indices vs. LIMIT artifact: 2b5821f5ed user: billwade
A heap could help for ORDER BY, but I don't see how it helps for DISTINCT.

Using a smaller index (only holding items that are currently candidates to meet the limit) would change the complexity from O(n log n) to O(n log k), where n is the number of rows in the table and k is the LIMIT count.

In Python (3.6.8), I turned journal_mode off. I built a table with 20 million rows of floats (from random.random()). Schema was CREATE TABLE a(b);

At that point, Windows task manager said my process had done a couple of hundred reads and about 80k writes. File size was 330MB.

I did SELECT COUNT(*) FROM a;
That added about 80k reads to the total. As expected, it looks like it had to read the entire database to satisfy that query.

I twice did

SELECT DISTINCT b FROM a ORDER BY b LIMIT 10;

each time that resulted in about 80k reads and about 60k writes.

I'm not sure what that says about the implementation. 80k reads is what it takes to read the table once. It isn't enough to do that and also build a full-size additional index.

However, it shouldn't take any writes at all to maintain a 10-entry temporary index, so it isn't clear what those 60k writes were actually doing.

I never saw the python process memory much above 10MB, so it certainly wasn't holding the full database (or a full temporary index) in memory.
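
For anyone who wants to reproduce the setup without Python, an SQL-only sketch (random()/2^64 + 0.5 maps SQLite's signed 64-bit random() into [0, 1)):

   PRAGMA journal_mode = OFF;
   CREATE TABLE a(b);
   WITH RECURSIVE cnt(x) AS (
     SELECT 1 UNION ALL SELECT x + 1 FROM cnt LIMIT 20000000
   )
   INSERT INTO a
     SELECT random() / 18446744073709551616.0 + 0.5 FROM cnt;

   SELECT DISTINCT b FROM a ORDER BY b LIMIT 10;
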
2021-06-28
12:04 Reply: High cpu usage of System.Data.SQLite.SQLiteConnection..ctor artifact: 1566872abd user: billwade

Larry's reply is likely applicable for many databases, but "even more so" for SQLite.

The on-disk format of the SQLite schema is mostly just text (the sqlite_master table holds the text of each object's schema). At startup, SQLite parses that text to build its in-memory representation of the schema. Reading and parsing that text takes time.
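
You can see that stored text directly:

   SELECT type, name, sql FROM sqlite_master;  -- the schema, as SQL text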

Many other databases store a parsed representation of the schema. That may save some time at startup. If an instance of one of those database processes doesn't access some tables, that instance may never have to even get those particular tables' schemas into memory.

If you are using a server-based database, you may be connecting to an already-running instance of the server. If "connect to server" is substantially less expensive than "connect to database", that would provide additional savings.

The SQLite choice probably makes "Lite" easier to achieve. Clearly SQLite has to know how to read SQL. Reusing that knowledge at startup may mean writing less code specific to startup.

There is also a good chance that the text-based schema makes backwards compatibility easier. Any structure that never goes to disk can be changed, without having to worry about maintaining code specifically to read the old version.

I believe I've seen Dr. Hipp make similar statements, but I was unable to find those statements with a quick search.

2021-06-16
13:28 Reply: Opening a DB with SQLITE_OPEN_EXCLUSIVE artifact: e5ffa711ab user: billwade
Thanks.

I hadn't noticed that locking_mode provides a way to avoid unlocking the database between one transaction and a following pragma journal_mode.

However, if I kill my process after commit; but before setting WAL, I've still got a database that shows my application_id, but which has never been in WAL mode. Admittedly, that is a very small window for failure.

To close that window it seems like my choices are

1) A looser definition of "empty" that ignores the current journal mode, and for an empty database I set wal first, and then do my first transaction. This means that by the time I set application_id, the database had already been set to wal (somebody else could unset wal in-between, but since they can do that at any time, anyway, I'm ok with that).

2) Do what you suggested, but have my initial transaction set a "WAL pending" flag that gets cleared after my app (this one, or a later instance) succeeds in setting wal mode.
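
A sketch of (2), with table and flag names purely illustrative:

   BEGIN;
   CREATE TABLE app_meta(key TEXT PRIMARY KEY, value);
   INSERT INTO app_meta VALUES ('wal_pending', 1);
   -- ... rest of first-time initialization ...
   COMMIT;
   PRAGMA journal_mode = WAL;  -- effectively its own transaction
   UPDATE app_meta SET value = 0 WHERE key = 'wal_pending';
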
2021-06-15
16:19 Reply: Opening a DB with SQLITE_OPEN_EXCLUSIVE artifact: b861a9afc5 user: billwade

I'm not sure about the OP's issue.

I'd like my initialization for a new database to set journal_mode=WAL, but if an existing database has another journal mode, I want to respect the owner's decision to change what I had initially picked (even if they made the change outside of my application, perhaps using sqlite3.exe).

However, that means my "complete initialization" can't be done in a single transaction (setting journal_mode=WAL is effectively its own transaction).

I suppose my application could:

  1. Initialize the contents (other than journal mode) in single rollback-mode transaction, and containing a flag saying "still need to do first-time WAL".
  2. Set WAL mode (if that flag is still set).
  3. Clear that flag (in another transaction).

That lets me respect a user's journal_mode setting, unless they managed to get it in before (3), which should be uncommon. In the "700 copies trying to initialize at the same time" scenario, I think there is a good chance that none of them manages to successfully set WAL mode and clear the flag.

I don't want to do WAL-mode first (and then the first explicit transaction), because I don't want to modify a non-empty database at all if it has the wrong application_id (which I set in the first transaction for a new or empty file).

Another approach would be to treat empty files as unusable, and do my initialization only for new files (initialize to a temporary file name, and then rename to the user-requested file name). That means I have to close the temporary file, rename it (assuming rename fails if the target exists), and then open it again. With this approach, exactly one of the 700 initializers succeeds.

That has its own set of problems (sqlite3.exe creates an empty file, and now that file-name is unusable).

2021-05-27
16:36 Reply: Is there a way to optimize this UPDATE command to run faster? artifact: aa88c9752d user: billwade

You'd expect faster results with an index on snapshot_id or object_id (preferably whichever of those has fewer duplicates).

One could imagine the current plan is roughly

  • Look at every record until you find snapshot_id==3.
  • Look at every record until you find a matching object_id and snapshot_id==5. Update dst if/when you find the match.

That outer loop executes 24k times, and the inner loop might execute 24k times for each call of the outer loop, for a total of 24k*24k = half a billion times.

If there are very few duplicate object_id's, an index on object_id substantially reduces the work for the inner loop.

If there are very few duplicate snapshot_id, an index there substantially reduces the work for both loops.
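
For example (the table name is a stand-in for the OP's):

   CREATE INDEX idx_obj ON mytable(object_id);
   -- or, to narrow both loops at once:
   CREATE INDEX idx_snap_obj ON mytable(snapshot_id, object_id);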

2021-05-11
15:06 Reply: Fast way to insert rows in SQLite artifact: c98ce8c601 user: billwade

"CREATE TABLE IF NOT EXISTS user ( id INTEGER not null primary key, area CHAR(6), age INTEGER not null, active INTEGER not null)"

Depending on the id value, its storage might be around 7 bytes.

Assuming ages are in 0 to 127 and active values are 0 or 1, the SQLite storage for those is two or three bytes (1 byte per column of "overhead", and another byte holds age when it is more than 1).

You are probably looking at around 17 bytes per row if length(area) is typically 6 bytes (look at your database disk size after you've inserted a million records to see if I'm right).

You could probably save 2 bytes per row by using a BLOB to hold all of the non-key information. The blob could hold age and active in a single byte, and its remaining bytes would hold area.

CREATE TABLE IF NOT EXISTS user (id INTEGER PRIMARY KEY, info BLOB);

You could create a VIEW that presents the values in your preferred multi-column form (and the VIEW could have INSTEAD OF triggers that support inserts and updates in the multi-column form, but those triggers would probably not be as fast as building the BLOB in your application).

Saving 2 bytes out of 17 might give you a 10% or more speed boost.

2021-04-23
12:53 Post: unqualified rowid as a SELECT column artifact: 9f3a654f99 user: billwade
Some expected and documented rowid behavior

CREATE TABLE a(rowid INTEGER PRIMARY KEY);
CREATE TABLE b(id);
SELECT rowid FROM a; -- works as expected and documented
SELECT rowid FROM b; -- works as expected and documented
CREATE TABLE d(id PRIMARY KEY) WITHOUT ROWID;
SELECT rowid FROM d; -- fails as expected and documented
CREATE VIEW av AS SELECT * FROM a;
SELECT rowid FROM av; -- works as expected

Some rowid behavior which is reasonable (perhaps even expected), but for which I couldn't find documentation:

SELECT rowid FROM a,b; -- gets a.rowid, reasonable and expected
SELECT rowid FROM b,d; -- fails, even though b.rowid exists (sort-of)
CREATE VIEW bv AS SELECT * FROM b;
SELECT rowid FROM bv;  -- gets null

-- works if exactly one of a,b,... has an explicit rowid column
SELECT rowid from a,b,...;

A bit surprising (but I'm not sure how much it matters) is that for views
      SELECT rowid FROM myview;
works as expected if
      SELECT * FROM myview;
produces a column named rowid, but produces null (rather than reporting an error) otherwise.
2021-04-13
17:58 Reply: Select of text that matches scientific notation format automatically converted to number artifact: d5524555e5 user: billwade

The docs for Determination of Column Affinity say that if SQLite doesn't otherwise recognize the type name, the affinity is NUMERIC. NONE doesn't appear to be recognized.

If you want TEXT, say TEXT (or some alternate spelling, like CHAR).
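
A quick demonstration of the difference:

   CREATE TABLE t(a NONE, b TEXT);
   INSERT INTO t VALUES ('1e3', '1e3');
   SELECT a, typeof(a), b, typeof(b) FROM t;
   -- 1000|integer|1e3|text  (NUMERIC affinity converted a; TEXT affinity kept b)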

2021-04-12
12:28 Reply: find in b+-tree node artifact: 65ffbad0c8 user: billwade

When your query is

select * from test where name='tom';

It is likely that the database engine is answering that by doing something similar to:

select * from test where rowid = (select rowid from test where name='tom');

Your index on name is used to answer the inner query. The primary key index on rowid is used to answer the outer query.

However, your original query did all that for you. For this example, you didn't explicitly need to know anything about rowid, or any tree-node (or even to know that those things were used in the implementation).

In principle, SQLite might notice that the test table is small enough that using the index won't speed things up, and thus decide to avoid it entirely. All you are supposed to care about is that the query was performed correctly and efficiently.

2021-03-30
15:03 Post: Atomically initialize database artifact: 6d397b8f9a user: billwade

I'm using SQLite as an application file format, and my user says the file name is foo.bar.

I'd like my application to determine (here ignoring permission issues) that the file foo.bar is one of:

  1. No such file
  2. An SQLite file, properly initialized for my application
  3. An empty file (perhaps it becomes empty after processing a rollback journal). Initialize it.
  4. An SQLite file, left behind before initialization completed. Finish initialization.
  5. Something else, that I don't want to mess with.

Assuming my "proper initialization" requires at least two transactions (for instance I can't put both CREATE TABLE and PRAGMA JOURNAL_MODE = WAL in a single transaction) what are "best practices" for handling (4)?

Strategies that come to mind are:

  1. Initialize files under a different file name (but in the same directory), and rename them to foo.bar when successfully initialized.
  2. In my code, during initialization, implement a locking mechanism that prevents other instances of my code from using the file at all. First initialization transaction writes "initialization in progress" values somewhere. Last initialization transaction writes "initialization done" values somewhere in the database.

If using (2), are magic numbers in the application_id and user_version pragmas reasonable locations for "initialization status" values, or would a "normal" table value be more appropriate?
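
For reference, both pragmas read back as 0 in a freshly created database, which is part of what makes them candidates for status flags:

   PRAGMA application_id;  -- 0 until explicitly set
   PRAGMA user_version;    -- 0 until explicitly set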