SQLite Forum

50 most recent forum posts by user wyoung

2021-06-17
19:02 Reply: Unhandled exception on server application (artifact: 83a8cf32f2 user: wyoung)

compiling the SQLite code with that compiler option might help resolve it.

Are you certain that works through a C function pointer? I can see how GCC might unwind a regular C function call while processing a thrown C++ exception, but as loosey-goosey as C treats pointers, I wouldn't bet on it happening in this case.

Is writing portable code even a feasible requirement for most software projects?

I understand your point, but once you go wandering down into UB land, you can't expect a SAR party organized on the SQLite forum to do much more than call out, "Please return to the well-lit path!"

Anyway, wxWidgets on a 9-year-old OS that's been out of mainstream support for 3 years now? And using the obsolete VDS API atop that, without any explanation of what that might have to do with the reported problem? The OP shows many signs of doing questionable things, so it's fair to question whether he's taken care of UB-ish things.

04:25 Reply: Unhandled exception on server application (artifact: ca9983676f user: wyoung)

SQLite is pure C. It couldn’t throw a C++ exception if it wanted to.

2021-06-15
15:07 Reply: Opening a DB with SQLITE_OPEN_EXCLUSIVE (artifact: d8a585ed17 user: wyoung)

Open the DB with SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE from both A and B, then immediately begin a transaction containing the CREATE TABLE/INDEX/VIEW … IF NOT EXISTS calls that establish the desired schema. Only one of the two will create the DB if it’s not already present, and only one of the two will establish the schema. Both will then be free to use the DB in the normal fashion.
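
A minimal sketch of that idempotent schema setup, using a made-up DB name and table (the same SQL works from the C API right after sqlite3_open_v2() with those flags):

  $ sqlite3 shared.db "
      BEGIN IMMEDIATE;   -- take the write lock before touching the schema
      CREATE TABLE IF NOT EXISTS kv(k TEXT PRIMARY KEY, v TEXT);
      CREATE INDEX IF NOT EXISTS kv_v ON kv(v);
      COMMIT;"

Set a busy timeout first so whichever of A or B arrives second waits for the lock instead of failing with SQLITE_BUSY; it then finds the objects already present and moves on.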

2021-06-08
10:36 Reply: Docker JDBC 3.34.0 -> 3.35.0 (artifact: ab5c55e930 user: wyoung)

I get 1106 results for “SQLite JDBC” containers on Docker Hub. Which one do you mean?

I expect that the answer to that question will show that it’s a third-party contribution, not something provided by the SQLite project, so you’d do better to ask that third party to update their contribution.

2021-06-06
04:08 Reply: FileIO extension (artifact: e857017c76 user: wyoung)

You’ve ignored my “from within the SQLite source code tree” qualifier and the link from that to the page that will answer your question.

2021-06-05
11:40 Reply: Opening DB from readonly filesystem (btrfs snapshot) (artifact: d339187fc0 user: wyoung)

You could switch the journal mode to DELETE or to TRUNCATE prior to doing all of this shutdown and snapshot work you speak of.
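
A sketch of that first option, using a placeholder DB name:

  $ sqlite3 app.db "PRAGMA journal_mode=DELETE;"   # or TRUNCATE; checkpoints and removes the -wal file
  # ... take the btrfs snapshot here ...
  $ sqlite3 app.db "PRAGMA journal_mode=WAL;"      # then switch back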

However, I wonder why you're bothering given that you're using a snapshotting filesystem. Why not leave the application up and running, take the snapshot, and move on?

Yes, this means that if you have to roll back to a snapshot you might lose the last few uncommitted transactions, but that's also true if the application crashes, the computer gets hit by a meteor, or whatever. The nature of WAL is that it takes such events in stride and gets the DB back into a consistent state despite having interrupted transactions.

Indeed, if you hadn't spoken of btrfs, ZFS, disk virtualization, etc., that's exactly how I'd recommend you solve this case. It's one of the great benefits of filesystem-level snapshots.

11:04 Edit reply: FileIO extension (artifact: 6ea1be078a user: wyoung)

I have been using SQLite3 (3.12.1) on my Mac

I doubt that. I'd bet you've actually been using DB Browser for SQLite 3.12.1 (DB4S). Note the five-year span in release history and the difference in providers.

Beware that your question teeters on the edge of being off-topic here: we don't provide DB4S. It's only the fact that there's a way to answer your question using only SQLite that leads me to not brush you off and tell you to go ask the DB4S people, that being the software you're actually using and having trouble with, not SQLite proper.

I have tried using terminal to no avail.

One wonders why you didn't find these generic instructions, or if you did, why you didn't show those commands in your question and show what they gave you, so we'd have a leg up on helping you.

To compile this particular extension from within the SQLite source code tree, the generic instructions translate to this specific command on macOS:

  $ gcc -g -fPIC -dynamiclib -I. ext/misc/fileio.c -o ~/Desktop/fileio.dylib

The generic instructions don't tell you about the -I., which you need in this case to ensure you get the right version of sqlite3.h, since the system version is found first otherwise, and it doesn't have all of the necessary definitions to compile against current SQLite.

How to load the extension through DB Browser

Select the Tools → Load Extension... menu item once you've got your DB open.

The limitation of needing to have a DB open before you can load an extension is from DB4S, not something SQLite itself imposes, so I've filed a bug for that weakness for you.

provide me with the exact code for loading through terminal;

That's covered in the SQLite documentation link above.
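
A minimal sketch of the terminal version, assuming you built the .dylib onto your Desktop as above (DB and file names are placeholders):

  $ cd ~/Desktop
  $ sqlite3 mydb.sqlite
  sqlite> .load ./fileio
  sqlite> SELECT length(readfile('somefile.txt'));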

Whether the extension will still be recognised if I copy the database to my PC or to another Mac

A compiled loadable extension is platform-specific by its nature, so yes, you'll need to have a built copy of this extension for each platform you use, if the DB schema or the commands you give to SQLite use features from the extension.

If you want a given extension to be built statically into SQLite, you'll have to build your own SQLite binaries, which are also platform-specific.

The DB4S project provides instructions for doing the equivalent work in their docs, though for some reason their macOS guide is currently unwritten. Pursuing all of that is off-topic here.

whether I will need to set the extension up each time.

For SQLite proper, you can list commands that occur on launch of the sqlite3 shell by putting them in ~/.sqliterc. (Not sure what the Windows equivalent is.) I have no idea if DB4S runs the .sqliterc file or if they have a similar but differently-designed alternative. Solving that isn't on-topic here.
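
For instance, a one-line ~/.sqliterc like this would load the extension on every sqlite3 shell launch (the path is a placeholder):

  .load /Users/yourname/Desktop/fileio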

I am not technically trained

You wouldn't be the first person to learn some C simply in order to get some actual work done.

10:57 Reply: FileIO extension (artifact: 21b9fd63bb user: wyoung)

I have been using SQLite3 (3.12.1) on my Mac

I doubt that. I'd bet you've actually been using DB Browser for SQLite 3.12.1 (DB4S). Note the five-year span in release history and the difference in providers.

Beware that your question teeters on the edge of being off-topic here: we don't provide DB4S. It's only the fact that there's a way to answer your question using only SQLite that leads me to not brush you off and tell you to go ask the DB4S people, that being the software you're actually using and having trouble with, not SQLite proper.

I have tried using terminal to no avail.

One wonders why you didn't find these generic instructions, or if you did, why you didn't show those commands in your question and show what they gave you, so we'd have a leg up on helping you.

To compile this particular extension from within the SQLite source code tree, the generic instructions translate to this specific command on macOS:

  $ gcc -g -fPIC -dynamiclib -I. ext/misc/fileio.c -o ~/Desktop/fileio.dylib

The generic instructions don't tell you about the -I., which you need in this case to ensure you get the right version of sqlite3.h, since the system version is found first otherwise, and it doesn't have all of the necessary definitions to compile against current SQLite.

How to load the extension through DB Browser

Select the Tools → Load Extension... menu item once you've got your DB open.

The limitation of needing to have a DB open before you load the extension is from DB4S, not something SQLite itself imposes, so I've filed a bug for that weakness for you.

provide me with the exact code for loading through terminal;

That's covered in the SQLite documentation link above.

Whether the extension will still be recognised if I copy the database to my PC or to another Mac

A compiled loadable extension is platform-specific by its nature, so yes, you'll need to have a built copy of this extension for each platform you use, if the DB schema or the commands you give to the DB use features from the extension.

If you want a given extension to be built statically into SQLite, you'll have to build your own SQLite binaries, which are also platform-specific. The DB4S project provides instructions for doing that in their docs, though for some reason their macOS guide is currently unwritten. Pursuing all of that is off-topic here.

whether I will need to set the extension up each time.

For SQLite proper, you can list commands that occur on launch of the sqlite3 shell by putting them in ~/.sqliterc. (Not sure what the Windows equivalent is.) I have no idea if DB4S runs the .sqliterc file or if they have a similar but differently-designed alternative. Solving that isn't on-topic here.

I am not technically trained

You wouldn't be the first person to learn some C simply in order to get some actual work done.

2021-06-02
11:04 Edit reply: memory vs mmap (artifact: c711d0a331 user: wyoung)

If network disks were massively slower than local hard disks, …it would be a well-known problem.

It is a well-known problem. Maybe not well-known to you and yours, but I can dig up references from the dawn of network filesystem based computing in the early 1980s for this, if you require it.

The network team didn't believe that a local disk was faster that a network disk

If you're testing matching storage on both sides of the connection, then of course the local disk will not be any faster than the actual remote disk, but the remote disk isn't attached to the local CPU in the same way the local disk is. There are a whole pile of other considerations stemming from that fact:

  1. Local NVMe latency is going to be orders of magnitude faster than ping time to the remote NAS just on speed-of-light delays alone. The distance between the NVMe connector and the CPU is maybe a few cm long. The Ethernet cabling, switch fabric, internal MAC to CPU link, etc. is likely to come to tens of meters, so you're orders of magnitude off already. Ya canna change the laws o' physics, captin'! A foot of copper is a nanosecond, and at today's speeds, nanoseconds matter.

    Once upon a time, speed-of-light delays only affected computer design at the extreme high end, but we've been at that point down at the low end now too, for a very long time. If you did nothing to a Raspberry Pi but double its physical size, scaling everything on that board evenly, it would cease…to…work!

  2. The "stack" depth of a bare NVMe disk is a lot shallower than for

      Ethernet+ICMP+IP+UDP+kernel+NVMe+UDP+IP+ICMP+Ethernet
    

    Every packet of data to a typical remote NAS has to go clear down through the TCP/IP stack into the fileserver software, then back up that same long stack to return the result.

    And that's only for in-kernel network filesystems. If you're dealing with a userspace NAS technology like Samba, you have a kernel-to-userspace transition and probably a deeper file server code stack besides.

  3. Cooperation delays aren't zero. You speak of having 10 readers, but at bare minimum that must cut per-reader bandwidth by 10 on the remote side, and it's only that low if there is no arbitration among the readers, so that all readers can proceed in parallel, the ÷10 factor coming only from the shared disk bandwidth.

Let's get concrete. I just did a ping time test here to a nearby NAS on a quiet gigabit LAN and got an average of 350 µs over 25 packets. The read latency on a good NVMe SSD is likely to be 10 times lower than that, and with the ping test, you haven't even considered disk access latency yet.

If your own tests tell you there's an order of magnitude or two difference between two test cases, believe the test result, don't resort to argument-from-incredulity to talk yourself into believing that the test was wrong.

The characteristics of the network disk is that in the beginning, maybe for a couple of minutes, it is about 2.5 times slower than a local disk

I'm stunned it's that low! I'd consider that wild success.

over time it ends up being 20 times slower

Just a one order of magnitude speed drop? If after reading my arguments above you think that's unreasonable, go read them again. Point 1 alone shows you're experiencing an excellent result.

it has no way of knowing if another system updated the file.

As others have shown, that's a solvable problem, but you can't solve it for free. All coordination I/O has to go between the nodes somehow, and that has a cost.

the conclusion has been drawn that it is a problem with NFS3 and can't be changed.

NFS certainly has problems, but you can't charge all of the problems you're seeing to a network protocol alone. You can invent the bestest baddest speediest network file sharing protocol, and you'll still have the speed-of-light delays and coordination overhead among cooperating nodes.

TANSTAAFL.

09:31 Edit reply: memory vs mmap (artifact: 1d811bd607 user: wyoung)

If network disks were massively slower than local hard disks, …it would be a well-known problem.

It is a well-known problem. Maybe not well-known to you and yours, but I can dig up references from the dawn of network filesystem based computing in the early 1980s for this, if you require it.

The network team didn't believe that a local disk was faster that a network disk

If you're testing matching storage on both sides of the connection, then of course the local disk will not be any faster than the actual remote disk, but the remote disk isn't attached to the local CPU in the same way the local disk is. There are a whole pile of other considerations stemming from that fact:

  1. Local NVMe latency and ping time to the remote NAS is going to be orders of magnitude different just on speed-of-light delays alone. The distance between the NVMe connector and the CPU is maybe a few cm long. The Ethernet cabling, switch fabric, internal MAC to CPU link, etc. is likely to come to tens of meters, so you're orders of magnitude off already. Ya canna change the laws of physics, captin'! A foot of copper is a nanosecond, and at today's speeds, nanoseconds matter.

    Once upon a time, speed-of-light delays only affected computer design at the extreme high end, but we've been there at the low end now too, for a very long time. If you did nothing to a Raspberry Pi but double its physical size, scaling everything on that board evenly, it would cease…to…work!

  2. The "stack" depth of a bare NVMe disk is a lot shallower than for

      Ethernet+ICMP+IP+UDP+kernel+NVMe+UDP+IP+ICMP+Ethernet
    

    Every packet of data to a typical remote NAS has to go clear down through the TCP/IP stack into the fileserver software, then back up that same long stack to return the result.

    And that's only for in-kernel network filesystems. If you're dealing with a userspace NAS technology like Samba, you have a kernel-to-userspace transition and probably a deeper file server code stack besides.

  3. Cooperation delays aren't zero. You speak of having 10 readers, but at bare minimum that must cut per-reader bandwidth by 10 on the remote side, and it's that low only if there is no arbitration among the readers, so that all readers can proceed in parallel, the ÷10 factor coming only from the shared disk bandwidth.

Let's get concrete. I just did a ping time test here to a nearby NAS on a quiet gigabit LAN and got an average of 350 µs over 25 packets. The read latency on a good NVMe SSD is likely to be 10 times lower than that, and with the ping test, you haven't even considered disk access latency yet.

If your own tests tell you there's orders of magnitude difference between two methods, believe the test result, don't resort to argument-from-incredulity to talk yourself into believing that the test was wrong.

The characteristics of the network disk is that in the beginning, maybe for a couple of minutes, it is about 2.5 times slower than a local disk

I'm stunned it's that low! I'd consider that wild success.

over time it ends up being 20 times slower

Just one order of magnitude speed drop? If after reading my arguments above you think that's unreasonable, go read them again. Point 1 alone shows you're experiencing an excellent result.

it has no way of knowing if another system updated the file.

As others have shown, that's a solvable problem, but you can't solve it for free. All coordination I/O has to go between the nodes somehow, and that has a cost.

the conclusion has been drawn that it is a problem with NFS3 and can't be changed.

NFS certainly has problems, but you can't charge all of the problems you're seeing to a network protocol alone. You can invent the bestest baddest speediest network file sharing protocol, and you'll still have the speed-of-light delays on single links and an O(n²) type coordination problem among nodes if you want multiple participants.

TANSTAAFL.

09:28 Reply: memory vs mmap (artifact: 0cb5701e14 user: wyoung)

If network disks were massively slower than local hard disks, …it would be a well-known problem.

It is a well-known problem. Maybe not well-known to you and yours, but I can dig up references from the dawn of network filesystem based computing in the early 1980s for this, if you require it.

The network team didn't believe that a local disk was faster that a network disk

If you're testing matching storage on both sides of the connection, then of course the local disk will not be any faster than the actual remote disk, but the remote disk isn't attached to the local CPU in the same way the local disk is. There are a whole pile of other considerations stemming from that fact:

  1. Local NVMe latency and ping time to the remote NAS is going to be orders of magnitude different just on speed-of-light delays alone. The distance between the NVMe connector and the CPU is maybe a few cm long. The Ethernet cabling, switch fabric, internal MAC to CPU link, etc. is likely to come to tens of meters, so you're orders of magnitude off already. Ya canna change the laws of physics, captin'! A foot of copper is a nanosecond, and at today's speeds, nanoseconds matter.

    Once upon a time, speed-of-light delays only affected computer design at the extreme high end, but we've been there at the low end now too, for a very long time. If you did nothing to a Raspberry Pi but double its physical size, scaling everything on that board evenly, it would cease…to…work!

  2. The "stack" depth of a bare NVMe disk is a lot shallower than for

      Ethernet+ICMP+IP+UDP+kernel+NVMe+UDP+IP+ICMP+Ethernet
    

    Every packet of data to a typical remote NAS has to go clear down through the TCP/IP stack into the fileserver software, then back up that same long stack to return the result.

    And that's only for in-kernel network filesystems. If you're dealing with a userspace NAS technology like Samba, you have a kernel-to-userspace transition and probably a deeper file server code stack besides.

  3. Cooperation delays aren't zero. You speak of having 10 readers, but at bare minimum that must cut per-reader bandwidth by 10 on the remote side, and it's that low only if there is no arbitration among the readers, so that all readers can proceed in parallel, the ÷10 factor coming only from the shared disk bandwidth.

Let's get concrete. I just did a ping time test here to a nearby NAS on a quiet gigabit LAN and got an average of 350 µs over 25 packets. The read latency on a good NVMe SSD is likely to be 10 times lower than that, and with the ping test, you haven't even considered disk access latency yet.

If your own tests tell you there's orders of magnitude difference between two methods, believe the test result, don't resort to argument-from-incredulity to talk yourself into believing that the test was wrong.

The characteristics of the network disk is that in the beginning, maybe for a couple of minutes, it is about 2.5 times slower than a local disk

I'm stunned it's that low! I'd consider that wild success.

over time it ends up being 20 times slower

Just one order of magnitude speed drop? If after reading my arguments above you think that's unreasonable, go read them again. Point 1 alone shows you're experiencing an excellent result.

it has no way of knowing if another system updated the file.

As others have shown, that's a solvable problem, but you can't solve it for free. All coordination I/O has to go between the nodes somehow, and that has a cost.

the conclusion has been drawn that it is a problem with NFS3 and can't be changed.

NFS certainly has problems, but you can't charge all of the problems you're seeing to a network protocol alone. You can invent the bestest baddest speediest network file sharing protocol, and you'll still have the speed-of-light delays on single links and an O(n²) type coordination problem among nodes if you want multiple participants.

TANSTAAFL.

2021-05-31
01:33 Reply: SQLITE_IOERR on `pragma journal_mode=wal;` (artifact: a057fde650 user: wyoung)

That's not a strong enough warning: attempting to use WAL over a network link will cause failures. Quoting disadvantage point 2 in the WAL docs: "WAL does not work over a network filesystem." No could-be may-be here at all.

01:30 Reply: SQLITE_IOERR on `pragma journal_mode=wal;` (artifact: 7d686c4fa0 user: wyoung)

You'll be far more likely to get a fix if you can post a short program that reproduces either symptom.

Beware: 10 lines of user-written code that requires half a gig of some non-C development environment to be installed, plus who knows how many library dependencies, doesn't count as a "short program."

I say this because you speak vaguely of "bindings," which tells us you aren't using the SQLite C interface at all. That's fine, but since the SQLite developers can't fix things that aren't among the code bases they maintain, until you can distill the problem demonstration to code that squarely puts the blame on one of those code bases, you're going to be in a finger-pointing situation.

This isn't to say that you need to write your demonstration program in C, but if you can show the symptom that way, it'd be more likely to result in a fix than a demo in some other language that has to dig through multiple layers of "bindings" to get down to the C layer. One of those other layers could be what's actually at fault.
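
A sketch of what such a distilled demonstration could look like, using nothing but the sqlite3 shell against a path on the affected filesystem (the path is a placeholder):

  $ sqlite3 /path/on/affected/volume/test.db "PRAGMA journal_mode=wal;"

If that prints "wal", the C layer is doing its job and the fault likely lies somewhere in the binding layers; if it fails the same way, you have an on-topic reproduction.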

2021-05-28
21:13 Reply: Bug of json_each if object has value field. (artifact: 2277af9e15 user: wyoung)

The docs warn against relying on the cases where it does work; see here.

2021-05-22
23:38 Reply: can I insert an 8GB file into sqlite? (artifact: 5b64eb1d24 user: wyoung)

There are plenty of resumable, chunked transfer tools to deal with this

As long as you're willing to shut all conns to the DB down and leave them down while the transfer continues, sure.

Resumability doesn't let you start the app up, modify the DB some, shut it down, and resume the transfer. That'll almost guarantee a corrupt DB at the end of the transfer. It also doesn't let you copy the DB once, do some work on the DB, and then copy only the changes, as zfs send will, given suitable snapshots.

Indeed, snapshotting the DB under ZFS and then doing a zfs send is a much better plan than using userspace-level resumable transfer tools on a SQLite DB. You might lose the last few modifications in uncommitted transactions, but that's no worse than killing the app while it runs. The journal gets replayed, and the DB is back in action again.
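
A sketch of that plan, with made-up pool, dataset, and host names:

  $ zfs snapshot tank/appdata@2021-05-22
  $ zfs send -i tank/appdata@2021-05-21 tank/appdata@2021-05-22 |
        ssh backuphost zfs receive backup/appdata

Only the blocks changed since the previous snapshot cross the wire.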

it's done all the time in several scientific "big data" disciplines that have a taste for colossal individual files.

Are these individual files random-access, modified internally as SQLite DBs are, or are they written once and then, at most, appended to?

Node.js JIT-compiles your script into native machine code.

Ah, that explains why VSCode feels about as fast as 4.3 BSD vi over a 9600 bps VT220 connection, then.

JIT isn't magic. Dynamic languages have unavoidable costs, no matter how well optimized, which is why statically-compiled languages are still being created and used.

22:47 Reply: can I insert an 8GB file into sqlite? (artifact: 0784435a6d user: wyoung)

copying the database just means copying a single file, and insuring that the checksums match.

This is fine in principle, but when the size of the DB gets so big that it takes days and days to transfer, what happens when there's an interruption?

Even something like zfs send doesn't copy the entire filesystem in a single copy: it collects the changed blocks and sends only those blocks. And it's now resumable. How do you get the same features with SQLite?

What I need is fast IO and fast http transfers

Write speed isn't going to differ greatly between the cases, and as for reading, I doubt you're going to find anything faster than sendfile(2), which doesn't work for files buried inside a DB.

You'll notice that the "faster than FS" article entirely ignores this aspect of things.

I feel like if theres a concern around indexing millions of files in a database, then that problem is exacerbated by storing each file in 1MiB chunks in the database.

Yes, TANSTAAFL.

the number of files stored to be somewhere in the hundreds of thousands, and the total storage size to be hundreds of gigabytes.

That gives an average file size of 1 MB, which feels like a bait-and-switch relative to the original thread title.

If all we're talking about are a few rare exceptions that break the 1 GiB barrier, then why worry about chunking the files at all? It's only when the vast majority of files are larger than that (e.g. a digital video storage archive) that you have to worry about the cost of chunking, the 1 GiB limit, etc.

17:38 Reply: Building a queue based on SQLite (artifact: b047f5ef5b user: wyoung)

Why have you rejected the pre-built and battle-tested message queuing systems? There are many to choose from.

I've built queues in SQLite before, but on the scale of one insert per several minutes, with dequeueing on the order of a few per hour, the queue providing this low-level buffering that lets the sender avoid spamming the receiver.
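
A minimal sketch of the kind of low-volume queue I mean, with made-up names (the RETURNING clause needs SQLite 3.35 or later):

  CREATE TABLE queue(id INTEGER PRIMARY KEY, payload TEXT NOT NULL);

  -- enqueue
  INSERT INTO queue(payload) VALUES ('do the thing');

  -- atomically claim and remove the oldest item
  DELETE FROM queue
   WHERE id = (SELECT min(id) FROM queue)
  RETURNING payload;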

When you get to the scale you talk about, rolling your own atop a tool not made to do this sort of thing seems like madness to me.

17:28 Edit reply: can I insert an 8GB file into sqlite? (artifact: 34eefe8a7d user: wyoung)

portability

You think a single file with potentially terabytes of content is portable?

I mean, houses are portable by the same logic. All you have to do…all you have to do… is lift them up on massive hydraulic jacks and put them on a container ship to move them across continents. What's the problem?

I think you'll find that a pile of tens of thousands of gigabyte-scale files is more easily portable, for the same reason that when moving, leaving the house behind and packing up only its contents makes more sense.

I'm not telling you that you shouldn't use SQLite for terabyte-scale databases. I'm just telling you that if you expect to need to move it around a lot, you'll find it easier to do on a pile-of-files so you can retry broken transfers, use rsync sanely, etc.

this article which I found very intriguing

Part of what's going on in that article is that the entire DB can get loaded into RAM, and the OS will treat it as a single file, which changes how caching works relative to thousands of separate files.

I doubt its results are very relevant to your particular application.

avoiding folder count limits

Those limits in filesystems aren't arbitrary: they exist to keep the filesystem fast. It takes too much time to index arbitrarily many millions of directory entries, so they set a finite limit to encourage you to make use of hierarchy to keep index sizes sane.

The very same forces exist inside a SQLite database. It's just C code, not magic. Millions of DB rows have an indexing overhead, too.

I plan to write this in typescript

…and you're concerned with efficiency?

Decide: do you want the program to be fast to execute, or do you want it to be fast for you to write? You can't get both.

I'm not saying it's wrong to write your service in Typescript. For network-facing applications, the latencies and bandwidth limitations of the network often make the inefficiency of the remote network service irrelevant, which is why we have so much scripting on the server side. However, you'd best be clear about this and stop trying to worry about server-side efficiency.

Having said that, realize that when you start talking about scaling the application, efficiency can start mattering again. The tech giants are rewriting Perl, Python, PHP, etc. services in C++, Go, Rust, etc. in order to save money on data centers. Even counting Silicon Valley engineering salaries, it can pay off.

Another common situation is writing efficient services so you can avoid paying a lot in cloud fees. Back when everything was on-premises, taking 10x the development time to get a service to run in 1% of the total system resources versus 10% made no sense, but now that we're loading many customer loads onto each machine in cloud architectures, that 10x inefficiency cost goes directly into hosting fees.

It's the same tradeoff as for the big guys: is it cheaper to pay for the engineering to get it to be 10x faster or is it cheaper to just pay 10x in hosting fees?

Both approaches have their points. You need to think clearly about which approach matters in your particular situation.

sounds like I am limited to ~2GB max file storage.

Yes, for a single chunk.

So, you could chunk the files, storing 2 GiB or less per chunk. I'd actually suggest something more on the order of 1 MiB. That's enough to saturate any network buffers, so that you're sure to have the second megabyte of data up in RAM before you finish sending the first. Once you reach that point, there's no point making the chunks bigger, if the point is to serve them over the network.
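
A sketch of what the chunked layout could look like, schema names made up:

  CREATE TABLE files(
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
  );
  CREATE TABLE chunks(
    file_id INTEGER NOT NULL REFERENCES files(id),
    seq     INTEGER NOT NULL,      -- chunk number within the file
    data    BLOB NOT NULL,         -- at most 1 MiB per row
    PRIMARY KEY(file_id, seq)
  );

  -- reassemble one file, in order
  SELECT data FROM chunks WHERE file_id = ?1 ORDER BY seq;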

Or, you can do as iTunes, Plex, etc. do and store the actual files outside the DB, keeping only the metadata in the DB itself.

17:19 Reply: can I insert an 8GB file into sqlite? (artifact: 3673bbe0b8 user: wyoung)

portability

You think a single file with potentially terabytes of content is portable?

I mean, houses are portable by the same logic. All you have to do…all you have to do… is lift them up on massive hydraulic jacks and put them on a container ship to move them across continents. What's the problem?

I think you'll find that a pile of tens of thousands of gigabyte-scale files is more easily portable, for the same reason that when moving, leaving the house behind and packing up only its contents makes more sense.

I'm not telling you that you shouldn't use SQLite for terabyte-scale databases. I'm just telling you that if you expect to need to move it around a lot, you'll find it easier to do on a pile-of-files so you can retry broken transfers, use rsync sanely, etc.

this article which I found very intriguing

Part of what's going on in that article is that the entire DB can get loaded into RAM, and the OS will treat it as a single file, which changes how caching works relative to thousands of separate files.

I doubt its results are very relevant to your particular application.

avoiding folder count limits

Those limits in filesystems aren't arbitrary: they exist to keep the filesystem fast. It takes too much time to index arbitrarily many millions of directory entries, so they set a finite limit to encourage you to make use of hierarchy to keep index sizes sane.

The very same forces exist inside a SQLite database. It's just C code, not magic. Millions of DB rows have an indexing overhead, too.

I plan to write this in typescript

…and you're concerned with efficiency?

Decide: do you want the program to be fast to execute, or do you want it to be fast for you to write? You can't get both.

I'm not saying it's wrong to write your service in Typescript. For network-facing applications, the latencies and bandwidth limitations of the network often make the inefficiency of the remote network service irrelevant, which is why we have so much scripting on the server side. However, you'd best be clear about this and stop trying to worry about server-side efficiency.

Having said that, realize that when you start talking about scaling the application, efficiency can start mattering again. The tech giants are rewriting Perl, Python, PHP, etc. services in C++, Rust, etc. in order to save money on data centers. Even counting Silicon Valley engineering salaries, it can pay off.

Another common situation is writing efficient services so you can avoid paying a lot in cloud fees. Back when everything was on-premises, taking 10x the development time to get a service to run in 1% of the total system resources versus 10% made no sense, but now that we're loading many customer loads onto each machine in cloud architectures, that 10x inefficiency cost goes directly into hosting fees.

It's the same tradeoff as for the big guys: is it cheaper to pay for the engineering to get it to be 10x faster or is it cheaper to just pay 10x in hosting fees?

Both approaches have their points. You need to think clearly about which approach matters in your particular situation.

sounds like I am limited to ~2GB max file storage.

Yes, for a single chunk.

So, you could chunk the files, storing 2 GiB or less per chunk. I'd actually suggest something more on the order of 1 MiB. That's enough to saturate any network buffers, so that you're sure to have the second megabyte of data up in RAM before you finish sending the first. Once you reach that point, there's no point making the chunks bigger, if the point is to serve them over the network.

Or, you can do as iTunes, Plex, etc. do and store the actual files outside the DB, keeping only the metadata in the DB itself.

16:44 Reply: Writing arrays (artifact: f79b3367f2 user: wyoung)

You could use the JSON extension for this.
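
For example, with a made-up table:

  CREATE TABLE t(id INTEGER PRIMARY KEY, arr TEXT);
  INSERT INTO t(arr) VALUES (json_array(1, 2, 3));

  -- one row per array element
  SELECT value FROM t, json_each(t.arr) WHERE t.id = 1;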

2021-05-20
04:06 Edit reply: How to check if HAVE_USLEEP is activated? (artifact: b1b053b78f user: wyoung)

Autoconf-based projects avoid installing config.h by default on purpose. It's a thorny issue, solvable, but not without effort.

04:06 Edit reply: How to check if HAVE_USLEEP is activated? (artifact: 80292a0bb9 user: wyoung)

Autoconf-based projects avoid installing config.h by default on purpose. It's a thorny issue, solvable, but not without effort.

04:06 Reply: How to check if HAVE_USLEEP is activated? (artifact: 5eeb576a9e user: wyoung)

Autoconf-based projects don't install config.h on purpose. It's a thorny issue, solvable, but not without effort.

2021-05-17
12:56 Reply: memory vs mmap (artifact: d959ab9834 user: wyoung)

No, because once again your assumptions are wrong.

A RAM disk is not the same thing as a brk(2) block assigned persistently to the process. It's on the other side of a kernel wall, so all accesses have to go through the kernel, just as access to any other type of "disk" does. I'd therefore expect a RAM disk to be considerably slower than malloc().

While it's fine to bring assumptions to the table when testing hypotheses, the system isn't going to lie to you about timing. When you benchmark the system in a given way, it tells you the truth about how long that sort of operation takes. If that falsifies your hypothesis, you have to either give up the initial assumption as disproven or you have to dig deeper.

Since digging deeper involves investigating how your OS works, it's off-topic here until you can pull that investigation back full-circle to show that SQLite is operating in an inefficient manner. Most likely, you'll find that SQLite is doing the best it can given the portable facilities available.

Until then, your results showing that different methods give different speeds are valuable in themselves: they tell you which method is fastest, so if that's all you wanted to know, you're done.

10:58 Reply: memory vs mmap (artifact: b4d65f83c0 user: wyoung)

You're assuming that all memory is equally fast. RAM mapped into the local process by malloc() is not necessarily as fast as memory that was previously allocated only to the OS's buffer cache and is now mmap()ed into your process's memory space.

The first situation requires at most a single user-to-kernel space transition per malloc() call, and often none at all, because malloc() implementations usually hand out chunks from a previously allocated slab and return without calling brk(2) or otherwise entering the kernel.

While it is possible that a given kernel could literally map every page belonging to a file into your process's RAM on the initial mmap() call, there's no actual requirement that it do so. It could simply take your arguments, remember your requirements, and then page chunks of that file's buffer cache space into your process as necessary. Page faults and user-to-kernel space transitions aren't free.

Even if it did map everything into your process's virtual address space as a single unbroken buffer, as you're apparently assuming, it would almost certainly do so in a fragmented way as compared to the malloc() based method used by SQLite's :memory: DB option. That's going to involve the CPU's TLB mechanism, which is also very much not free to use.

You'd have to dig into your OS's implementation of mmap() to find out how it actually operates. Beware: you'll probably find it one of the hairiest parts of the kernel. Memory management on modern systems is complicated.

2021-05-12
15:14 Reply: SQLite 3.7.5 and above versions available in Redhat 6 (artifact: 56ed992ef9 user: wyoung)

Click on any of the release checkin IDs of the version tags here, then click the "Tarball" or "Zip archive" links on the subsequent page.

This is the one for v3.7.5, for example.

13:56 Reply: SQLite 3.7.5 and above versions available in Redhat 6 (artifact: cf9184eb51 user: wyoung)

Red Hat generally does not update the version of a given package once it has shipped in the first stable release. All changes are made by backporting fixes from later versions into the same version originally released. This is how they get their uncommon stability: by not replacing an existing version of packaged software with a new one. The package versions that were in place upon RHEL 6's initial release on November 10, 2010 are the versions shipped in its last release.

(There are sometimes exceptions to this, but they're quite rare, and generally peripheral besides. e.g. Firefox, upon which no core OS software depends.)

However, it is also the case that RHEL 6 has been completely outside of any support window for nearly half a year, so even if none of the above were true, there would be no way to get Red Hat to give you an upgrade now. They're just going to tell you to upgrade to RHEL 7 or RHEL 8.

If you insist on remaining on this obsolete platform and also wish to have updated software atop it, you'll have to arrange to build it yourself. There are many paths forward, but you can't reasonably expect someone else to take on the pain of 10 years of technical debt for you.

2021-05-06
14:36 Reply: select * from fcc where unique_system_identifier='4350131'; returns 1000+ records (artifact: 1b60decf49 user: wyoung)

You aren’t copying or pasting the hard tabs in the third command. It’s a single 30-element tab-separated line, repeating your field list from your initial post.

The single quotes should preserve the tabs, but perhaps they aren’t being pasted into your terminal properly.

Worst case, you can construct the line manually.

08:16 Reply: select * from fcc where unique_system_identifier='4350131'; returns 1000+ records (artifact: 6256f12a10 user: wyoung)
05:48 Edit reply: select * from fcc where unique_system_identifier='4350131'; returns 1000+ records (artifact: 51050e7568 user: wyoung)

The data can be downloaded here

Okay, that puts a different spin on things. It's simply not well-formed for direct SQLite input.

I was able to get a conversion that allowed your queries to work with:

$ pip3 install --user csvs-to-sqlite

$ dos2unix EN.dat

$ echo 'record_type	unique_system_identifier	uls_file_number	ebf_number	call_sign	entity_type	licensee_id	entity_name	first_name	mi	last_name	suffix	phone	fax	email	street_address	city	state	zip_code	po_box	attention_line	sgin	frn	applicant_type_code	applicant_type_other	status_code	status_date	lic_category_code	linked_license_id	linked_callsign' > fcc.tsv

$ tr '|' '\t' < EN.dat >> fcc.tsv 

$ csvs-to-sqlite fcc.tsv fcc.db -s $'\t'

The third command is necessary to give the TSV file (fcc.tsv) the necessary headers. The hard tabs in the command made it through the forum posting's formatting path, so you should be able to copy the command directly.

This path still causes a few complaints, and it results in a somewhat different schema than the one you posted, but the result seems cleaner.

All of the records that I imported have EN as the record_type.

Sorry, the wrapping in your message made that unclear.

05:45 Edit reply: select * from fcc where unique_system_identifier='4350131'; returns 1000+ records (artifact: 3c25170978 user: wyoung)

The data can be downloaded here

Okay, that puts a different spin on things. It's simply not well-formed for direct SQLite input.

I was able to get a conversion that allowed the first of your queries to work with:

$ pip3 install --user csvs-to-sqlite

$ dos2unix EN.dat

$ echo 'record_type	unique_system_identifier	uls_file_number	ebf_number	call_sign	entity_type	licensee_id	entity_name	first_name	mi	last_name	suffix	phone	fax	email	street_address	city	state	zip_code	po_box	attention_line	sgin	frn	applicant_type_code	applicant_type_other	status_code	status_date	lic_category_code	linked_license_id	linked_callsign' > fcc.tsv

$ tr '|' '\t' < EN.dat >> fcc.tsv 

$ csvs-to-sqlite fcc.tsv fcc.db -s $'\t'

The third command is necessary to give the TSV file the necessary headers. It appears that the hard tabs will make it through the forum posting, so you should be able to copy the command directly.

That still caused a few complaints, and it results in a somewhat different schema than the one you posted, but it seems cleaner.

All of the records that I imported have EN as the record_type.

Sorry, the wrapping in your message made that unclear.

05:44 Reply: select * from fcc where unique_system_identifier='4350131'; returns 1000+ records (artifact: 7a6dd7f407 user: wyoung)

The data can be downloaded here

Okay, that puts a different spin on things. It's simply not well-formed for direct SQLite input.

I was able to get a conversion that allowed the first of your queries to work with:

$ pip3 install --user csvs-to-sqlite

$ dos2unix EN.dat

$ echo 'record_type	unique_system_identifier	uls_file_number	ebf_number	call_sign	entity_type	licensee_id	entity_name	first_name	mi	last_name	suffix	phone	fax	email	street_address	city	state	zip_code	po_box	attention_line	sgin	frn	applicant_type_code	applicant_type_other	status_code	status_date	lic_category_code	linked_license_id	linked_callsign' > EN.tsv

$ tr '|' '\t' < EN.dat >> fcc.tsv 

$ csvs-to-sqlite fcc.tsv fcc.db -s $'\t'

The third command is necessary to give the TSV file the necessary headers. It appears that the hard tabs will make it through the forum posting, so you should be able to copy the command directly.

That still caused a few complaints, and it results in a somewhat different schema than the one you posted, but it seems cleaner.

All of the records that I imported have EN as the record_type.

Sorry, the wrapping in your message made that unclear.

2021-05-05
22:57 Reply: select * from fcc where unique_system_identifier='4350131'; returns 1000+ records (artifact: 11cbfcef86 user: wyoung)

Without this EN.dat file you're asking us to speculate on a few points.

However, what immediately catches my eye is that first field, record_type. That's a good sign of a single flat table that's trying to do the job of multiple tables interlinked by some relation, probably this unique_system_identifier value.

What then follows from that is this guess: your SELECT statement should include a WHERE record_type='EN' or similar, to filter out all of the other record types.
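
Something along these lines, guessing at your intent:

  SELECT *
    FROM fcc
   WHERE record_type = 'EN'
     AND unique_system_identifier = '4350131';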

Once you sort these initial confusions out, if you're going to use this data long-term, I'd do some serious refactoring on the data to get it into normal form.

22:52 Reply: Table Btree and Index Btree (artifact: f78834853d user: wyoung)

The table is created using a table Btree and the index is created with an index Btree.

2021-05-04
10:59 Reply: "Office Space"... I'm losing all those pennies (artifact: da49bb8c63 user: wyoung)
2021-05-02
18:44 Reply: "Office Space"... I'm losing all those pennies (artifact: 303907ff18 user: wyoung)
2021-04-30
16:23 Reply: WAL files deleted (artifact: 8307c833eb user: wyoung)

If it didn’t “ever” happen, there’s a fair chance you weren’t closing SQLite cleanly. In the default configuration, those files only exist while a program is actively using the DB.

You might want to check for this in the read/write case, since you may be causing SQLite to drop partially-complete transactions and such.

2021-04-25
13:33 Edit reply: zero width space (artifact: 0cadaa7573 user: wyoung)

This u200d seems to be a zero width space.

It's a zero-width joiner, actually. A zero-width space is something else.

convert a unicode entity to an utf-8 character?

All UTF-8 characters are Unicode entities.

Are you perhaps saying you have a literal "u200d" string (5 ASCII characters) in the input and need to turn it into an actual Unicode character, encoded as UTF-8? If so, then presuming the terminal locale is UTF-8:


$ sqlite3 :memory: "select replace('अमिताभ बच्u200dचन', 'u200d', char(0x200D))"
अमिताभ बच्‍चन
$ !! | od -t c
0000000    अ  **  **   म  **  **   ि  **  **   त  **  **   ा  **  **   भ
0000020   **  **       ब  **  **   च  **  **   ्  **  ** 342 200 215   च
0000040   **  **   न  **  **  \n                                        
0000046

That "342 200 215" bit is the octal representation of the character 0x200D encoded in UTF-8.

In other words, you might be looking for a combination of replace() and char().

EDIT: Shortened example code for clarity. Also, bang-bang!

13:27 Edit reply: zero width space (artifact: 6ffa2a8e88 user: wyoung)

This u200d seems to be a zero width space.

It's a zero-width joiner, actually. A zero-width space is something else.

convert a unicode entity to an utf-8 character?

All UTF-8 characters are Unicode entities.

Are you perhaps saying you have a literal "u200d" string (5 ASCII characters) in the input and need to turn it into an actual Unicode character, encoded as UTF-8? If so, then presuming the terminal locale is UTF-8:


$ sqlite3 :memory: "select replace('अमिताभ बच्u200dचन', 'u200d', char(0x200D))"
अमिताभ बच्‍चन
$ !! | od -t c
0000000    अ  **  **   म  **  **   ि  **  **   त  **  **   ा  **  **   भ
0000020   **  **       ब  **  **   च  **  **   ्  **  ** 342 200 215   च
0000040   **  **   न  **  **  \n                                        
0000046

That "342 200 215" bit is the octal representation of the character 0x200D encoded in UTF-8.

In other words, you might be looking for a combination of replace() and char().

EDIT: Shortened example code for clarity. Also, bang-bang!

13:01 Edit reply: zero width space (artifact: 83706d2d1d user: wyoung)

This u200d seems to be a zero width space.

It's a zero-width joiner, actually. A zero-width space is something else.

convert a unicode entity to an utf-8 character?

All UTF-8 characters are Unicode entities.

Are you perhaps saying you have a literal "u200d" string (5 ASCII characters) in the input and need to turn it into an actual Unicode character, encoded as UTF-8? If so, then presuming the terminal locale is UTF-8:

$ sqlite3 x.db
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> create table a(a);
sqlite> insert into a values('अमिताभ बच्u200dचन');
sqlite> update a set a=replace(a, 'u200d', char(0x200d));
sqlite> ^D
$ sqlite3 x.db 'select * from a' | od -t c
0000000    अ  **  **   म  **  **   ि  **  **   त  **  **   ा  **  **   भ
0000020   **  **       ब  **  **   च  **  **   ्  **  ** 342 200 215   च
0000040   **  **   न  **  **  \n                                        
0000046

That "342 200 215" bit is the octal representation of the character 0x200D encoded in UTF-8.

In other words, you might be looking for a combination of replace() and char().

12:57 Reply: zero width space (artifact: 43c783240e user: wyoung)

This u200d seems to be a zero width space.

It's a zero-width joiner, actually. A zero-width space is something else.

I need to replace it by an actual character

A zero-width joiner is an actual character. Or rather, in Unicode, it has a representation as a code point. Are you trying to make some finer distinction than this?

convert a unicode entity to an utf-8 character?

All UTF-8 characters are Unicode entities.

Are you perhaps saying you have a literal "u200d" string in the input and need to turn it into an actual Unicode character, encoded as UTF-8? If so, then presuming the terminal locale is UTF-8:

$ sqlite3 x.db
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> create table a(a);
sqlite> insert into a values('अमिताभ बच्u200dचन');
sqlite> update a set a=replace(a, 'u200d', char(0x200d));
sqlite> ^D
$ sqlite3 x.db 'select * from a' | od -t c
0000000    अ  **  **   म  **  **   ि  **  **   त  **  **   ा  **  **   भ
0000020   **  **       ब  **  **   च  **  **   ्  **  ** 342 200 215   च
0000040   **  **   न  **  **  \n                                        
0000046

In other words, you might be looking for a combination of replace() and char().

2021-04-21
17:00 Reply: The NUMERIC data type? (artifact: c3b826e9c3 user: wyoung)

It's really a pity that SQLite doesn't implement a true DECIMAL/NUMERIC data type

Isn't that tantamount to lamenting the fact that SQLite isn't an arbitrary precision arithmetic engine? What's "Lite" about that?

If you want APA, strap one of the many available engines to SQLite, then store the results in SQLite columns. Depending on your needs, you might store the calculated data as:

  • strings to allow stringwise max() and such if you zero-pad the results; or
  • BLOBs containing some binary representation of the APA data structure; or
  • large integers, multiplied up to strip the fractional part (e.g. ×100 to store US dollars as whole cents); or
  • multiple columns containing a serialized form of the APA data structure (e.g. two ints for mantissa and exponent)

binary floating point types are not precise and therefore unusable for commercial calculations

That depends on how it's done. If you use epsilon values carefully, you can keep things quite nicely on the rails using FP arithmetic.

Or, you can do everything with integer math (e.g. calculate on cents rather than dollars for US monetary calculations) and treat fractional dollars as a presentation issue.
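
A sketch of the integer-math approach, with a made-up table:

  CREATE TABLE ledger(id INTEGER PRIMARY KEY, amount_cents INTEGER NOT NULL);
  INSERT INTO ledger(amount_cents) VALUES (1999), (250);   -- $19.99 and $2.50

  -- exact total; rendering it as dollars is purely presentation
  SELECT printf('$%.2f', sum(amount_cents) / 100.0) FROM ledger;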

2021-04-19
13:32 Reply: Access strategy for a one-thread per connection http server (artifact: 2d2a88dde6 user: wyoung)

What is the actual overhead of opening fresh database connections ?

The entire database schema has to be re-parsed, for one thing.

That's another good reason to adopt a connection pooling strategy alongside a thread pooling strategy. Create only $CORES × 1.5 SQLite conns + threads or so, then reuse them rather than re-create them on each HTTP hit.

If this is a web app you're talking about, realize that the HTTP hit count can be as high as one per static asset on the page, plus one for the basic HTML page, plus one for each dynamic request (AJAX calls and the like). This can be a hundred hits per "page", easily. You really really do not want to be starting and stopping a thread + SQLite conn per HTTP hit.

are multiple reads worked out in parallel ?

It depends on the concurrency mode.

For your application, you probably want WAL mode. Its concurrency behavior is described in that document. (§2.2)

If you need rollback or another SQLite concurrency mode, then the concurrent performance characteristics differ.

I couldn't find in the docs what happens with prepared statement if multiple threads may try to use concurrently

I'd expect it to crash each time the underlying SQLite schema changed out from under the program, causing SQLite to re-parse each prepared statement still active, thus causing race conditions in all the other threads also using that prepared statement.

So: don't.

10:59 Reply: AVG with GROUP BY returns 0 (artifact: 4a469411eb user: wyoung)

SQLite doesn't have a specific DECIMAL(x, y) data type. It has a fallback for SQL compatibility with other RDBMSes, but it probably doesn't give you the behavior you expect.

You should probably switch to the decimal extension.

07:40 Reply: Access strategy for a one-thread per connection http server (artifact: 52bf9b7e0e user: wyoung)

moderate traffic…each http connection using a dedicated thread

How many HTTP servers designed to operate at scale — your qualifier, mind! — are you aware of that do that?

There's a reason the number is small: the default stack size on "server" class OSes tends to be up in the megabytes. With only 1000 simultaneous conns — a value that some may well consider "moderate" — you're chewing up gigs of RAM just for the per-thread call stacks.

You can lower this value at compile and/or at run time, at an increased risk of blowing the stack if you choose a value that's too low, but that just brings you to the next reason no serious HTTP server does this: managing thousands of active processes wastes time on context switches.

A far more sensible design is to set up a thread pool that's no larger than a small multiple of the CPU core count. For most workloads, a multiple of 1-2 is as high as you want to go. No higher value gives greater performance, and beyond a certain limit, the overheads start actively eating into whatever performance you've gained by going with a multi-threaded design in the first place.

To bring this back on topic for the forum, you would have a single SQLite conn per thread, temporarily lent to the HTTP I/O handler as part of everything else it's temporarily assigned from the pool.

Now realize that you're not far off of asking one of the lowest-numbered items in the SQLite FAQ. Why do you suppose it got answered so early, and in quite that way?

That brings us to Major Option #2: go with an event-driven design and skip threads entirely. If your notions of "moderate" and "scale" are low enough, you may never even peg a single CPU core, and until you do, there's not a whole lot of point taking on the pain a multi-threaded design will cause you.

Yes, will, not may or might. Read the paper linked from the FAQ item before you even think about coming back with an argument against that claim. And if you do decide to, realize you're going up against a Berkeley CS prof in doing so.

2021-04-14
13:14 Reply: usage of SHA-1 (artifact: 225af6f87d user: wyoung)

The operation would result in removing the following protected packages: dnf

On further reflection, I've realized that the above command only reports a short-circuit failure case. While that single dependency problem may be enough reason in and of itself not to be making threats about removing SQLite from RHEL 9 if the SQLite project developers don't accede to IBM/Red Hat's demands, the actual situation is much more fraught.

On my CentOS 8 box, I get this result:

  $ dnf repoquery -q --installed --whatrequires sqlite-libs --recursive | wc -l
  130
12:14 Reply: usage of SHA-1 (artifact: 64a67433e7 user: wyoung)

I would like to ask

As far as I can tell, you're making demands, not asking anything.

Maybe instead of sending around sternly-worded broad-based form letters, you could make an effort to try to understand the projects you're sending them to first and tailor the message accordingly?

Oh, and out of personal interest, would you mind linking us to the copy of this same form letter that your organization has presumably sent to the Git project? I really want to see the resulting firestorm from dropping this little bomblet on 'em.

To avoid your package from breaking due to SHA-1 being disabled...

This is what I'm getting at when I say you haven't done your homework before sending this message off.

Are you sitting down? If not, please do, because you need to let this command's output bake your noodle for a time before you reply:

  $ sudo dnf remove sqlite-libs
  Error: 
   Problem: The operation would result in removing the following protected packages: dnf

With the understanding resulting from this demonstration of the direction the dependency arrow actually points, would you perhaps reconsider coming across all strong-arm on this matter?

— Signed, a CentOS user who's taken a shot to the jewels from y'all once already this year and doesn't want another, thanks.

P.S. I support your desire to remove SHA-1 everywhere possible, but Red Hat needs to realize it's dependent on the FOSS community to provide a large chunk of the software it makes those billions off. How about you dial that arrogance back a bit, 'kay?

04:48 Reply: Only Column Names read from table (artifact: 851bc475cd user: wyoung)

Upgrade?

2021-04-13
10:42 Reply: Only Column Names read from table (artifact: 2bc0899a1c user: wyoung)

The pragma functions were added in SQLite 3.16.0. Are you using an older version?
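
For reference, the pragma-function form looks like this (table name is a placeholder):

  SELECT name, type FROM pragma_table_info('mytable');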

07:05 Reply: how to do arithmetic calculations on real/decimal/double in sqlite (artifact: 02c8b795cd user: wyoung)

See https://sqlite.org/floatingpoint.html

Also...
