'No such table' returned while the database contains it

(1.1) By chouquette on 2020-07-03 13:45:03 edited from 1.0 [link] [source]

Hi,

I'm hitting an issue in my application, which basically sums up to the first request executed by a thread failing with 'No such table', while the table definitely exists in the database. It seems the connection refers to an empty database, but I can see content when browsing the database after the application crashed.

I can only reproduce this by running the same test a few times (sometimes the first one is enough) on a rather powerful machine that's usually used for our CI. Not sure if that has to do with its speed or if it's related to CPU features. For what it's worth, I ran all tests in a docker image (and can't do otherwise on that machine)

Since I'm really not sure what to look for, here's what I tried:

In order to ensure the database isn't recreated for some reasons I explicitly pass the SQLITE_OPEN_CREATE flag to the first connection, and stick to SQLITE_OPEN_READWRITE for later ones. The failing connection is never the first one, only later ones from a different thread.

I usually keep a per-thread cache of prepared requests and reuse those, but disabling this doesn't change the outcome.

Synchronization between threads is handled by the application, and tsan doesn't appear to complain. Each thread has it's own connection to the database, and write requests are executed with an exclusive lock while read requests can be processed in parallel. However even if all requests are executed with an exclusive lock, the problem still occur.

I tried rebuilding sqlite with assertions, but nothing specific to report, but the crash still happens. Same results with tsan enabled for sqlite and my application, no report of anything.

I've been able to reproduce this with sqlite 3.25.3 (manually build), today's snapshot (202007031232) and 3.32.3-1 (packaged by debian)

I'm not really sure what to do/test from here, so any help would be highly appreciated!

Thanks a lot in advance,

(2) By Keith Medcalf (kmedcalf) on 2020-07-03 15:05:02 in reply to 1.1 [link] [source]

while the table definitely exists in the database

Do you mean "before I began" or "after I finished"?

Do you mean "a table I create in connection A is not seen in connection B"?

(3) By chouquette on 2020-07-03 15:09:14 in reply to 2 [source]

Do you mean "before I began" or "after I finished"?

After I finished

Do you mean "a table I create in connection A is not seen in connection B"?

Thread A creates a connection to the database
Thread A creates some tables in the database
Thread B is started
Thread B creates another connection to the database
Thread B doesn't see the tables created by thread A

(4) By Ryan Smith (cuz) on 2020-07-03 15:15:32 in reply to 3 [link] [source]

That's Matrix-style impossible.

When you say "A creates some tables..." do you mean in a transaction (which was successfully committed) or not in a transaction?

When thread B is said to "Create" a connection, you don't mean "takes an existing connection from the connection pool", right? You mean B physically opens the file, and finds it NOT having the same schema as A just created?

Is the file accessed over a Network?

(5) By chouquette on 2020-07-03 15:22:59 in reply to 4 [link] [source]

What I meant by "Thread B creates a connection" is "calls sqlite3_open_v2 with the same path as thread A did"

Tables are indeed created in a transaction which is successfully commited once all tables have been created.

The file is stored locally.

(6) By Larry Brasfield (LarryBrasfield) on 2020-07-03 15:24:24 in reply to 3 [link] [source]

Given your summary of operations, things are not happening as they should.

Yet, those operations are ones which are fundamental, common, well tested, and have been done by many applications using SQLite for many years.

There is some, vanishingly small probability that the library is failing to adhere to the behavior documented for it.

There is a higher probability, approaching 1.000..., that one or more of your operation summaries incorrectly summarizes what is actually being done.

From what you have said so far, it is impossible for forum participants to see how your summaries fail to accurately describe the details of calls into the SQLite API.

Further progress, here, must await your elaboration, in greater detail, of what you are doing rather than what you intended to do. (I presume your summaries reflect intent.)

(7) By Ryan Smith (cuz) on 2020-07-03 20:27:43 in reply to 6 [link] [source]

Don't give away the shmoo, Larry.

You see, Chouquette, what we do here every day is play a fine game among ourselves of "who can guess the most outrageous bug" with an intricate kind of scoring system based on kudos.

We really love getting into the most obscure examples, and first prize is hitting upon that near impossible-to-guess gem before anyone else does.

We really hate if some oaf comes along and make this job trivially easy by posting, oh I don't know, actual code perhaps, or maybe input and physical output from a script. It's so frustrating when someone just opens up the guts of their problem, laying the fault bare for all to see and robbing us of that pleasure of finding out the hard way.

Never make it easy - We live for the puzzle!

##NOCODE

(8) By Keith Medcalf (kmedcalf) on 2020-07-03 21:15:05 in reply to 1.1 [link] [source]

Synchronization between threads is handled by the application, and tsan doesn't appear to complain. Each thread has it's own connection to the database, and write requests are executed with an exclusive lock while read requests can be processed in parallel. However even if all requests are executed with an exclusive lock, the problem still occur.

Exclusive Lock -- Do you mean at the application level (quite outside of SQLite3) or are you somehow referring to SQLite3's exclusive locks, which are a different thing entirely?

Are you checking every return code for errors?

Are you using some strange filesystem (any filesystem can be strange, so which one in particular would be helpful).

Are you using shared_cache or something like that not recommended for use?

What journal_mode are you using?

Have you examined whether "docker" supports POSIX advisory locks and sync operations correctly? I've never used "docker" though I understand it is all the rage with the kids these days, but as I understand it, "docker" works by adding an additional layer of "cruft" between the application and the Operating System. Do you know that this additional layer of cruftiness works properly? (Maybe someone who has used this "docker" thing can weigh in here -- and I mean actually used it -- not just read the glossy brochure).

My money would fall on either the application being in error or "docker" being broken.

(9.1) By Larry Brasfield (LarryBrasfield) on 2020-07-03 22:37:05 edited from 9.0 in reply to 1.1 [link] [source]

Despite my distaste for the Twenty Questions game in troubleshooting situations, I will play for a bit (encouraged by frequent puzzle solver "cuz".) I will point out that the length of this post is increased by the near absence of clues regarding intermediate conditions known to exist between a known, reliable starting point and the indisputable, problematic ending point. (More on that later.)

I'm hitting an issue in my application, which basically sums up to the first request executed by a thread failing with 'No such table', while the table definitely exists in the database. It seems the connection refers to an empty database, but I can see content when browsing the database after the application crashed.

Why do you believe "a thread", when issuing its "first request", has a DB connection accessing the exact same database as the one which you somehow examined to justify saying "while the table definitely exists"? (This is explored in another few posts here, so if your answer is "No database open call is attempted with more than one absolute path designating the DB file.", then you can translate my question to "Why do you believe that, beyond having intended that it be so?")

How can you simultaneously believe (to whatever degree "seems" implies) that "It seems the connection refers to an empty database" and that you "can see content when browsing the database"? Have you taken steps to resolve that conundrum, and if so what were they?

Given that the issue is in your application, what diagnostics have you caused to be emitted or logged by the application to validate your ideas as to what states exist when?

When you say the "application crashed", are you referring to an address fault and process exit, (the usual definition), or something more mundane such as a program exit upon detection of the same, 'No such table' issue?

I can only reproduce this by running the same test a few times (sometimes the first one is enough) on a rather powerful machine that's usually used for our CI.

Does that mean that, sometimes, all DB operations performed by the application succeed? Does it mean that on a different machine they always succeed? Assuming, "yes" and "yes", have you tried reducing the number of independent threads used on the super-duper machine? If so, with what effect? If not, why not?

In order to ensure the database isn't recreated for some reasons I explicitly pass the SQLITE_OPEN_CREATE flag to the first connection, and stick to SQLITE_OPEN_READWRITE for later ones. The failing connection is never the first one, only later ones from a different thread.

What ordering criteria determine what you call "first" and "later" here? If it is time, what leads you to believe the ordering you assume is in fact what happens? Have you considered or tried restricting DB operations to a single thread until the DB is known to exist? Have you considered or tried restricting connections to just one, temporarily, (and tolerating possible increased contention which likely will be much less of a problem than you imagine), while you sort out this problem?

Synchronization between threads is handled by the application, and tsan doesn't appear to complain.

My web search for "tsan" turns up such unlikely results that I cannot believe any of them are relevant. And I have not yet, in my years of software development, heard of it. Why might "tsan" complain? And what does absence of such complaint tell us? How sure are you that no such complaint occurs?

I'm not really sure what to do/test from here, so any help would be highly appreciated!

There are several things that you can and should do, all falling under the guise of traditional, proven debugging techniques. One is to validate your own notions of what is happening and what is not. Emitting or logging diagnostics is in that category. So is running your application under a debugger, breaking where the error is detected, then examining variables and (for on-disk DB issues) using an independent means of examining relevant file contents. (The SQLite CLI shell is good for this.)

At present, you seem to have little idea where things are going wrong, and so the range of possibilities is vast. With improved visibility as to what is or is not true in more limited portions of your application's operation, you will be able to reduce the extent of your focus and deal with a smaller problem. This reduction can usually be applied recursively, until the problem is so small that you can thoroughly research your related assumptions and discover which are incorrect, or (in rare cases) be sure that some code other than yours is not behaving as its documentation says it should.

Good luck, and be sure to come back with more clues if you have diligently narrowed the problem. And by "clues", I mean observable, falsifiable facts rather than suppositions and assumptions as to what you believe or wish is happening.

(10) By Ryan Smith (cuz) on 2020-07-03 23:37:34 in reply to 8 [link] [source]

My money would fall on either the application being in error or "docker" being broken.

If the OP just said "Windows" as opposed to Docker, I would have immediately jumped to the "It's the UAC virtualization that's messing with you." answer, since opening the seemingly same file and ending up with two different contents based on which process opened it (and that process' security status), is so very typical of Windows virtualization.

Could it be Docker has a similar "feature"?

Any Docker experts here?

(11.1) By Larry Brasfield (LarryBrasfield) on 2020-07-04 05:11:26 edited from 11.0 in reply to 10 [link] [source]

(Edited to be OS-agnostic, as Keith's following comment shows this should have been.)

We are told, if not directly at least by inference, that the OP's problem manifests within a single process. He uses the phrases "a thread" (contrast with "the thread"), "application", "per-thread", "different thread" and "threads", which together, on Windows or Unixen, implies that the threads are running within a single process, (because "process" and "running application" are normally used interchangeably for user-level programs, particularly those created by newer programmers.) We are not told what the programming language is, or the execution environment beyond use of Docker. At Docker doc "Images and containers", we can see that "Fundamentally, a container is nothing but a running process, with some added encapsulation features applied to it in order to keep it isolated from the host and from other containers. One of the most important aspects of container isolation is that each container interacts with its own private filesystem; this filesystem is provided by a Docker image." Hence, there would be no chance of the same DB file being accessed by multiple, independently running Docker processes. (And if the OP was so silly as to attempt to use the same DB file from more than one Docker process, his strategy of having all but one sqlite3_open call blocked from creating the DB file would have soon failed much less mysteriously, I surmise.) Given that the execution environment is a single (Docker) process, any UAC intervention (if on Windows, in the form of protected directory faking) should still reach the same file when the same path is provided (as the OP claims is done.)

Of course, as conjectured earlier in this thread, Docker could be doing almost any random thing when the OS file I/O API's are munged to redirect to the Docker per-process isolated file system. That possibility is good reason to become very sure that, in fact, the exact same file path is passed into sqlite3_open_*() for every call intended to reach the same DB file. If that were conclusively established, it would be time to closely investigate what (in Tarnation) Docker's simulacrum of the OS file I/O API is actually doing. There are tools, premature to recommend now, which could be used to convict Docker of extreme chicanery if it cannot access the same file reached via different (munged) file open API calls with the same filename parameter.

However, until such chicanery is carefully pinned upon Docker, I think it is premature to begin investigating why the containerizing tool would do such a thing, and do it inconsistently within the same running Docker image. After all, a great many applications, (including the OP's problematic one at times), manage to use the (faked) OS file I/O API without such goofiness.

Now, with that all said, it would be interesting to discover what the OP's application does when not run in the Docker environment. This is not a good substitute for traditional debugging and problem narrowing, but it could eliminate a mess of conjectural possibilities.

(I do not claim to be a Docker expert.)

(12) By Keith Medcalf (kmedcalf) on 2020-07-04 03:09:57 in reply to 11.0 [link] [source]

Why are you assuming Windows -- I don't think Windows was mentioned other than the observation that if Windows were involved the most likely culprit would be the Windows Virtualization garbage (though why anyone would not immediately disable that is beyond me -- that is its only purpose for existence -- to turn off)

Docker is a "kiddies on Linux" thing. It is far more prevalent on Linux than it is anywhere else. In fact, all the "kiddie stuff" with "kiddie cool" names is primarily designed to run on Linux by people who do not know anything about computers. (Docker, Kubernetes, Chef, Puppet, Donkey, Mule, Zipity-Doo-Dah, the whole lot of them).

As far as I have been able to tell, these things all provide no utility at all except that they promote letting idiots be in charge of the asylum and promote instability, ignorance and complication over simplicity, knowledge, and reliability.

For 99.999% of use cases that is fine, who really cares if the whatever IT system is not working for a few weeks -- it does not do anything of any actual use anyway and it is a relief when it is borked.

(13) By Larry Brasfield (LarryBrasfield) on 2020-07-04 04:58:20 in reply to 12 [link] [source]

Why are you assuming Windows -- I don't think Windows was mentioned other than the observation that if Windows were involved ...

Woops. I let that notion slip into my thinking without adequate cause. Not that it makes any difference -- the threads-in-processes model is pretty much the same across operating systems anymore.

... that is [Windows Virtualization garbage's] only purpose for existence -- to turn off

Amusing. Could be an interesting discussion, and tempting, but quite off-topic here.

Docker is a "kiddies on Linux" thing. It is far more prevalent on Linux ...

I know nothing of its popularity versus platform, but I can see the appeal of putatively isolated applications such that admins need not tailor security provisions for them. (Not that it's hard to do, in principle. But labor versus computing cost has changed immensely.)

As far as I have been able to tell, these things all provide no utility at all except that they promote letting idiots be in charge of the asylum and promote instability, ignorance and complication over simplicity, knowledge, and reliability.

That was (and is) good for an appreciative chuckle. It does look like one of those things that is more of a management sell than a technically compelling solution. And the managers are typically in charge, at least nominally. (I don't know if the low-IQ folks are the "kiddies" or their managers, if those are not the same.)

For 99.999% of use cases that is fine, who really cares if the whatever IT system is not working for a few weeks -- it does not do anything of any actual use anyway and it is a relief when it is borked.

That seems somewhat hyperbolic. Where I've seen IT systems fail, even just partially for days or less, any relief that resulted was pretty well hidden. Of course, my experience may not be representative.

(14) By Ryan Smith (cuz) on 2020-07-04 23:03:41 in reply to 12 [link] [source]

-- that is its only purpose for existence -- to turn off

A chuckle indeed.

Reminds me of the a Software review from long ago, some Windows app which the reviewer found abhorrent to a similar level as quoted above. As was typical for reviews at the time, it had to say "something" in both the "Against" and "For" boxes, to which the frustrated reviewer's only entry in the latter read: "It uninstalls well."

(15) By chouquette on 2020-07-06 10:48:36 in reply to 7 [link] [source]

Which is why I was more looking at what could help you and me understand the issue rather than a solution :)

I don't have a minimal sample, and sending you the entirety of the code probably won't help, and if I were you I'd be annoyed if someone was to send me a big chunk with no details, I totally understand this.

Is there a way to trace the sqlite API calls? AFAICS I can only the requests themselves and the connection being closed (well basically what SQLITE_TRACE allows)

(16) By chouquette on 2020-07-06 11:03:23 in reply to 8 [link] [source]

Exclusive Lock -- Do you mean at the application level (quite outside of SQLite3) or are you somehow referring to SQLite3's exclusive locks, which are a different thing entirely?

Right sorry, the wording was unclear. All locking is done outside of SQLite. What I meant by exclusive here is that instead of allowing concurrent read requests, I switched to "one request at a time" regardless of their read/write status.

Are you checking every return code for errors?

Most but not all, I'll fix that.

Are you using some strange filesystem (any filesystem can be strange, so which one in particular would be helpful).

Docker uses overlay, the underlying filesystem is ext4

Are you using shared_cache or something like that not recommended for use?

What journal_mode are you using?

I don't set any specific mode, so I suppose the default one.

Have you examined whether "docker" supports POSIX advisory locks and sync operations correctly?

I must say I haven't. That being said, it seems like SQLite is using docker in the context of OSS-fuzz.

My money would fall on either the application being in error or "docker" being broken.

Same here, I'm not saying the issue is on sqlite and wouldn't be surprised if the issue was on my end

(17) By Larry Brasfield (LarryBrasfield) on 2020-07-06 18:33:27 in reply to 16 [link] [source]

Have you examined whether "docker" supports POSIX advisory locks and sync operations correctly?

I must say I haven't. That being said, it seems like SQLite is using docker in the context of OSS-fuzz.

SQLite is relying on the filesystem as exposed by the OS API, including its locking, synchronizing and thread-safety, to behave as documented. The library is well tested for a number of implementations of that API, on many platforms. I doubt that Docker's filesystem API, as used by SQLite for its several synchronizing modes, has been tested enough to warrant any faith that it is not subtly different, in a crucial way for the multi-threaded SQLite usage scenario you have outlined. (That is why I suggested keeping SQLite operations to one thread until the database file is created. I might go even further and close the DB connection after that creation and have all threads open it the same way.)

Of all the "It's not my code's fault!" possibilities, a Docker failing of this kind is early in the descending probability list.

My money would fall on either the application being in error or "docker" being broken.

Same here, I'm not saying the issue is on sqlite and wouldn't be surprised if the issue was on my end

There are some obvious experiments that will distinguish whether Docker is failing your expectation [a] or you have some as-yet-unapparent problem in your code.

[a. Your expectation seems to be that SQLite will behave as well on the Docker filesystem as it does on the numerous OS/filesystem combinations it is tested against. I would deem that more of a hope than a reasonable expectation. ]

(18.1) By Rowan Worth (sqweek) on 2020-07-07 02:03:18 edited from 18.0 in reply to 16 [link] [source]

Docker uses overlay, the underlying filesystem is ext4

Does the DB file, or its paired -journal file exist on the host filesystem before docker spins up? And are both threads running in the same docker instance? (I think they belong to the same process but just checking)

I can only reproduce this by running the same test a few times (sometimes the first one is enough) on a rather powerful machine that's usually used for our CI. Not sure if that has to do with its speed or if it's related to CPU features. For what it's worth, I ran all tests in a docker image (and can't do otherwise on that machine)

Can you reproduce the issue if you run the same docker image on a different machine?

Does the app perform further DB operations (in any thread(s)) after hitting the error? Either explicitly or as part of the runtime's cleanup.

Wrapping your app with strace -f -e trace=file,read,write,fsync,fcntl,fcntl64 might be informative.

(19) By Keith Medcalf (kmedcalf) on 2020-07-07 20:11:29 in reply to 16 [link] [source]

Have you examined whether "docker" supports POSIX advisory locks and sync operations correctly?

I must say I haven't. That being said, it seems like SQLite is using docker in the context of OSS-fuzz.

And this OSS-Fuzz thing is multithreaded / multiple connections to the same database file, or just a single connection to a single database file?

I would point out that SQLite3 works perfectly with no locking whatsoever in a "single connection to a single database" scenario even in the face of defective hardware (ie, that do not handle sync operations reliably) and filesystems (that do not support locking reliably) -- so comparisons of "multithreaded apples with mutilple stems" to "single-threaded apples with one stem" are not like comparisons and no inference can be drawn from the latter with respect to the former that involve "multiple stems" or "multiple cores" in the same apple.

Of course, if OSS-Fuzz did do testing of multi-threaded access by multiple connections to the same on-disk database that was hosted on an "overlay" filesystem, then it might be a valid comparison. Of course, according to the OSS-Fuzz docs there can be no comparable comparison since OSS-Fuzz does not allow access to "files" on the overlay filesystem (only /tmp) and the documentation says nothing about whether it tests multiple access via either multiple threads or multiple connection access to the same shared database. The documentation only speaks to single thread access using a single connection to a non-multiple-access database.