SQLite User Forum

Persist query plans?

(1) By biodrool on 2022-10-14 15:55:12

I have a use case where a query is run exactly once, and I'd love to have indexes, but unfortunately there are none. Since there are none to begin with, taking the time to create them defeats the purpose when the query will only ever run once.

The first time the query runs, it takes ~15 seconds, but if I run it a second time it takes only ~500ms! That tells me the planner is taking most of the time. When I run an EXPLAIN on it, the output is about 67 rows!

The data and schema are always nearly the same; literally all I am doing is creating a new (shard) database and then dumping some of the data into it. So, is there a way to persist the same query plan so that there is zero planning overhead for each new database I create? Could I somehow "insert" the plan into the new database so it instantly knows how to plan the query? Or could I fake it by running the query with a WHERE clause that is guaranteed to eliminate all rows, then remove that predicate the next time and hope it still uses the same plan?

If none of my ideas above are really options, would the devs be interested in supporting a feature like this in the future?! It would save so much time in cases like this, where you're creating shards and need to execute a one-time query with so many joins that query planning is the bottleneck.

(2) By David Raymond (dvdraymond) on 2022-10-14 16:18:05 in reply to 1

The planner is definitely not running for 14 seconds. The speedup on subsequent runs usually comes from caching. If you want to know how long planning takes, you can always time how long the EXPLAIN command you looked at takes to run.
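A rough way to check that, sketched here in Python with the standard sqlite3 module (the file name and query are just placeholders), is to time the EXPLAIN separately from the full query:

    import sqlite3
    import time

    # Placeholder database path and query; substitute your own.
    conn = sqlite3.connect("shard.db")
    query = "SELECT * FROM t1 JOIN t2 USING (id)"

    # Timing EXPLAIN approximates how long SQLite spends preparing
    # (parsing + planning) the statement; it lists bytecode instead of
    # running the query against the tables.
    t0 = time.perf_counter()
    conn.execute("EXPLAIN " + query).fetchall()
    print("prepare/plan:", time.perf_counter() - t0, "s")

    # Timing the query itself includes planning *and* execution.
    t0 = time.perf_counter()
    conn.execute(query).fetchall()
    print("full run:    ", time.perf_counter() - t0, "s")

If the EXPLAIN comes back in milliseconds, the 14 seconds is being spent executing, not planning.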

(3) By biodrool on 2022-10-14 16:26:31 in reply to 2

Cool, I didn't know about any caching other than statement caching. Any docs on the caching you mention?

And in that case, is loading the database in memory probably my best bet for getting performance similar to the cached results?

(4) By Tim Streater (Clothears) on 2022-10-14 16:48:32 in reply to 3

I'm guessing the caching referred to is of your database, by the OS. Nothing to do with SQLite.

(5) By David Raymond (dvdraymond) on 2022-10-14 16:55:16 in reply to 3

I'm not really talking about anything special, just standard data caching. (Hopefully I'm not oversimplifying a more complex problem you're having)

SQLite keeps its own page cache, so if it needs pages it has recently accessed it can use its own memory rather than bugging the OS for them. You can adjust its size with PRAGMA cache_size.

And of course the OS has its own cache of recently accessed files so it doesn't have to ask the slow hard drive for them.

So on the first run of the query, SQLite has to ask the OS for all the pages, which in turn has to ask the hard drive for them, and that is slow. Run it again soon enough afterwards and SQLite can get most of the pages from its own memory, the OS already has the rest in its cache, and nobody has to go all the way out to the disk for anything.

Loading the whole database into memory will definitely make the query faster... but the question is whether you're going to wind up spending 20 seconds loading all the data just to save 14 seconds on a single run of a query that only needed a subset of that data.
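If you want to measure that trade-off, one option (a sketch using Python's sqlite3 module and its backup API; the path and query are placeholders) is to copy the on-disk database into an in-memory connection and time both steps:

    import sqlite3
    import time

    # Placeholder path; substitute your real shard database.
    t0 = time.perf_counter()
    disk = sqlite3.connect("shard.db")
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)  # copy every page into the in-memory database
    print("load into memory:", time.perf_counter() - t0, "s")

    t0 = time.perf_counter()
    mem.execute("SELECT count(*) FROM t1").fetchone()  # placeholder query
    print("query from memory:", time.perf_counter() - t0, "s")

If the copy takes longer than the time the query saves, the in-memory route is a net loss.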

(6) By Chris Locke (chrisjlocke1) on 2022-10-14 17:04:26 in reply to 4

SQLite does cache pages (all data is stored in pages). You can define how many pages to store, and thus, how much memory this consumes.

This is controlled via the PRAGMA cache_size command. https://www.sqlite.org/pragma.html
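Illustrative usage (the path and value here are just examples; a negative value is a size in KiB, a positive value is a page count):

    import sqlite3

    conn = sqlite3.connect("shard.db")           # placeholder path
    conn.execute("PRAGMA cache_size = -262144")  # negative = KiB, so ~256 MiB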

(7.2) By jose isaias cabrera (jicman) on 2022-10-14 18:38:30 edited from 7.1 in reply to 1

Alright, I will be the one to ask: why does it matter that it takes 15 seconds, if it's only going to be used once? Are you doing thousands of these DB creations, so that waiting 15 seconds each time adds up? Maybe you are creating the DB to be used by someone else, or there is something we are not understanding. I know you know this, but databases are created for data reuse, and this is the reason why we have indexes, etc.: to make repeated calls faster and faster. However, if you are only going to use it once and you want the creation to be even faster, that is a rather novel idea. :-)

(8) By biodrool on 2022-10-15 00:44:09 in reply to 7.2

Yup - you guessed it. We have a terrible database that's unfit for what we want to use it for. I can't change it, but I can temporarily migrate it to a different schema that lets people do what they want, then go back to the original untouched. One of those problems caused by avoiding solving the actual problem :)

So the trick is to make that migration run in a few seconds. I figured the only way to do that is to shard the data across as many databases as there are available worker threads, so that several small batches of inserts run concurrently. As it turns out, I'm incredibly lucky that the database is very shard-able, with no need to ever query more than one shard at a time. It's working well so far, and of course the other posters were right about the caching.
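In case it helps anyone else, the shape of what I'm doing is roughly this (a simplified Python sketch; the paths, schema, data, and shard-assignment rule are all made up for illustration):

    import sqlite3
    from concurrent.futures import ThreadPoolExecutor

    NUM_WORKERS = 4  # illustrative; match your available worker threads

    def load_shard(shard_id, rows):
        """Create one shard database and bulk-insert its slice of the data."""
        conn = sqlite3.connect(f"shard_{shard_id}.db")          # placeholder path
        conn.execute("CREATE TABLE t1 (id INTEGER, val TEXT)")  # placeholder schema
        conn.executemany("INSERT INTO t1 VALUES (?, ?)", rows)
        conn.commit()
        conn.close()

    # Placeholder data; in reality this comes from the original database,
    # partitioned so no query ever needs more than one shard.
    data = [(i, f"row {i}") for i in range(100_000)]
    shards = [data[i::NUM_WORKERS] for i in range(NUM_WORKERS)]

    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        futures = [pool.submit(load_shard, i, rows) for i, rows in enumerate(shards)]
        for f in futures:
            f.result()  # surface any errors from the workers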