SQLite Forum

50 most recent forum posts by user simonw

2022-01-19
05:00 Reply: Proposed JSON enhancements. (artifact: d61880668d user: simonw)

I have to admit I've never been a big fan of PostgreSQL's -> and ->> JSON operator design - I find them hard to remember, so I usually opt to use the function equivalents instead. That's a matter of personal preference though.
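
Roughly what I mean, sketched against the shirts example quoted below (assuming details is a jsonb column; jsonb_extract_path_text is the function form I tend to reach for):

-- operator form vs. the function equivalent
SELECT details -> 'attributes' ->> 'color'                     AS color_via_operators,
       jsonb_extract_path_text(details, 'attributes', 'color') AS color_via_function
FROM shirts;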

PostgreSQL 14 introduced new syntax which, as a Python and JavaScript programmer, I really like. It's described in https://blog.crunchydata.com/blog/better-json-in-postgres-with-postgresql-14 which gives the following example:

SELECT *
FROM shirts 
WHERE details['attributes']['color'] = '"neon yellow"'
  AND details['attributes']['size'] = '"medium"'

I would be thrilled to see that syntax added to SQLite!

04:51 Edit reply: Help me understand the path syntax used by json_extract() (artifact: 229c4e2996 user: simonw)

I had somehow missed the section of the documentation that describes the syntax! It's this bit here: https://sqlite.org/json1.html#path_arguments

I always hop down to the documentation for the individual functions, such as this section for json_extract(): https://sqlite.org/json1.html#the_json_extract_function - I think because those are linked to prominently at the top of the json1.html documentation page.

Two suggestions then: first, it would be great if the documentation for each of the individual JSON functions that accept a path linked back to that definition of the PATH rules.

Secondly, I suggest adding a paragraph that explains how double quotes work - my attempt at that is the middle paragraphs here (otherwise quoting the existing description):

For functions that accept PATH arguments, that PATH must be well-formed or else the function will throw an error. A well-formed PATH is a text value that begins with exactly one '$' character followed by zero or more instances of ".objectlabel" or "[arrayindex]".

The object label can be wrapped in double quotes, which is required if the key that it is selecting contains period or square bracket characters - for example $."key.with.periods".

If a key contains a double quote it can be accessed using a backslash escape, for example $."key with \" double quote".

The arrayindex is usually a non-negative integer N. In that case, the array element selected is the N-th element of the array, starting with zero on the left. The arrayindex can also be of the form "#-N" in which case the element selected is the N-th from the right. The last element of the array is "#-1". Think of the "#" characters as the "number of elements in the array". Then the expression "#-1" evaluates to the integer that corresponds to the last entry in the array.
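
To make that concrete, a quick illustrative example combining the quoted-key and "#-N" forms (the JSON value here is made up):

select json_extract('{"key.with.periods": [10, 20, 30]}', '$."key.with.periods"[#-1]');
-- returns 30, the last element of the array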

Feature suggestion for the path implementation itself: It would be neat if you could use escape sequences inside double quotes, so you could use the following:

$."this key.has a period \"and a double quote"
04:48 Edit reply: Help me understand the path syntax used by json_extract() (artifact: 9d69f5f9bf user: simonw)

I had somehow missed the section of the documentation that describes the syntax! It's this bit here: https://sqlite.org/json1.html#path_arguments

I always hop down to the documentation for the individual functions, such as this section for json_extract(): https://sqlite.org/json1.html#the_json_extract_function - I think because those are linked to prominently at the top of the json1.html documentation page.

Two suggestions then: first, it would be great if the documentation for each of the individual JSON functions that accept a path linked back to that definition of the PATH rules.

Secondly, I suggest adding a paragraph that explains how double quotes work - my attempt at that is the middle paragraphs here (otherwise quoting the existing description):

For functions that accept PATH arguments, that PATH must be well-formed or else the function will throw an error. A well-formed PATH is a text value that begins with exactly one '$' character followed by zero or more instances of ".objectlabel" or "[arrayindex]".

The object label can be wrapped in double quotes, which is required if the key that it is selecting contains period or square bracket characters - for example $."key.with.periods".

If a key contains a double quote it can be accessed using a backslash escape, for example $."key with \" double quote".

The arrayindex is usually a non-negative integer N. In that case, the array element selected is the N-th element of the array, starting with zero on the left. The arrayindex can also be of the form "#-N" in which case the element selected is the N-th from the right. The last element of the array is "#-1". Think of the "#" characters as the "number of elements in the array". Then the expression "#-1" evaluates to the integer that corresponds to the last entry in the array.

04:47 Edit reply: Help me understand the path syntax used by json_extract() (artifact: 88a4bfa6bb user: simonw)

I had somehow missed the section of the documentation that describes the syntax! It's this bit here: https://sqlite.org/json1.html#path_arguments

I always hop down to the documentation for the individual functions, such as this section for json_extract(): https://sqlite.org/json1.html#the_json_extract_function - I think because those are linked to prominently at the top of the json1.html documentation page.

Two suggestions then: first, it would be great if the documentation for each of the individual JSON functions that accept a path linked back to that definition of the PATH rules.

Secondly, I suggest adding a paragraph that explains how double quotes work - my attempt at that is the middle paragraph here (otherwise quoting the existing description):

For functions that accept PATH arguments, that PATH must be well-formed or else the function will throw an error. A well-formed PATH is a text value that begins with exactly one '$' character followed by zero or more instances of ".objectlabel" or "[arrayindex]".

The object label can be wrapped in double quotes, which is required if the key that it is selecting contains period or square bracket characters - for example $."key.with.periods".

The arrayindex is usually a non-negative integer N. In that case, the array element selected is the N-th element of the array, starting with zero on the left. The arrayindex can also be of the form "#-N" in which case the element selected is the N-th from the right. The last element of the array is "#-1". Think of the "#" characters as the "number of elements in the array". Then the expression "#-1" evaluates to the integer that corresponds to the last entry in the array.

01:43 Reply: Help me understand the path syntax used by json_extract() (artifact: 47a5d39e25 user: simonw)

I had somehow missed the section of the documentation that describes the syntax! It's this bit here: https://sqlite.org/json1.html#path_arguments

I always hop down to the documentation for the individual functions, such as this section for json_extract(): https://sqlite.org/json1.html#the_json_extract_function - I think because those are linked to prominently at the top of the json1.html documentation page.

Two suggestions then: first, it would be great if the documentation for each of the individual JSON functions that accept a path linked back to that definition of the PATH rules.

Secondly, I suggest adding a paragraph that explains how double quotes work - my attempt at that is the middle paragraph here (otherwise quoting the existing description):

For functions that accept PATH arguments, that PATH must be well-formed or else the function will throw an error. A well-formed PATH is a text value that begins with exactly one '$' character followed by zero or more instances of ".objectlabel" or "[arrayindex]".

The object label can be wrapped in double quotes, which is required if the key that it is selecting contains period or square bracket characters.

The arrayindex is usually a non-negative integer N. In that case, the array element selected is the N-th element of the array, starting with zero on the left. The arrayindex can also be of the form "#-N" in which case the element selected is the N-th from the right. The last element of the array is "#-1". Think of the "#" characters as the "number of elements in the array". Then the expression "#-1" evaluates to the integer that corresponds to the last entry in the array.

2022-01-18
23:05 Reply: Help me understand the path syntax used by json_extract() (artifact: 70ecb114bd user: simonw)

Thanks to tips in this thread I was able to put together some detailed notes on the currently supported syntax here: https://til.simonwillison.net/sqlite/json-extract-path

22:10 Edit reply: Help me understand the path syntax used by json_extract() (artifact: e1d3b6a054 user: simonw)

From reading the C code it looks like both of these should have the same effect:

$.c
$."c"

Presumably the double quote variant is there to support things like the following:

$."key.with.dots.in.it"
It looks to me like there's no mechanism for accessing keys that themselves contain a double quote - so the following JSON:
{
  "key\"with a double quote": 5
}

Could not have the 5 extracted using json_extract().

22:08 Reply: Help me understand the path syntax used by json_extract() (artifact: 82e6fbe333 user: simonw)

From reading the C code it looks like both of these should have the same effect:

$.c
$."c"

Presumably the double quote variant is there to support things like the following:

$."3" # To avoid confusion with $[3] which is array access
$."key.with.dots.in.it"
It looks to me like there's no mechanism for accessing keys that themselves contain a double quote - so the following JSON:
{
  "key\"with a double quote": 5
}

Could not have the 5 extracted using json_extract().

22:00 Edit reply: Help me understand the path syntax used by json_extract() (artifact: 73eb78cb7b user: simonw)

Is that a SQLite-specific extension? I didn't spot that in the MySQL JSON path language.

Found the commit that added that feature in 2019 here: https://sqlite.org/src/info/35ed68a651f - and the tests are https://sqlite.org/src/file?name=test/json105.test

21:59 Edit reply: Help me understand the path syntax used by json_extract() (artifact: f30aa4de13 user: simonw)

Is that a SQLite-specific extension? I didn't spot that in the MySQL JSON path language.

Found the commit that added that feature in 2019 here: https://sqlite.org/src/info/35ed68a651f

21:55 Edit reply: Help me understand the path syntax used by json_extract() (artifact: 6ac60b5d68 user: simonw)

I found some documentation of the equivalent syntax used by MySQL: https://dev.mysql.com/doc/refman/8.0/en/json.html#json-path-syntax

And then more documentation on an advanced form of that syntax here (with things like support for range expressions): https://dev.mysql.com/doc/refman/8.0/en/json.html#json-paths

It looks like the C function that implements path lookups in SQLite is static JsonNode *jsonLookup( in https://www3.sqlite.org/src/file?name=ext/misc/json.c

21:53 Reply: Help me understand the path syntax used by json_extract() (artifact: 2396578f77 user: simonw)

Is that a SQLite-specific extension? I didn't spot that in the MySQL JSON path language.

21:52 Reply: Help me understand the path syntax used by json_extract() (artifact: 92ebc874ef user: simonw)

I found some documentation of the equivalent syntax used by MySQL: https://dev.mysql.com/doc/refman/8.0/en/json.html#json-path-syntax

And then more documentation on an advanced form of that syntax here (with things like support for range expressions): https://dev.mysql.com/doc/refman/8.0/en/json.html#json-paths

21:43 Post: Help me understand the path syntax used by json_extract() (artifact: 451d7f8079 user: simonw)

The path syntax for the json_extract() function is currently documented using examples on https://www.sqlite.org/json1.html#jex

json_extract('{"a":2,"c":[4,5,{"f":7}]}', '$') → '{"a":2,"c":[4,5,{"f":7}]}'
json_extract('{"a":2,"c":[4,5,{"f":7}]}', '$.c') → '[4,5,{"f":7}]'
json_extract('{"a":2,"c":[4,5,{"f":7}]}', '$.c[2]') → '{"f":7}'
json_extract('{"a":2,"c":[4,5,{"f":7}]}', '$.c[2].f') → 7
json_extract('{"a":2,"c":[4,5],"f":7}','$.c','$.a') → '[[4,5],2]'
json_extract('{"a":2,"c":[4,5],"f":7}','$.c[#-1]') → 5

I can follow these up to a point, but the $.c[#-1] one isn't clear to me.

Is there documentation for this syntax somewhere? Alternatively, if people can help me understand it (and maybe point me to some C code) I'd be happy to help contribute improvements to the existing documentation.

2021-11-17
22:34 Edit reply: Surprising (to me) query performance for combined group-by/count using CTE and unions (artifact: 5161c8475c user: simonw)

In case it's relevant (given recent improvements involving materialized CTEs) I ran all of these tests against SQLite 3.34.1.

Oh I think this is relevant! 3.35 release notes include "The default behavior was formerly NOT MATERIALIZED, but is now changed to MATERIALIZED for CTEs that are used more than once."
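
For reference, 3.35 also lets you spell the hint out explicitly, so either behaviour can be forced regardless of the default - something like this, a sketch against the filtered CTE from the queries further down this thread:

with filtered as materialized (
  select rowid, date, county, state, fips, cases, deaths
  from ny_times_us_counties where state = 'Kentucky'
)
select count(*) from filtered;
-- or "as not materialized" to keep the old behaviour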

22:32 Reply: Surprising (to me) query performance for combined group-by/count using CTE and unions (artifact: c0e0fcbe36 user: simonw)

In case it's relevant (given recent improvements involving materialized CTEs) I ran all of these tests against SQLite 3.34.1.

06:49 Edit: Surprising (to me) query performance for combined group-by/count using CTE and unions (artifact: b2f6aa2c6e user: simonw)

My application Datasette has a feature called "faceting", where I display a table with several group-by-count queries to allow people to filter the results.

Here's an example page: https://covid-19.datasettes.com/covid/ny_times_us_counties?state=Kentucky

That page executes the following five queries:

  • select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' order by date desc limit 101 - 78ms
  • select count(*) from ny_times_us_counties where state = 'Kentucky' - 7ms
  • select state as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where state is not null group by state order by count desc, value limit 31 - 12ms
  • select county as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where county is not null group by county order by count desc, value limit 31 - 50ms
  • select fips as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where fips is not null group by fips order by count desc, value limit 31 - 52ms

Total for all 5 queries: 199ms

I'm always looking for ways to speed up these kinds of queries, since my application runs them a lot (suggestions very welcome).

Today I had a bright idea: what if I combined all of the above into a single query using a CTE for the initial selection? Could this give me a speed boost by helping SQLite avoid creating the same filtered table multiple times as part of each query?

Here's the SQL query I came up with:

with filtered as (
  select rowid, date, county, state, fips, cases, deaths
  from ny_times_us_counties where state = 'Kentucky'
),
rows as (
  select null as facet, null as facet_name, null as facet_value,
  rowid, date, county, state, fips, cases, deaths
  from filtered order by date desc limit 101
),
count as (
  select 'COUNT' as facet, null as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered
),
facet_state as (
  select 'state' as facet, state as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
),
facet_county as (
  select 'county' as facet, county as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
),
facet_fips as (
  select 'fips' as facet, fips as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
)
select * from rows
union all
select * from count
union all
select * from facet_state
union all
select * from facet_county
union all
select * from facet_fips

You can try that query here - the clever (I thought) idea here is to use union all to execute all five queries in one go, and hopefully have the query planner take advantage of and reuse the CTE.

This query takes 200ms - and in some cases I've seen it take significantly longer than the 5 queries added up!

I was expecting my clever huge CTE/Union query to beat the separate queries, and instead it's consistently losing to them.

I'd love to understand why, mainly out of curiosity but also to help me figure out if there's an optimization here that I'm missing.

You can read my full research notes on this here: https://github.com/simonw/datasette/issues/1513

06:47 Edit: Surprising (to me) query performance for combined group-by/count using CTE and unions (artifact: ef97a78bbd user: simonw)

My application Datasette has a feature called "faceting", where I display a table with several group-by-count queries to allow people to filter the results.

Here's an example page: https://covid-19.datasettes.com/covid/ny_times_us_counties?state=Kentucky

That page executes the following five queries:

  • select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' order by date desc limit 101 - 78ms
  • select count(*) from ny_times_us_counties where state = 'Kentucky' - 7ms
  • select state as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where state is not null group by state order by count desc, value limit 31 - 12ms
  • select county as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where county is not null group by county order by count desc, value limit 31 - 50ms
  • select fips as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where fips is not null group by fips order by count desc, value limit 31 - 52ms

Total for all 5 queries: 199ms

I'm always looking for ways to speed up these kinds of queries, since my application runs them a lot (suggestions very welcome).

Today I had a bright idea: what if I combined all of the above into a single query using a CTE for the initial selection? Could this give me a speed boost by helping SQLite avoid creating the same filtered table multiple times as part of each query?

Here's the SQL query I came up with:

with filtered as (
  select rowid, date, county, state, fips, cases, deaths
  from ny_times_us_counties where state = 'Kentucky'
),
rows as (
  select null as facet, null as facet_name, null as facet_value,
  rowid, date, county, state, fips, cases, deaths
  from filtered order by date desc limit 101
),
count as (
  select 'COUNT' as facet, null as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered
),
facet_state as (
  select 'state' as facet, state as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
),
facet_county as (
  select 'county' as facet, county as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
),
facet_fips as (
  select 'fips' as facet, fips as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
)
select * from rows
union all
select * from count
union all
select * from facet_state
union all
select * from facet_county
union all
select * from facet_fips

You can try that query here - the clever (I thought) idea here is to use union all to execute all five queries in one go, and hopefully have the query planner take advantage of and reuse the CTE.

This query takes 200ms - and in some cases I've seen it take significantly longer than the 5 queries added up!

I was expecting my clever huge CTE/Union query to beat the separate queries, and instead it's consistently losing to them.

I'd love to understand why, mainly out of curiosity but also to help me figure out if there's an optimization here that I'm missing.

06:46 Post: Surprising (to me) query performance for combined group-by/count using CTE and unions (artifact: 4e44458d15 user: simonw)

My application Datasette has a feature called "faceting", where I display a table with several group-by-count queries to allow people to filter the results.

Here's an example page: https://covid-19.datasettes.com/covid/ny_times_us_counties?state=Kentucky

That page executes the following five queries:

  • select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' order by date desc limit 101 - 78ms
  • select count(*) from ny_times_us_counties where state = 'Kentucky' - 7ms
  • select state as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where state is not null group by state order by count desc, value limit 31 - 12ms
  • select county as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where county is not null group by county order by count desc, value limit 31 - 50ms
  • select fips as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where fips is not null group by fips order by count desc, value limit 31 - 52ms

Total for all 5 queries: 199ms

I'm always looking for ways to speed up these kinds of queries, since my application runs them a lot (suggestions very welcome).

Today I had a bright idea: what if I combined all of the above into a single query using a CTE for the initial selection? Could this give me a speed boost by helping SQLite avoid creating the same filtered table multiple times as part of each query?

Here's the SQL query I came up with:

with filtered as (
  select rowid, date, county, state, fips, cases, deaths
  from ny_times_us_counties where state = 'Kentucky'
),
rows as (
  select null as facet, null as facet_name, null as facet_value,
  rowid, date, county, state, fips, cases, deaths
  from filtered order by date desc limit 101
),
count as (
  select 'COUNT' as facet, null as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered
),
facet_state as (
  select 'state' as facet, state as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
),
facet_county as (
  select 'county' as facet, county as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
),
facet_fips as (
  select 'fips' as facet, fips as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
)
select * from rows
union all
select * from count
union all
select * from facet_state
union all
select * from facet_county
union all
select * from facet_fips

You can try that query here - the clever (I thought) idea here is to use union all to execute all five queries in one go, and hopefully have the query planner take advantage of and reuse the CTE.

This query takes 200ms - and in some cases I've seen it take significantly longer than the 5 queries added up!

I was expecting my clever huge CTE/Union query to beat the separate queries, and instead it's consistently losing to them.

I'd love to understand why, mainly out of curiosity but also to help me figure out if there's an optimization here that I'm missing.

2021-08-07
19:42 Reply: Old Nabble SQLite mailing list site down? (artifact: 29f3f048be user: simonw)

I always find those kinds of conversations useful, provided they are clearly presented in context. The deeper my relationship with a piece of software gets the more I find myself running into questions that can best be answered by conversations that were had many years previously, against much older versions of the software.

I'd love to see the old mailing list content available online in an archive that includes a prominent "Warning: this conversation from 2009 relates to a 12-year-old version of the software" notice, rather than leaving that content entirely unavailable.

2021-07-20
00:00 Reply: FTS5 external content tables (artifact: b8927c6476 user: simonw)

For such a small collection of data my hunch is that it won't matter either way. I recommend implementing both patterns and running some micro-benchmarks to see for yourself though - doing that is always a useful exercise for this kind of performance question.

2021-07-19
23:56 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: efe43d3f6c user: simonw)

Thanks very much, this is really useful.

23:54 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 2eebabbbb1 user: simonw)

Do you have a blog post on how you've managed this?

Mainly it's just "use Cloud Run" - to use it you need to package up your (stateless) web application in a Docker container which Cloud Run will then start running (and charging for) when requests come in and switch back off when the traffic stops.

I made some notes on how to use it here: https://til.simonwillison.net/cloudrun/ship-dockerfile-to-cloud-run

I usually deploy to it with my datasette publish cloudrun tool though, which is described here: https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run

23:54 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 6c5ac6ef78 user: simonw)

Do you have a blog post on how you've managed this?

Mainly it's just "use Cloud Run" - to use it you need to package up your (stateless) web application in a Docker container which Cloud Run will then start running (and charging for) when requests come in and switch back off when the traffic stops.

I made some notes on how to use it here: https://til.simonwillison.net/cloudrun/ship-dockerfile-to-cloud-run

I usually deploy to it with my "datasette publish cloudrun" tool though, which is described here: https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run

02:32 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 76a9badc6c user: simonw)

Cloud Run stays "warm" in between requests, so in practice I see a few seconds delay the first time I visit one of my applications but then everything loads quickly (without the cold start) for subsequent requests.

This is a great fit for my stuff that works on top of SQLite, since I'm building interfaces for people to explore a database - so they're likely to continue running different queries against the same tables for a period of several minutes, ideally while those stay in RAM.

On "serverless" - completely agree, it's a terrible name. The definition I've chosen to embrace for it is "scale to zero", which means I don't have to spend any money on hosting my applications unless they are actively serving traffic. Since I have over 100 side-projects running at any given time this characteristic is extremely useful!

2021-07-18
03:20 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 55f54b7927 user: simonw)

I want arbitrary SQL SELECT queries against a ~600MB database file to take as short a time as possible (in ms) to execute. My hope is that more RAM = more data queried in memory rather than being read from disk = faster executing queries.

03:18 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 59afc95a3d user: simonw)

This is really useful information, thanks!

I'm confident I'm using immutable and ro correctly in this case - I use ro if there might be other processes updating the SQL database file (often cron scripts). I use immutable when I can absolutely guarantee nothing will update the database - often that's because I'm hosting it on Google Cloud Run which doesn't even provide a persistent writable file-system.
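
For anyone curious, the URI forms of those two flags look something like this (the filename is illustrative; URI filenames need to be enabled in the host application - the Python sqlite3 module does that via connect(..., uri=True)):

ATTACH DATABASE 'file:fixtures.db?mode=ro' AS fixtures_ro;       -- other processes may still write the file
ATTACH DATABASE 'file:fixtures.db?immutable=1' AS fixtures_frozen;  -- promise that nothing will ever change the file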

03:17 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 79061399b1 user: simonw)

This is really useful information, thanks!

I'm confident I'm using immutable and ro correctly in this case - I use ro if there might be other processes updating the SQL database file (often cron scripts). I use immutable when I can absolutely guarantee nothing will update the database - often that's because I'm hosting it on Google Cloud Run which doesn't even provide a writable file-system.

03:15 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 4393a42b3b user: simonw)

I want arbitrary SQL queries against a ~600MB database file to take as short a time as possible (in ms) to execute. My hope is that more RAM = more data queried in memory rather than being read from disk = faster executing queries.

2021-07-17
17:44 Reply: Should SQLite be able to optimize this query? (artifact: ff57c45af9 user: simonw)

This is fantastic. Really like that set of rules for when this applies. Thank you for this!

17:35 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 618a2de207 user: simonw)

Thanks, I'll take a look at mmap too: https://www.sqlite.org/pragma.html#pragma_mmap_size and https://www.sqlite.org/mmap.html

Interestingly the Python sqlite3 library (which is written in C) releases the GIL when executing queries, which makes me hopeful that SQLite in Python can take advantage of multiple cores even within a single Python process. I haven't figured out a good way to confirm that this is happening yet though!
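
In case it's useful to anyone else reading along, the memory-map setting is itself just a pragma (the size here is arbitrary):

PRAGMA mmap_size = 268435456;   -- request up to 256MB of memory-mapped I/O for this connection
PRAGMA mmap_size;               -- read back the limit actually in effect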

17:33 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 64ba53e03c user: simonw)

Thanks, I'll take a look at mmap too: https://www.sqlite.org/pragma.html#pragma_mmap_size

Interestingly the Python sqlite3 library (which is written in C) releases the GIL when executing queries, which makes me hopeful that SQLite in Python can take advantage of multiple cores even within a single Python process. I haven't figured out a good way to confirm that this is happening yet though!

17:33 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 2a1e5cc6a7 user: simonw)

Thanks, I'll take a look at mmap too: https://www.sqlite.org/pragma.html#pragma_mmap_size

Interestingly the Python sqlite3 library (which is written in C) releases the GIL when executing queries, which makes me hopeful that SQLite in Python can take advantage of multiple cores even within a single Python process. I haven't figured out a good way to confirm that this is happening yet though!

2021-07-16
18:41 Edit: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 9a6b9ae8e2 user: simonw)

I'm interested in getting better performance out of SQLite on machines with more RAM, using the Python sqlite3 module from the standard library.

(Tips for more CPUs would be useful too - I'm interested in understanding what the benefits of paying for larger EC2 instances are, in addition to maximizing performance against currently available memory.)

I've seen PRAGMA cache_size in the documentation - is this the main lever I should be looking at or are there other options that might help here?

My current plan is to set up a simple load test that exercises my Python + SQLite code across a number of different queries, then modify the cache_size and see what impact that has on requests/second and request durations.

I'd thoroughly appreciate any advice on ways to scale SQLite up like this.

Most of my workloads are read-only (I open the SQLite file with the ?mode=ro or ?immutable=1 flags) so I'm mainly interested in improving performance of SELECTs against read-only databases.
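
The cache_size knob itself is easy to experiment with, so this is roughly what I plan to vary in the load test (the size is arbitrary; a negative value is interpreted as KiB rather than pages):

PRAGMA cache_size = -1048576;   -- ask for roughly 1GB of page cache on this connection
PRAGMA cache_size;              -- confirm the value currently in effect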

18:40 Edit: Tips for getting SQLite under Python to perform better with more RAM? (artifact: f6bdd6b503 user: simonw)

I'm interested in getting better performance out of SQLite on machines with more RAM, using the Python sqlite3 module from the standard library.

I've seen PRAGMA cache_size in the documentation - is this the main lever I should be looking at or are there other options that might help here?

My current plan is to set up a simple load test that exercises my Python + SQLite code across a number of different queries, then modify the cache_size and see what impact that has on requests/second and request durations.

I'd thoroughly appreciate any advice on ways to scale SQLite up like this.

Most of my workloads are read-only (I open the SQLite file with the ?mode=ro or ?immutable=1 flags) so I'm mainly interested in improving performance of SELECTs against read-only databases.

18:26 Reply: Should SQLite be able to optimize this query? (artifact: faf982aced user: simonw)

As a really safe alternative, how about if the ignore-order-by optimization only took place for a specific list of aggregate functions (or even just count)?

18:23 Post: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 9e15765874 user: simonw)

I'm interested in getting better performance out of SQLite on machines with more RAM, using the Python sqlite3 module from the standard library.

I've seen PRAGMA cache_size in the documentation - is this the main lever I should be looking at or are there other options that might help here?

My current plan is to set up a simple load test that exercises my Python + SQLite code across a number of different queries, then modify the cache_size and see what impact that has on requests/second and request durations.

I'd thoroughly appreciate any advice on ways to scale SQLite up like this.

01:50 Reply: Should SQLite be able to optimize this query? (artifact: 6ecf3cab23 user: simonw)

I would definitely expect the order by on the view there to be respected, whether or not that's in the SQL standard.

2021-07-15
19:45 Reply: Should SQLite be able to optimize this query? (artifact: 8ada395a25 user: simonw)

Breaking that group_concat() example certainly looks like an unacceptable change - I wouldn't be at all surprised to see real-world code that relies on that.

In my particular case queries that I was running extremely frequently (often several times for a single HTTP page load) got nearly a 10x speed-up - but I've now changed my code to remove the inner order by (see https://github.com/simonw/datasette/issues/1394) so my project would no longer benefit from this optimization.

17:52 Reply: Should SQLite be able to optimize this query? (artifact: 0212439555 user: simonw)

I'm glad to have posed something that turns out to be an interesting problem!

16:40 Reply: Should SQLite be able to optimize this query? (artifact: 5d14ed56db user: simonw)

explain query plan select country_long, count(*)
from (
  select * from [global-power-plants] order by rowid
)
where country_long is not null
group by country_long
order by count(*) desc

Output:

id,parent,notused,detail
2,0,0,CO-ROUTINE 1
5,2,0,SCAN TABLE global-power-plants
52,0,0,SCAN SUBQUERY 1
57,0,0,USE TEMP B-TREE FOR GROUP BY
92,0,0,USE TEMP B-TREE FOR ORDER BY

And for the query without the order by rowid:

explain query plan select country_long, count(*)
from (
  select * from [global-power-plants]
)
where country_long is not null
group by country_long
order by count(*) desc

Output:

id,parent,notused,detail
8,0,0,"SCAN TABLE global-power-plants USING INDEX ""global-power-plants_country_long"""
40,0,0,USE TEMP B-TREE FOR ORDER BY

16:32 Reply: Should SQLite be able to optimize this query? (artifact: 466c8fec3b user: simonw)

It doesn't look like that speeds things up - I would expect SQLite to spot that optimization already.

16:31 Reply: Should SQLite be able to optimize this query? (artifact: 91c56240d2 user: simonw)

CREATE TABLE "global-power-plants" (
  "country" TEXT,
  "country_long" TEXT,
  "name" TEXT,
  "gppd_idnr" TEXT,
  "capacity_mw" REAL,
  "latitude" REAL,
  "longitude" REAL,
  "primary_fuel" TEXT,
  "other_fuel1" TEXT,
  "other_fuel2" TEXT,
  "other_fuel3" TEXT,
  "commissioning_year" REAL,
  "owner" TEXT,
  "source" TEXT,
  "url" TEXT,
  "geolocation_source" TEXT,
  "wepp_id" TEXT,
  "year_of_capacity_data" INTEGER,
  "generation_gwh_2013" REAL,
  "generation_gwh_2014" REAL,
  "generation_gwh_2015" REAL,
  "generation_gwh_2016" REAL,
  "generation_gwh_2017" REAL,
  "generation_data_source" TEXT,
  "estimated_generation_gwh" REAL
);
CREATE INDEX ["global-power-plants_country_long"] ON [global-power-plants]("country_long");
CREATE INDEX ["global-power-plants_owner"] ON [global-power-plants]("owner");

You can download the full database from https://global-power-plants.datasettes.com/global-power-plants.db (11.1 MB)

16:07 Post: Should SQLite be able to optimize this query? (artifact: 2d76f2bcf6 user: simonw)

I spotted something interesting about this query. The following executes in about 47ms (against a table containing 33,000 rows):

select country_long, count(*)
from (
  select * from [global-power-plants] order by rowid
)
where country_long is not null
group by country_long
order by count(*) desc

Note that there's an order by rowid in that inner query which is unnecessary - the outer group by query doesn't need to care about the order of the rows in that inner query.

That same query again, without the unnecessary order by rowid:

select country_long, count(*)
from (
  select * from [global-power-plants]
)
where country_long is not null
group by country_long
order by count(*) desc

This executes in 7ms against the same table, returning the same results.

Would it be possible for SQLite to notice that the inner order by is unnecessary and ignore it when executing this kind of query?

(The reason I'm using a nested query here is that my software is designed to be able to calculate "facet counts" against the results of any arbitrary SQL query, which this pattern lets me do.)

2021-04-11
15:05 Reply: deploying 1_000_000 sqlitedb on AWS? (artifact: e529ede831 user: simonw)

Have you investigated EFS at all for this?

2021-04-05
03:39 Edit reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: e20f3d0861 user: simonw)

I have enough of a proof of concept implementation in https://github.com/simonw/datasette/issues/1293#issuecomment-813162622 that I'm pretty confident I can get this to work. I'm ready to take my chances with regards to stability - this feature is a nice-to-have, so if future changes break it I can work around that.

Thanks for the help!

03:39 Reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 5b6f7e7ee5 user: simonw)

I have enough of a proof of concept implementation in https://github.com/simonw/datasette/issues/1293#issuecomment-813162622 that I'm pretty confident I can get this to work. I'm ready to take my chances with regards to stability - this feature is a nice-to-have, so if future changes break it I can work around that.

Thanks for the help!

03:38 Reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 04033829a0 user: simonw)

For those cases I'm fine with it not being able to identify the column that is being used in the expression. I only want to be able to detect columns that are used as-is, without being part of a formula.

01:17 Reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 2258454563 user: simonw)

That's really useful, thanks! It looks like it might be possible for me to reconstruct where each column came from using the explain select output.

Here's a complex example: https://calands.datasettes.com/calands?sql=explain+select%0D%0A++AsGeoJSON%28geometry%29%2C+*%0D%0Afrom%0D%0A++CPAD_2020a_SuperUnits%0D%0Awhere%0D%0A++PARK_NAME+like+%27%25mini%25%27+and%0D%0A++Intersects%28GeomFromGeoJSON%28%3Afreedraw%29%2C+geometry%29+%3D+1%0D%0A++and+CPAD_2020a_SuperUnits.rowid+in+%28%0D%0A++++select%0D%0A++++++rowid%0D%0A++++from%0D%0A++++++SpatialIndex%0D%0A++++where%0D%0A++++++f_table_name+%3D+%27CPAD_2020a_SuperUnits%27%0D%0A++++++and+search_frame+%3D+GeomFromGeoJSON%28%3Afreedraw%29%0D%0A++%29&freedraw=%7B%22type%22%3A%22MultiPolygon%22%2C%22coordinates%22%3A%5B%5B%5B%5B-122.42202758789064%2C37.82280243352759%5D%2C%5B-122.39868164062501%2C37.823887203271454%5D%2C%5B-122.38220214843751%2C37.81846319511331%5D%2C%5B-122.35061645507814%2C37.77071473849611%5D%2C%5B-122.34924316406251%2C37.74465712069939%5D%2C%5B-122.37258911132814%2C37.703380457832374%5D%2C%5B-122.39044189453125%2C37.690340943717715%5D%2C%5B-122.41241455078126%2C37.680559803205135%5D%2C%5B-122.44262695312501%2C37.67295135774715%5D%2C%5B-122.47283935546876%2C37.67295135774715%5D%2C%5B-122.52502441406251%2C37.68382032669382%5D%2C%5B-122.53463745117189%2C37.6892542140253%5D%2C%5B-122.54699707031251%2C37.690340943717715%5D%2C%5B-122.55798339843751%2C37.72945260537781%5D%2C%5B-122.54287719726564%2C37.77831314799672%5D%2C%5B-122.49893188476564%2C37.81303878836991%5D%2C%5B-122.46185302734376%2C37.82822612280363%5D%2C%5B-122.42889404296876%2C37.82822612280363%5D%2C%5B-122.42202758789064%2C37.82280243352759%5D%5D%5D%5D%7D

It looks like the opcodes I need to inspect are OpenRead, Column and ResultRow.

OpenRead tells me which tables are being opened - the p2 value (in this case 51) corresponds to the rootpage column in sqlite_master here: https://calands.datasettes.com/calands?sql=select+*+from+sqlite_master - the table gets assigned to the cursor number in p1.

The Column opcodes tell me which columns are being read - p1 is that table reference, and p2 is the cid of the column within that table.

The ResultRow opcode then tells me which values end up in the results. 15 16 means the row is built from 16 registers starting at register 15.

I think this might work!
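
Rough sketch of the pieces I'm planning to combine to do that mapping, using the table from the query above (the cid and rootpage values come from the schema, so they can be cross-referenced against the explain output):

explain select PARK_NAME, geometry from CPAD_2020a_SuperUnits;
pragma table_info(CPAD_2020a_SuperUnits);                        -- cid → column name
select name, rootpage from sqlite_master where type = 'table';   -- rootpage → table name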

2021-04-04
23:40 Edit: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 81976ebc35 user: simonw)

Given any SQL query - such as:

select a.*, b.* from a join b on a.b_id = b.id

I would like to know the columns that would be returned by the query. Not just their names, but also which table those columns live in.

For the above example this might be:

a.id
a.name
a.b_id
b.id
b.name

But you could imagine this getting significantly more complex for joins across many different tables.

I want this so my software, Datasette, can display columns from specific tables in different ways, even when those columns are returned as part of an arbitrary query.

My biggest constraint is that I am using the sqlite3 module from the Python standard library. It looks like this family of C functions: https://www.sqlite.org/c3ref/column_database_name.html can solve my problem, but those are not exposed in the Python library I am using.

I wrote a bit more about my research into this here: https://github.com/simonw/datasette/issues/1293

So three questions:

  1. Is there a mechanism for this that I might be missing?

  2. Would it be feasible to achieve this using a custom SQLite extension? One that exposes the column_database_name() C API functions in a way that I can use them to answer my question? I was thinking something like this:

    select database, table, column from analyze_query(
        'select foo, bar from baz'
    )
    
  3. Feature request: could SQLite itself add some kind of SQL-level mechanism for accessing this information, such that even if the Python sqlite3 module doesn't add support this information would become available via a PRAGMA or similar?
