SQLite Forum

Surprising (to me) query performance
Login

Surprising (to me) query performance for combined group-by/count using CTE and unions

(1.2) By Simon Willison (simonw) on 2021-11-17 06:49:09 edited from 1.1 [source]

My application Datasette has a feature called "faceting", where I display a table with several group-by-count queries to allow people to filter the results.

Here's an example page: https://covid-19.datasettes.com/covid/ny_times_us_counties?state=Kentucky

That page executes the following five queries:

  • select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' order by date desc limit 101 - 78ms
  • select count(*) from ny_times_us_counties where state = 'Kentucky' - 7ms
  • select state as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where state is not null group by state order by count desc, value limit 31 - 12ms
  • select county as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where county is not null group by county order by count desc, value limit 31 - 50ms
  • select fips as value, count(*) as count from ( select rowid, date, county, state, fips, cases, deaths from ny_times_us_counties where state = 'Kentucky' ) where fips is not null group by fips order by count desc, value limit 31 - 52ms

Total for all 5 queries: 199ms

I'm always looking for ways to speed up these kinds of queries, since my application runs them a lot (suggestions very welcome).

Today I had a bright idea: what if I combined all of the above into a single query using a CTE for the initial selection? Could this give me a speed boost by helping SQLite avoid creating the same filtered table multiple times as part of each query?

Here's the SQL query I came up with:

with filtered as (
  select rowid, date, county, state, fips, cases, deaths
  from ny_times_us_counties where state = 'Kentucky'
),
rows as (
  select null as facet, null as favet_name, null as facet_value,
  rowid, date, county, state, fips, cases, deaths
  from filtered order by date desc limit 101
),
count as (
  select 'COUNT' as facet, null as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered
),
facet_state as (
  select 'state' as facet, state as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
),
facet_county as (
  select 'county' as facet, county as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
),
facet_fips as (
  select 'fips' as facet, fips as facet_name, count(*) as facet_value,
  null, null, null, null, null, null, null
  from filtered group by facet_name order by facet_value desc limit 31
)
select * from rows
union all
select * from count
union all
select * from facet_state
union all
select * from facet_county
union all
select * from facet_fips

You can try that query here - the clever (I thought) idea here is to use union all to execute all five queries in one do, and hopefully have the query planner take advantage of and reuse the CTE.

This query takes 200ms - and in some cases I've seen it take significantly longer than the 5 queries added up!

I was expecting my clever huge CTE/Union query to beat the separate queries, and instead it's consistently losing to them.

I'd love to understand why, mainly out of curiosity but also to help me figure out if there's an optimization here that I'm missing.

You can read my full research notes on this here: https://github.com/simonw/datasette/issues/1513

(2.1) By Simon Willison (simonw) on 2021-11-17 22:34:01 edited from 2.0 in reply to 1.2 [link] [source]

In case it's relevant (given recent improvements involving materialized CTEs) I ran all of these tests against SQLite 3.34.1.

Oh I think this is relevant! 3.35 release notes include "The default behavior was formerly NOT MATERIALIZED, but is now changed to MATERIALIZED for CTEs that are used more than once."