making Count() … GROUP BY faster

(1) By punkish on 2022-08-12 13:57:53 [link] [source]

is there any way I can make the following query faster? I could always periodically pre-calculate and store the values, but maybe there is a better way to just get the query to work faster than 22s

sqlite> SELECT year, Count(DISTINCT materialsCitations.materialsCitationId) AS num
   ...> FROM treatments
   ...>     JOIN materialsCitations ON materialsCitations.treatmentId = treatments.treatmentId
   ...> WHERE treatments.deleted = 0 AND materialsCitations.deleted = 0
   ...> GROUP BY 1;
QUERY PLAN
|--SCAN treatments USING INDEX ix_treatments_year
|--SEARCH materialsCitations USING INDEX ix_materialsCitations_treatmentId (deleted=? AND treatmentId=?)
`--USE TEMP B-TREE FOR count(DISTINCT)
year  num
----  -------------
2009  15723
2010  67
2011  565
2012  1048
2013  1220
2014  3358
2015  9783
2016  49968
2017  68684
2018  43529
2019  109450
2020  151704
2021  681503
2022  89462
Run Time: real 22.174 user 1.749229 sys 2.514141
sqlite> .fullschema
CREATE TABLE treatments (
    id INTEGER PRIMARY KEY,
    treatmentId TEXT NOT NULL UNIQUE,
    …
    checkinTime INTEGER DEFAULT NULL,
    deleted INTEGER DEFAULT 0,
    created INTEGER DEFAULT (strftime('%s','now') * 1000), 
    year INTEGER AS (strftime('%Y', datetime(checkinTime/1000, 'unixepoch')));
CREATE TABLE materialsCitations (
    id INTEGER PRIMARY KEY,
    materialsCitationId TEXT NOT NULL,
    treatmentId TEXT NOT NULL,
    …
    UNIQUE (materialsCitationId, treatmentId)
);
CREATE INDEX ix_treatments_treatmentId ON treatments (deleted, treatmentId);
CREATE INDEX ix_treatments_checkinTime  ON treatments (deleted, checkinTime);
CREATE INDEX ix_treatments_deleted ON treatments (deleted);
CREATE INDEX ix_treatments_year ON treatments (year, deleted);
CREATE INDEX ix_materialsCitations_materialsCitationId ON materialsCitations (deleted, materialsCitationId);
CREATE INDEX ix_materialsCitations_treatmentId ON materialsCitations (deleted, treatmentId);
CREATE INDEX ix_materialsCitations_deleted ON materialsCitations (deleted);

(2) By anonymous on 2022-08-12 19:02:27 in reply to 1 [link] [source]

One thing I noticed is that the way the index ix_treaments_year is created will potentially cause the query to jump around for every year entry, that's if there is a sizable number of deleted treatments.

One option is to switch the values deleted and year in the index such that access becomes more or less sequential.

Other than schema and query optimization I would also recommend trying to minimize IO by ensuring you have the right amount of page cache, or better yet, a big enough mmap.

(3) By Keith Medcalf (kmedcalf) on 2022-08-12 19:43:20 in reply to 1 [link] [source]

You could always run the query once and save the results in a table. Then write triggers for treatments and materialsCitations that keep it up-to-date as those tables are updated. That way the time to run the query afresh is amortized over all the insert/update/delete operations that you preform on the database, and the summary results are always available and always current.

(4) By Gunter Hick (gunter_hick) on 2022-08-12 20:13:52 in reply to 1 [link] [source]

Hint 1: Optimise the field order

Look at which fields you are referencing in your queries and indices. Move these to the top of the table definition. Try to move variable length fields to the bottom of the table definition.

Applying this based on your query, you would have id, deleted, year, checkinTime, treatmentId, ... and id, deleted, treatmentId, materialscitationId, ... respectively

Hint 2: Optimise the indices

Create a bunch of indices containing the fields referenced in your queries and in permutated orders. Check which ones the QP selects to run your queries. Time these queries against those with manually crafted plans. Use CROSS JOIN to force table order and INDEXED BY to force the indices.

You need a representative data set and to run ANALYZE beforehand, so the QP has sufficient and correct information about the "shape" of your data.

Keep just the indices that result in the fastest execution, drop the rest. Check to see if the QP selects the fastest plan without Cthe help of ROSS JOIN and INDEXED BY.

Hint 3: Consider using covering indices

Add fiels relevant to the query to the indices from Hint 2. If an index contains all the fields referenced from a table used in a query, there is no need to locate the actual row.

You are using year, deleted and treatmentId from the first table; consider adding treatmentId to the ix_treatments_year.

You are using deleted, treatmentId and materialsCitationId from the second table; consider adding materialsCitationId to the ix_materialsCitations_treatmentId.

(5.1) By Stephan (stephancb) on 2022-08-13 08:03:02 edited from 5.0 in reply to 1 [link] [source]

It looks the generated column year in table treatments is VIRTUAL, not STORED? I. e. it is calculated each time via functions datetime and strftime finally parsing the output to INTEGER.

https://www.sqlite.org/gencol.html: "If the trailing "VIRTUAL" or "STORED" keyword is omitted, then VIRTUAL is the default."

Edit: added "generated"

(6) By anonymous on 2022-08-13 12:42:46 in reply to 5.1 [link] [source]

Does it matter though? Given that the value will be already stored in the index?

(7) By Keith Medcalf (kmedcalf) on 2022-08-13 14:36:01 in reply to 6 [link] [source]

No. It will only make a difference when creating the index or an index entry. The computed value can be (and is) retrieved from the index.

(8) By punkish on 2022-08-14 08:46:40 in reply to 7 [link] [source]

To clarify, that first "No" is in response to "Would it matter though?", correct? In other words, whether the calculated column is VIRTUAL or STORED doesn't matter, because there is an index on it, and when the column is calculated, it is inserted in the index. Subsequently, its value is read from the index.

I am asking because I am not entirely sure about the above. If I am wrong, then I should indeed convert the column to STORED so it can be indexed properly in advance.

(9) By Keith Medcalf (kmedcalf) on 2022-08-14 17:35:12 in reply to 8 [source]

You are correct. Note however this only applies if you are retrieving the value from the index because you are traversing the table via the index.

If you were scanning the table using some other index (or no index) which does not contain the pre-computed column, then the column has to be computed each time it is retrieved.

VIRTUAL and STORED indicate when the value should be generated. VIRTUAL means to generate it every single time the row is retrieved, STORED means to generate the value when the row is stored, and save the value in the row. It will never be recomputed again except when the row is updated such that the data on which the generation of the value depends is changed.