Performance Regression Using Views from 3.38.5 -> 3.39.0
(1) By Jon (gocoder) on 2022-09-19 15:11:00 [source]
We've been using views in many queries and have noted a substantial performance regression from 3.38.5 to 3.39.0 (and up). I will post the following for reference:
- The schema of the tables and views involved.
- The query in question.
- The query planner and timer output from 3.38.5
- The query planner and timer output from 3.39.0
- Workaround
1. Schema of tables and views
CREATE TABLE macro_transactions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
created DATETIME NOT NULL DEFAULT(datetime('now')),
completed DATETIME,
type INTEGER NOT NULL,
tag TEXT,
source_id INTEGER,
heartbeat DATETIME,
checkpoint_id INTEGER,
guid CHAR(36) NOT NULL
);
CREATE TABLE pypi_packages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
cname TEXT NOT NULL,
project_name TEXT NOT NULL,
source_id INTEGER NOT NULL,
summary TEXT
);
CREATE TABLE pypi_packages_macro_association (
id INTEGER PRIMARY KEY AUTOINCREMENT,
package_id INTEGER NOT NULL,
version TEXT NOT NULL,
json_path TEXT NOT NULL,
tran_create INTEGER NOT NULL,
tran_delete INTEGER,
FOREIGN KEY (package_id) REFERENCES pypi_packages(id),
FOREIGN KEY (tran_create) REFERENCES macro_transactions(id),
FOREIGN KEY (tran_delete) REFERENCES macro_transactions(id)
);
CREATE INDEX pypi_packages_source_id_cname ON pypi_packages(source_id, cname);
CREATE INDEX pypi_packages_source_id_project_name ON pypi_packages(source_id, project_name);
CREATE INDEX pypi_packages_macro_association_package_id on pypi_packages_macro_association(package_id);
CREATE VIEW committed_macro_transactions AS
SELECT m.*
FROM macro_transactions m
WHERE m.completed IS NOT NULL;
CREATE VIEW committed_pypi_packages_macro_association AS
SELECT pa.*,
rc.id AS committed_tran_create,
rd.id AS committed_tran_delete
FROM pypi_packages_macro_association pa
-- Eliminates any newly created packages that have not been committed
INNER JOIN committed_macro_transactions rc
ON rc.id = pa.tran_create
-- Helps determine if the package has been deleted
LEFT JOIN committed_macro_transactions rd
ON rd.id = pa.tran_delete;
2. The query in question.
SELECT p.id, p.source_id, p.cname, p.project_name,
pa.version, pa.json_path, p.summary
FROM pypi_packages p
INNER JOIN committed_pypi_packages_macro_association pa
ON pa.package_id = p.id
WHERE p.cname='numpy'
AND p.source_id=2
AND pa.committed_tran_create <= 369
AND (pa.committed_tran_delete is null
OR pa.committed_tran_delete > 369
);
3. The query planner and timer output from 3.38.5
sqlite> explain query plan select p.id, p.source_id, p.cname, p.project_name, pa.version, pa.json_path, p.summary from pypi_packages p inner join committed_pypi_packages_macro_association pa on pa.package_id = p.id where p.cname='numpy' and p.source_id=2 and pa.committed_tran_create <= 369 and (pa.committed_tran_delete is null or pa.committed_tran_delete > 369);
QUERY PLAN
|--SEARCH p USING INDEX pypi_packages_source_id_cname (source_id=? AND cname=?)
|--SEARCH pa USING INDEX pypi_packages_macro_association_package_id (package_id=?)
|--SEARCH m USING INTEGER PRIMARY KEY (rowid=?)
`--SEARCH m USING INTEGER PRIMARY KEY (rowid=?)
sqlite> .timer on
sqlite> select p.id, p.source_id, p.cname, p.project_name, pa.version, pa.json_path, p.summary from pypi_packages p inner join committed_pypi_packages_macro_association pa on pa.package_id = p.id where p.cname='numpy' and p.source_id=2 and pa.committed_tran_create <= 369 and (pa.committed_tran_delete is null or pa.committed_tran_delete > 369);
14207|2|numpy|numpy|1.23.3|numpy/1662989427.json|NumPy is the fundamental package for array computing with Python.
Run Time: real 0.000 user 0.000451 sys 0.000000
4. The query planner and timer output from 3.39.0
sqlite> explain query plan select p.id, p.source_id, p.cname, p.project_name, pa.version, pa.json_path, p.summary from pypi_packages p inner join committed_pypi_packages_macro_association pa on pa.package_id = p.id where p.cname='numpy' and p.source_id=2 and pa.committed_tran_create <= 369 and (pa.committed_tran_delete is null or pa.committed_tran_delete > 369);
QUERY PLAN
|--MATERIALIZE committed_pypi_packages_macro_association
| |--SCAN pa
| |--SEARCH m USING INTEGER PRIMARY KEY (rowid=?)
| `--SEARCH m USING INTEGER PRIMARY KEY (rowid=?) LEFT-JOIN
|--SEARCH p USING INDEX pypi_packages_source_id_cname (source_id=? AND cname=?)
`--SEARCH pa USING AUTOMATIC PARTIAL COVERING INDEX (package_id=?)
sqlite> .timer on
sqlite> select p.id, p.source_id, p.cname, p.project_name, pa.version, pa.json_path, p.summary from pypi_packages p inner join committed_pypi_packages_macro_association pa on pa.package_id = p.id where p.cname='numpy' and p.source_id=2 and pa.committed_tran_create <= 369 and (pa.committed_tran_delete is null or pa.committed_tran_delete > 369);
14207|2|numpy|numpy|1.23.3|numpy/1662989427.json|NumPy is the fundamental package for array computing with Python.
Run Time: real 0.890 user 0.648108 sys 0.241639
Workaround
We've been able to work around the performance regression by eliminating the views altogether in favor of JOINs. I tried replacing the views with CTEs to no avail. Example:
SELECT p.id, p.source_id, p.cname, p.project_name,
pa.version, pa.json_path, p.summary
FROM pypi_packages p
INNER JOIN pypi_packages_macro_association pa
ON pa.package_id = p.id
INNER JOIN macro_transactions rc
ON rc.id = pa.tran_create
AND rc.completed is not null
LEFT JOIN macro_transactions rd
ON rd.id = pa.tran_delete
AND rd.completed is not null
WHERE p.cname='numpy' AND p.source_id=2
AND pa.tran_create <= 369
AND (pa.tran_delete is null OR pa.tran_delete > 369);
This modification performs well and has very similar query planner output to the 3.38.5 output from the original query. However, eliminating the view usage from ALL of our application queries will be difficult and cause considerable complication.
Is there any way to force SQLite to use the original behavior?
Thanks!
(2) By Chris Locke (chrisjlocke1) on 2022-09-19 16:07:13 in reply to 1 [link] [source]
Not that it really matters (I don't think) but just a data point that the macro_transactions table uses a datetime and char datatype that aren't official types.
(3) By Jon (gocoder) on 2022-09-19 16:09:42 in reply to 2 [link] [source]
Thanks for pointing that out. I don't think that's related to the issue, but it's definitely an optimization we could look at.
(4) By Richard Hipp (drh) on 2022-09-19 17:35:58 in reply to 1 [link] [source]
On quick work-around is to move the ON clause into the WHERE clause:
SELECT p.id, p.source_id, p.cname, p.project_name, pa.version, pa.json_path, p.summary FROM pypi_packages p INNER JOIN committed_pypi_packages_macro_association pa WHERE pa.package_id = p.id AND p.cname='numpy' AND p.source_id=2 AND pa.committed_tran_create <= 369 AND (pa.committed_tran_delete is null OR pa.committed_tran_delete > 369);
(5) By Jon (gocoder) on 2022-09-19 17:50:24 in reply to 4 [link] [source]
Thanks. That's a big improvement over our workaround. :)
I assume this behavior changed with "Defer materializing views" in 3.39.0? Should we consider this workaround to be the correct and permanent solution, or is this something that is likely to change/be fixed in a future release?
(6) By Richard Hipp (drh) on 2022-09-19 18:21:28 in reply to 5 [link] [source]
The problem arises due to the added support for RIGHT JOIN. Due to changes in how queries are analyzed, I had to add a restriction to the query optimizer that prevents the view from being "flattened" into the outer query if there there is an ON clause on the join of the view.
This is query flattener restriction 29b. By moving the constraint from an ON clause into the WHERE clause (which makes no difference for your particular query) that restriction is bypassed.
I'm trying to figure out a way to relax that restriction automatically for cases, such as yours, where it is not required.
(7) By Richard Hipp (drh) on 2022-09-19 20:21:54 in reply to 1 [link] [source]
Simplified test script (submitted here so that it will become part of the historical record):
.echo on .mode qbox CREATE TABLE t1(a INTEGER PRIMARY KEY, b INT, c INT); CREATE TABLE t2(d INTEGER PRIMARY KEY, e INT); CREATE VIEW t3(a,b,c,d,e) AS SELECT * FROM t1 LEFT JOIN t2 ON d=c; CREATE TABLE t4(x INT, y INT); INSERT INTO t1 VALUES(1,2,3); INSERT INTO t4 VALUES(1,4); .eqp on SELECT a, b, y FROM t4 JOIN t3 ON a=x;
The correct answer is always returned, though the query plan is different in 3.39.0 versus 3.38.5. Version 3.38.5 gives:
QUERY PLAN |--SCAN t4 `--SEARCH t1 USING INTEGER PRIMARY KEY (rowid=?)
Version 3.39.0 yields:
QUERY PLAN |--MATERIALIZE t3 | |--SCAN t1 | `--SEARCH t2 USING INTEGER PRIMARY KEY (rowid=?) LEFT-JOIN |--SCAN t4 `--SEARCH t3 USING AUTOMATIC COVERING INDEX (a=?)
The 3.38.5 plan is clearly better.
(8) By Richard Hipp (drh) on 2022-09-21 09:59:50 in reply to 1 [link] [source]
This issue should now be fixed by the latest trunk check-in. Please try out the new code and confirm whether or not the problem has been resolved.
(9.1) By Jon (gocoder) on 2022-09-21 13:47:12 edited from 9.0 in reply to 8 [link] [source]
I pulled the latest source code, built, and tried the query again. I can confirm that the issue is fixed! Thanks!
.version SQLite 3.40.0 2022-09-21 09:54:41 501609eddf2a46d51e8cd56477a22d4df142a166e78fe4322c0db11d1eed6687 zlib version 1.2.11 gcc-11.2.0
.timer on SELECT p.id, p.source_id, p.cname, p.project_name, pa.version, pa.json_path, p.summary FROM pypi_packages p INNER JOIN committed_pypi_packages_macro_association pa ON pa.package_id = p.id WHERE p.cname='numpy' AND p.source_id=2 AND pa.committed_tran_create <= 369 AND (pa.committed_tran_delete is null OR pa.committed_tran_delete > 369 ); 78496|2|numpy|numpy|1.23.3|numpy/1662989427.json|NumPy is the fundamental package for array computing with Python. Run Time: real 0.000 user 0.000482 sys 0.000000
Query plan:
QUERY PLAN |--SEARCH p USING INDEX pypi_packages_source_id_cname (source_id=? AND cname=?) |--SEARCH pa USING INDEX pypi_packages_macro_association_package_id (package_id=?) |--SEARCH m USING INTEGER PRIMARY KEY (rowid=?) `--SEARCH m USING INTEGER PRIMARY KEY (rowid=?) LEFT-JOIN