Poor performance with bloom filter?
(1) By EricG (EriccG) on 2022-05-23 19:22:24 [link] [source]
Hi, I have encountered a case where the query planner (3.38.4 win64 DLL) chooses to use a Bloom filter, with rather poor performance. The query that evaluates slowly is:

```sql
select count(distinct addr_id)
from vout
join transactions t on t.id = vout.tx_id
where t.block_id < 10000
```

and explains to:

```
id  parent  notused  detail
6   0       0        SCAN vout USING COVERING INDEX vout_addr_tx_idx
9   0       0        BLOOM FILTER ON t (id=?)
17  0       0        SEARCH t USING INTEGER PRIMARY KEY (rowid=?)
```

It runs in about 21 seconds. Reversing the vout & transactions tables in the query has no effect on the query planner.

When formulated with a cross join to force t first:

```sql
select count(distinct addr_id)
from transactions t
cross join vout on t.id = vout.tx_id
where t.block_id < 10000
```

the query now takes just 72 msec. This variant explains to:

```
id  parent  notused  detail
3   0       0        USE TEMP B-TREE FOR count(DISTINCT)
7   0       0        SEARCH t USING COVERING INDEX transactions_block_id_idx (block_id<?)
12  0       0        SEARCH vout USING INDEX vout_tx_id_idx (tx_id=?)
```

which intellectually seems more logical (taking advantage of the filtering condition first).

The sqlite_stat1 for the relevant tables contains:

```
tbl           idx                        stat
transactions  transactions_block_id_idx  25342624 20
vout          vout_tx_id_idx             84748018 4
vout          vout_addr_tx_idx           93370966 2 2
```

(the names follow the convention of table, then indexed fields in order of index)

In terms of data, each entry in the transactions table has a few entries in the vout table. The database is fairly large (38 GB); I can provide it on request to the SQLite devs for testing if needed (just let me know how/where).
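The join-order workaround in the post above can be sketched with Python's built-in sqlite3 module. The toy schema and row counts below are invented for illustration (a few hundred rows, not 25M), so the planner's choices will not match those on the 38 GB database; the point is only that the two query forms are semantically identical and that CROSS JOIN pins the first-named table as the outer loop.

```python
import sqlite3

# Toy reproduction of the schema from the post; data volumes are tiny,
# so the plan chosen here need not match the one reported upthread.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE transactions (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  block_id INTEGER
);
CREATE INDEX transactions_block_id_idx ON transactions(block_id);
CREATE TABLE vout (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  tx_id INTEGER NOT NULL,
  addr_id INTEGER
);
CREATE INDEX vout_tx_id_idx ON vout(tx_id);
CREATE INDEX vout_addr_tx_idx ON vout(addr_id, tx_id);
""")
for i in range(1, 101):
    con.execute("INSERT INTO transactions(block_id) VALUES (?)", (i % 10,))
    con.execute("INSERT INTO vout(tx_id, addr_id) VALUES (?, ?)", (i, i % 7))
con.commit()

slow_form = """SELECT count(DISTINCT addr_id) FROM vout
               JOIN transactions t ON t.id = vout.tx_id
               WHERE t.block_id < 5"""
# CROSS JOIN is SQLite's documented way to pin the join order:
# the table named first always becomes the outer loop.
fast_form = """SELECT count(DISTINCT addr_id) FROM transactions t
               CROSS JOIN vout ON t.id = vout.tx_id
               WHERE t.block_id < 5"""

# Both forms return the same answer; only the plan may differ.
print(con.execute(slow_form).fetchone()[0])
print(con.execute(fast_form).fetchone()[0])
for row in con.execute("EXPLAIN QUERY PLAN " + fast_form):
    print(row)
```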
(2) By Richard Hipp (drh) on 2022-05-23 19:26:30 in reply to 1 [source]
We need to see the database schema. Including the indexes.
(3) By EricG (EriccG) on 2022-05-23 19:43:02 in reply to 2 [link] [source]
Here is the sql from sqlite_master: https://pastebin.com/e8tifWGz
If you want more, let me know where/how I can make the database available.
Thanks!
(4.1) By Richard Hipp (drh) on 2022-05-23 19:57:49 edited from 4.0 in reply to 3 [link] [source]
So, your complaint is that the query at the end of the following code block should not be using a Bloom filter, because it slows down performance rather than speeding it up?
```sql
CREATE TABLE vout (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  tx_id INTEGER NOT NULL,
  n INTEGER NOT NULL,
  addr_id INTEGER,
  amount FLOAT,
  script_id INTEGER,
  FOREIGN KEY(tx_id) REFERENCES transactions(id)
);
CREATE INDEX vout_tx_id_idx ON vout(tx_id);
CREATE INDEX vout_addr_tx_idx ON vout(addr_id, tx_id);

CREATE TABLE transactions (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  block_id INTEGER,
  hash BLOB UNIQUE,
  outstanding FLOAT,
  block_order INTEGER,
  FOREIGN KEY(block_id) REFERENCES blocks(id)
);
CREATE INDEX transactions_block_id_idx ON transactions(block_id);

ANALYZE;
INSERT INTO sqlite_stat1 VALUES
  ('transactions','transactions_block_id_idx','25342624 20'),
  ('vout','vout_tx_id_idx','84748018 4'),
  ('vout','vout_addr_tx_idx','93370966 2 2');
ANALYZE sqlite_schema;

.eqp on
SELECT count(DISTINCT addr_id)
  FROM vout
  JOIN transactions AS t ON t.id = vout.tx_id
 WHERE t.block_id < 10000;
```
(5) By EricG (EriccG) on 2022-05-23 20:31:25 in reply to 4.1 [link] [source]
I'm afraid that would be overstating my understanding of the query planner :) I do not know if the query planner is picking this plan because the Bloom filter should make it more efficient, or if it's using the Bloom filter after deciding to use the vout_addr_tx_idx and that is the best choice in that case. I was a bit surprised the first plan was chosen, as it does not seem to take advantage of the condition in the "where" (but that may be me misunderstanding the "explain" output). Is it something to do with the limitations of sqlite_stat1 data ? Using cross join is just fine, but I had not yet seen the Bloom filter in the explainer before that query, hence my curiosity.
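On the sqlite_stat1 question raised above: the stat column is a list of integers, the first being the total row count of the index and each following one the average number of rows per distinct prefix of the first N indexed columns. The sketch below is a back-of-the-envelope reading of the figures quoted in the thread, not the planner's actual arithmetic (which also weighs range-scan fractions and page costs), but it suggests why driving the join from transactions_block_id_idx is so cheap.

```python
def parse_stat1(stat: str):
    """Split a sqlite_stat1 'stat' value into its integer fields.

    Field 0 is the total number of rows in the index; field N (N >= 1)
    is the average number of rows sharing the first N indexed columns.
    """
    return [int(tok) for tok in stat.split()]

# Figures quoted in the thread for transactions_block_id_idx.
rows, per_block = parse_stat1("25342624 20")

# ~20 rows per distinct block_id means "block_id < 10000" matches
# roughly 10000 * 20 = 200000 transactions -- under 1% of the
# 25M-row table, so scanning it first prunes almost everything.
est = 10000 * per_block
print(rows, per_block, est)
```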
(6) By Richard Hipp (drh) on 2022-05-23 21:02:08 in reply to 5 [link] [source]
You may be right. We were about to postulate that the Bloom filters have nothing whatsoever to do with this.
Try this:
1. Use the official SQLite CLI, not whatever third-party tool you are using now.
2. Before you run your queries, first do: `.testctrl optimizations 0x80000`. That will disable the use of Bloom filters.
3. Rerun your performance tests.
4. Let us know what changes.
(7) By EricG (EriccG) on 2022-05-23 21:37:17 in reply to 6 [link] [source]
Hehe, the 3rd party tool is the Win64 precompiled DLL downloaded from sqlite.org.

First run, with Bloom filter active, to confirm the timing:

```
SQLite version 3.38.5 2022-05-06 15:25:27
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> select count(distinct addr_id) from vout join transactions t on t.id = vout.tx_id where t.block_id < 10000;
19599
Run Time: real 18.219 user 12.671875 sys 5.531250
```

and second run, with Bloom filter turned off:

```
SQLite version 3.38.5 2022-05-06 15:25:27
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> .testctrl optimizations 0x80000
sqlite> select count(distinct addr_id) from vout join transactions t on t.id = vout.tx_id where t.block_id < 10000;
19599
Run Time: real 240.033 user 71.250000 sys 167.984375
sqlite> select count(distinct addr_id) from transactions t cross join vout on t.id = vout.tx_id where t.block_id < 10000;
19599
Run Time: real 0.040 user 0.031250 sys 0.015625
```

So the Bloom filter does indeed make the best of an unlucky index choice. (This is on a Xeon system with SSD and lots of RAM for cache, in case that matters.)
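As a closing illustration of why the filter rescued the unlucky plan above: a Bloom filter is a small in-memory bit array that can answer "definitely not present" without touching the B-tree, so most vout rows whose tx_id fails the block_id condition are rejected from RAM instead of via a rowid SEARCH. The sketch below is a generic toy, not SQLite's actual implementation (which sizes the filter from the planner's row estimate and hashes record keys internally); the sizes and hash choice here are arbitrary.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash probes.

    Guarantees no false negatives; false positives occur at a rate
    governed by the filter size and the number of members.
    """
    def __init__(self, nbits: int = 1 << 16, k: int = 3):
        self.nbits = nbits
        self.k = k
        self.bits = bytearray(nbits // 8)

    def _probes(self, key: int):
        # Derive k bit positions from one 16-byte digest of the key.
        digest = hashlib.blake2b(str(key).encode(), digest_size=16).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.nbits

    def add(self, key: int):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: int) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))

# Stand-in for "rowids of transactions with block_id < 10000".
matching_ids = set(range(0, 2000, 2))   # 1000 members
bf = BloomFilter()
for rowid in matching_ids:
    bf.add(rowid)

# While scanning vout, only tx_ids that pass the filter would pay
# for a real B-tree SEARCH; the rest are rejected in memory.
false_pos = sum(bf.might_contain(i) for i in range(1, 2001, 2))
print("false positives among 1000 non-members:", false_pos)
```

With 1000 members in a 64 Kbit filter, the false-positive rate is tiny, so almost every non-matching probe is answered from the bitmap rather than from disk.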