SQLite User Forum

Poor performance with bloom filter?

(1) By EricG (EriccG) on 2022-05-23 19:22:24

Hi,

I have encountered a case where the query planner (3.38.4 Win64 DLL) chooses to use a Bloom filter, with rather poor performance.

The query that evaluates slowly looks like this

   select count(distinct addr_id)
   from vout
   join transactions t on t.id = vout.tx_id
   where t.block_id < 10000

and its EXPLAIN QUERY PLAN output is

   id	parent	notused	detail
   6	0	0	SCAN vout USING COVERING INDEX vout_addr_tx_idx
   9	0	0	BLOOM FILTER ON t (id=?)
   17	0	0	SEARCH t USING INTEGER PRIMARY KEY (rowid=?)
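
For reference, a minimal way to get that table in the CLI (a hypothetical session, with the query copied from above) is

   explain query plan
   select count(distinct addr_id)
   from vout
   join transactions t on t.id = vout.tx_id
   where t.block_id < 10000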

It runs in about 21 seconds.
Reversing the vout and transactions tables in the query text has no effect on the query planner's choice.
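
For completeness, the reversed formulation is just a plain join with the table order swapped (reconstructed here for illustration rather than copied from my session), and it still gets the same plan

   select count(distinct addr_id)
   from transactions t
   join vout on t.id = vout.tx_id
   where t.block_id < 10000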

When formulated to force t first with a cross join

   select count(distinct addr_id)
   from transactions t
   cross join vout on t.id = vout.tx_id
   where t.block_id < 10000

the query now takes just 72 msec. This variant's EXPLAIN QUERY PLAN output is

   id	parent	notused	detail
   3	0	0	USE TEMP B-TREE FOR count(DISTINCT)
   7	0	0	SEARCH t USING COVERING INDEX transactions_block_id_idx (block_id<?)
   12	0	0	SEARCH vout USING INDEX vout_tx_id_idx (tx_id=?)

which intuitively seems the more logical plan (it takes advantage of the filtering condition in the WHERE clause first).

The sqlite_stat1 rows for the relevant tables are

   tbl	idx	stat
   transactions	transactions_block_id_idx	25342624 20
   vout	vout_tx_id_idx	84748018 4
   vout	vout_addr_tx_idx	93370966 2 2

(The index names follow the convention of table name, then the indexed fields in index order.)

In terms of data, each entry in the transactions table has a few entries in the vout table.
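
As a quick sanity check of that fan-out (an illustrative query, not something from my original tests), the ratio of row counts can be computed directly

   select (select count(*) from vout) * 1.0 / (select count(*) from transactions)

and given the sqlite_stat1 row counts above (84748018 vout rows vs 25342624 transactions) it should come out to roughly 3.3.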

The database is fairly large (38 GB); I can provide it to the SQLite devs for testing if needed (just let me know how/where).

(2) By Richard Hipp (drh) on 2022-05-23 19:26:30 in reply to 1

We need to see the database schema. Including the indexes.

(3) By EricG (EriccG) on 2022-05-23 19:43:02 in reply to 2

Here is the sql from sqlite_master: https://pastebin.com/e8tifWGz

If you want more, let me know where/how I can make the database available.

Thanks!

(4.1) By Richard Hipp (drh) on 2022-05-23 19:57:49 edited from 4.0 in reply to 3

So, your complaint is that the query at the end of the following code block should not be using a Bloom filter, because doing so slows down performance rather than speeding it up?

CREATE TABLE vout (
   id INTEGER PRIMARY KEY AUTOINCREMENT,
   tx_id INTEGER NOT NULL,
   n INTEGER NOT NULL,
   addr_id INTEGER,
   amount FLOAT, script_id INTEGER,
   FOREIGN KEY(tx_id) REFERENCES transactions(id)
);
CREATE INDEX vout_tx_id_idx ON vout(tx_id);
CREATE INDEX vout_addr_tx_idx ON vout(addr_id, tx_id);
CREATE TABLE transactions (
   id INTEGER PRIMARY KEY AUTOINCREMENT,
   block_id INTEGER,
   hash BLOB UNIQUE,
   outstanding FLOAT, block_order INTEGER,
   FOREIGN KEY(block_id) REFERENCES blocks(id)
);
CREATE INDEX transactions_block_id_idx ON transactions(block_id);
-- ANALYZE creates sqlite_stat1; the INSERT below then substitutes the
-- statistics reported for the original 38 GB database, and the final
-- ANALYZE of sqlite_schema reloads those statistics into the query planner.
ANALYZE;
INSERT INTO sqlite_stat1 VALUES
  ('transactions','transactions_block_id_idx','25342624 20'),
  ('vout','vout_tx_id_idx','84748018 4'),
  ('vout','vout_addr_tx_idx','93370966 2 2');
ANALYZE sqlite_schema;
.eqp on
SELECT count(DISTINCT addr_id)
  FROM vout
  JOIN transactions AS t ON t.id = vout.tx_id
 WHERE t.block_id < 10000;
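
A sketch of how one might run this repro against a scratch database (the file name repro.sql is illustrative, not part of the original post): save the block above as repro.sql and feed it to the CLI, for example

   sqlite3 scratch.db ".read repro.sql"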

(5) By EricG (EriccG) on 2022-05-23 20:31:25 in reply to 4.1

I'm afraid that would be overstating my understanding of the query planner :) I do not know whether the query planner is picking this plan because the Bloom filter should make it more efficient, or whether it adds the Bloom filter only after deciding to use vout_addr_tx_idx, because that is the best choice in that case.

I was a bit surprised the first plan was chosen, as it does not seem to take advantage of the condition in the WHERE clause (but that may be me misunderstanding the EXPLAIN output).

Is it something to do with the limitations of the sqlite_stat1 data?

Using CROSS JOIN is just fine, but I had not seen the Bloom filter in the EXPLAIN QUERY PLAN output before this query, hence my curiosity.

(6) By Richard Hipp (drh) on 2022-05-23 21:02:08 in reply to 5

You may be right. We were about to postulate that the Bloom filters have nothing whatsoever to do with this.

Try this:

  1. Use the official SQLite CLI, not whatever third-party tool you are using now.

  2. Before you run your queries, first do: ".testctrl optimizations 0x80000". That will disable the use of Bloom filters.

  3. Rerun your performance tests. Let us know what changes.

(7) By EricG (EriccG) on 2022-05-23 21:37:17 in reply to 6

Hehe, the 3rd-party tool is the Win64 precompiled DLL downloaded from sqlite.org.

First, a run with the Bloom filter active to confirm the timing:

   SQLite version 3.38.5 2022-05-06 15:25:27
   Enter ".help" for usage hints.
   sqlite> .timer on
   sqlite> select count(distinct addr_id) from vout join transactions t on t.id = vout.tx_id where t.block_id < 10000;
   19599
   Run Time: real 18.219 user 12.671875 sys 5.531250

and a second run with the Bloom filter turned off:

   SQLite version 3.38.5 2022-05-06 15:25:27
   Enter ".help" for usage hints.
   sqlite> .timer on
   sqlite> .testctrl optimizations 0x80000
   sqlite> select count(distinct addr_id) from vout join transactions t on t.id = vout.tx_id where t.block_id < 10000;
   19599
   Run Time: real 240.033 user 71.250000 sys 167.984375
   sqlite> select count(distinct addr_id) from transactions t cross join vout on t.id = vout.tx_id where t.block_id < 10000;
   19599
   Run Time: real 0.040 user 0.031250 sys 0.015625

So the Bloom filter does indeed make the best of an unlucky index choice.
(This is on a Xeon system with an SSD and lots of RAM for cache, in case that matters.)