Any use case for the Bloom filter for "large analytic queries"

(1) By anonymous on 2022-01-13 20:13:56 [link] [source]

Hi,

I have noticed the 3.38.0 draft note about the query planner new ability to use a Bloom filter for "large analytic queries", and I am bit curious.

What is the target use case, and what does "large analytic queries" mean exactly ?

What kind of queries may benefit from it, and which would not ?

Or si it a bit too early in the 3.38.0 release cycle ? :)

Eric

(2) By Richard Hipp (drh) on 2022-01-13 21:30:24 in reply to 1 [link] [source]

Version 3.38.0 runs much faster on many of the queries in the Star Schema Benchmark. That said, I've yet to find a real-world use-case that is helped by the new Bloom filters. If you know of one, please share it.

The latest trunk check-in of SQLite is stable and quite usable. It is already being used, for example, in the software that runs this forum. I encourage you to download the latest trunk check-in of SQLite and use it for whatever applications you have that make use of SQLite and let us know if you encounter any problems.

(3) By ddevienne on 2022-01-14 08:20:12 in reply to 2 [link] [source]

Hi. Just curious, what's the ballpark improvement of much faster? 50%? 2x? 10x? More?

I'd also be interested in more information on how / when Bloom filtering kicks in in SQLite.
I have basic notions of Bloom filters, it's the SQLite specific uses I wonder on.

(4) By Richard Hipp (drh) on 2022-01-14 12:44:31 in reply to 3 [link] [source]

For Q2.2, the run-time on my tests went from 46.261 seconds to 3.402. That was the best improvement. 2x is more typical. Your mileage may vary.

For some table X, if the query planner sees that it is going to need to do N searches (binary look-ups) against X and N is larger than the number of rows in X, and if the query planner has some reason to believe that many of those searches will end with "not found", then it goes to to the trouble of computing the Bloom filter. Computing the Bloom filter involves doing an initial scan of the entire X table, so there is a significant set-up cost. Hence, the query planner wants to ensure that there will be some overall benefit before incurring that cost. If the query planner messes that decision up, there could be a slowdown.

Bloom filters are also deployed whenever an automatic index is created, in as much as the incremental cost of computing the Bloom filter is minimal. (The use of an "automatic index" is what most other RDBMSes would call a "hash join".)

I haven't yet run across a case where the query planner adds a Bloom filter which ends up slowing things down. But that doesn't mean no such case exists.

(5) By ddevienne on 2022-01-14 13:55:26 in reply to 4 [link] [source]

Thanks! Very interesting Richard.

(6) By ddevienne on 2022-01-14 14:15:10 in reply to 4 [link] [source]

Bloom filters are also deployed whenever an automatic index is created

Does that mean they could also be useful on regular indexes? In an opt-in manner?

(7) By Richard Hipp (drh) on 2022-01-14 14:26:20 in reply to 6 [link] [source]

Probably not, because the overhead of computing the Bloom filter would overwhelm any performance advantage gained in using it.

(9) By anonymous on 2022-01-14 14:45:45 in reply to 6 [link] [source]

CREATE TABLE t2(x,y); CREATE TABLE t1(x,y); CREATE INDEX i2 on t2(x); insert into t1(x,y) select value, hex(randomblob(8)) from generate_series(1,10000); insert into t2(x,y) select value, hex(randomblob(8)) from generate_series(1,100); analyze;

explain query plan select * from t1,t2 where t1.x = t2.x and t2.y like ?;

QUERY PLAN |--SCAN t1 |--BLOOM FILTER ON t2 (x=?) `--SEARCH t2 USING INDEX i2 (x=?)

(8) By AlexJ (CompuRoot) on 2022-01-14 14:41:51 in reply to 4 [link] [source]

I haven't yet run across a case where the query planner adds a Bloom filter which ends up slowing things down. But that doesn't mean no such case exists.

I read sometime ago ( I believe on cloudflare blog) that if Bloom filter doesn't fit in L3 CPU cache then it became slow due to overhead to accessing "slow" RAM randomly.

P.S. Found link: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

(11) By Anton Azanov (AAzanov) on 2022-01-16 07:51:37 in reply to 4 [link] [source]

I have just such a case, but I have a rather large query, it first selects from virtual tables, and then left joins the sqlite table merged 9 times with different join options. The bloom filter is calculated for the tables that have passed through 7 times. As a result, the speed drops from 22ms to 250ms.

The query text is not very important, I can't reproduce this behavior without my virtual tables, and when the query is simplified, the bloom filter is turned off and the execution speed returns to normal.

The question is the following: will it be possible to disable the bloom filter, similar to disabling the use of an index using "+".
Perhaps "*" or "++" or using a pragma?

(12.1) By giszmo on 2022-07-10 22:36:39 edited from 12.0 in reply to 4 [link] [source]

Deleted

(10) By anonymous on 2022-01-14 17:35:38 in reply to 2 [source]

Thanks!

I will give it a try.

Eric