Full bloom filter blob capacity is never used?
(1) By J. Schaefer (jschaefer) on 2023-06-11 21:34:45 [source]
Hello,
I am currently trying to understand how the bloom filter hashing works in sqlite (latest trunk).
In where.c an OP_Blob
is set up with at least 80K bits of space for the bloom filter:
sz = sqlite3LogEstToInt(pTab->nRowLogEst);
if( sz<10000 ){
sz = 10000;
// ...
sqlite3VdbeAddOp2(v, OP_Blob, (int)sz, pLevel->regFilter);
So there is now a Blob with a size of at least 10000 bytes (80K bit).
Then in vdbe.c in case OP_FilterAdd
the filter is filled.
Here h
is the result of a hashing function and now a single 1 needs to be set in the filter:
h = filterHash(aMem, pOp);
// ...
h %= pIn1->n;
pIn1->z[h/8] |= 1<<(h&7);
I think pIn1->z
is the filter-blob and pIn1->n
is the size of the filter blob.
In case of e.g. the filter having the minimal size of 80K bits pIn1->n
is 10000 bytes.
Thus after the modulo operation h
can only be at max 9999.
Then pIn1->z[h/8]
indicates that only bytes 0 to 1249 (==9999/8) can ever be addressed, and bytes 1250 to 9999 will always stay zeroed.
The same is of course happening case OP_Filter
.
If I am understanding this correctly this means for variable sized filters that 87.5% of the reserved filter space can never be used, thus raising the probability of false positives from ideally 11.75% to 63.2%?
Or have I overlooked something?
Thanks and best regards
J. Schaefer
(2.1) Originally by Dan Kennedy (dan) with edits by Larry Brasfield (larrybr) on 2023-06-12 15:42:45 from 2.0 in reply to 1 [link] [source]
You're quite right of course. Thanks for reporting this. Now fixed here.
Dan.
(3) By Spindrift (spindrift) on 2023-06-12 15:14:47 in reply to 2.0 [link] [source]
Hi Dan - I think you've accidentally linked back to this forum post rather than the check-in.