Full bloom filter blob capacity is never used?

(1) By J. Schaefer (jschaefer) on 2023-06-11 21:34:45 [source]

Hello,

I am currently trying to understand how the bloom filter hashing works in sqlite (latest trunk).

In where.c an OP_Blob is set up with at least 80K bits of space for the bloom filter: sz = sqlite3LogEstToInt(pTab->nRowLogEst); if( sz<10000 ){ sz = 10000; // ... sqlite3VdbeAddOp2(v, OP_Blob, (int)sz, pLevel->regFilter);

So there is now a Blob with a size of at least 10000 bytes (80K bit).

Then in vdbe.c in case OP_FilterAdd the filter is filled. Here h is the result of a hashing function and now a single 1 needs to be set in the filter:

h = filterHash(aMem, pOp); // ... h %= pIn1->n; pIn1->z[h/8] |= 1<<(h&7);

I think pIn1->z is the filter-blob and pIn1->n is the size of the filter blob. In case of e.g. the filter having the minimal size of 80K bits pIn1->n is 10000 bytes. Thus after the modulo operation h can only be at max 9999. Then pIn1->z[h/8] indicates that only bytes 0 to 1249 (==9999/8) can ever be addressed, and bytes 1250 to 9999 will always stay zeroed.

The same is of course happening case OP_Filter.

If I am understanding this correctly this means for variable sized filters that 87.5% of the reserved filter space can never be used, thus raising the probability of false positives from ideally 11.75% to 63.2%?

Or have I overlooked something?

Thanks and best regards

J. Schaefer

(2.1) Originally by Dan Kennedy (dan) with edits by Larry Brasfield (larrybr) on 2023-06-12 15:42:45 from 2.0 in reply to 1 [link] [source]

You're quite right of course. Thanks for reporting this. Now fixed here.

Dan.

(3) By Spindrift (spindrift) on 2023-06-12 15:14:47 in reply to 2.0 [link] [source]

Hi Dan - I think you've accidentally linked back to this forum post rather than the check-in.