Approximate COUNT(*) using sqlite_stat4

(1) By antimatter15 on 2021-01-31 00:00:43 [link] [source]

People are starting to use SQLite as an open data publication format with projects such as Datasette.

A lot of summary exploration of data involves running COUNT on various fields. However, SQLite always does a full table scan to resolve a COUNT— which can end up being very expensive and slow to run.

SQLite already has the ability to run ANALYZE and store statistics about the various tables and indices in sqlite_stat4 and sqlite_stat1. In the data publishing situation, it's reasonable to expect that ANALYZE has been run and all the gathered statistics are up-to-date.

My question is whether nor not there are any plans (or any strong objections to) to addding a PRAGMA approximate_count that would make COUNT return estimates from the sqlite_stat tables instead of running a full table scan.

Additionally, I'd appreciate pointers for how to get started if one were trying to implement this feature themselves.

(2.1) By curmudgeon on 2021-01-31 16:40:10 edited from 2.0 in reply to 1 [link] [source]

I'm sure there's some way of getting an approximate row count but I cannot remember what it is despite having spent the best part of an hour searching for a post I made on the very subject on the old nabble forum. I'm guessing it's something to do with the DBSTAT virtual table. Something to do with counting leafs and using a sample to approximate the average number of rows on a leaf.

Hopefully someone with a better memory will step in because I'd like to make use of it myself.

Edit: Not sure if this is related

https://sqlite.org/srcx/timeline?r=est_count_pragma&c=2016-12-16+15%3A57%3A40

(3) By anonymous on 2021-01-31 18:48:33 in reply to 2.1 [link] [source]

Do you mean this one from Keith Medcalf?

http://sqlite.1065341.n5.nabble.com/Row-length-in-SQLITE-tp110579p110627.html

(4) By antimatter15 on 2021-01-31 20:26:25 in reply to 2.1 [link] [source]

If you're just interested in the equivalent of SELECT COUNT(*) FROM table1 after ANALYZE was run, this should do the trick select cast(substr(stat, 0, instr(stat, ' ')) as integer) as "COUNT(*)" from sqlite_stat1 where tbl = 'table1' limit 1

Looking a bit more into how sqlite_stat4 and sqlite_stat1 work, I'm beginning to think that they can't be used to approximate the count of any more sophisticated queries.

I think it would still be helpful to have some PRAGMA that would automatically interpret such COUNT(*) expressions as the corresponding sqlite_stat1 lookup.

(5) By curmudgeon on 2021-02-01 10:05:43 in reply to 3 [link] [source]

No, it wasn't that one. Having went through all my posts on that forum I'm starting to think I'm havering. I remain convinced however that I either saw a post from drh or an explanation in the documentation about getting an approximate count.

I think it was something to do with counting the lower level nodes and estimating the number of nodes leading from each of them via a sample. To be honest I've little knowledge of b-trees so I don't even know if that makes sense.

(6) By Richard Hipp (drh) on 2021-02-01 13:09:47 in reply to 2.1 [link] [source]

The "PRAGMA est_count" command was an experimental design to help speed up ANALYZE for large (tera-byte sized) databases. The idea was subsequently folded into trunk. See the "PRAGMA analyze_limit" command. The table size estimation logic is in the source tree, but it is not accessible to application code at this time.

(7) By curmudgeon on 2021-02-01 15:16:47 in reply to 6 [source]

Thanks Richard although I'm sure what I read was more suited to dummies like myself.

(8) By TripeHound on 2021-02-01 21:02:39 in reply to 5 [link] [source]

There's this thread which ends with a post from Richard stating that (at the time of writing, 2017) ANALYZE does an exact count, but he has some code on a branch that does an approximate count, which might make it to trunk (no idea whether it did).

There's also an older thread (2014) that talks about counting rows in large tables, and includes this post from Richard that discusses a potential (but file-format-breaking) technique to do count(*) in logarithmic time by being able to store/compute a "rank" for entries in the b-tree table.

(9.1) By curmudgeon on 2021-02-02 12:12:00 edited from 9.0 in reply to 8 [link] [source]

Thanks TripleHound, interesting reading, but it wasn't any of those threads. From what I can determine "est_count" appears in several test files in the trunk but doesn't feature in the amalgamation.

tom@sp4:~/sqlite$ grep -Rl 'est_count' .
./ext/fts5/test/fts5aa.test
./src/test_func.c
./test/fuzz_malloc.test
./test/memleak.test
./test/walcrash3.test
./test/tester.tcl
./test/soak.test
./test/backcompat.test

EDIT: Ignore the above list. I didn't include the grep -w option so it was showing up files containing "test_counter". It returned nothing with -w set.