SQLite Forum

Approximate COUNT(*) using sqlite_stat4
Login

Approximate COUNT(*) using sqlite_stat4

(1) By antimatter15 on 2021-01-31 00:00:43 [link]

People are starting to use SQLite as an open data publication format with projects such as [Datasette](https://datasette.io/examples). 

A lot of summary exploration of data involves running `COUNT` on various fields. However, SQLite always does a full table scan to resolve a COUNT— which can end up being very expensive and slow to run. 

SQLite already has the ability to run `ANALYZE` and store statistics about the various tables and indices in sqlite_stat4 and sqlite_stat1. In the data publishing situation, it's reasonable to expect that `ANALYZE` has been run and all the gathered statistics are up-to-date. 

My question is whether nor not there are any plans (or any strong objections to) to addding a `PRAGMA approximate_count` that would make `COUNT` return estimates from the sqlite_stat tables instead of running a full table scan. 

Additionally, I'd appreciate pointers for how to get started if one were trying to implement this feature themselves.

(2.1) By curmudgeon on 2021-01-31 16:40:10 edited from 2.0 in reply to 1 [link]

I'm sure there's some way of getting an approximate row count but I cannot remember what it is despite having spent the best part of an hour searching for a post I made on the very subject on the old nabble forum. I'm guessing it's something to do with the DBSTAT virtual table. Something to do with counting leafs and using a sample to approximate the average number of rows on a leaf.

Hopefully someone with a better memory will step in because I'd like to make use of it myself.

Edit: Not sure if this is related

https://sqlite.org/srcx/timeline?r=est_count_pragma&c=2016-12-16+15%3A57%3A40

(3) By anonymous on 2021-01-31 18:48:33 in reply to 2.1 [link]

Do you mean this one from Keith Medcalf? 

http://sqlite.1065341.n5.nabble.com/Row-length-in-SQLITE-tp110579p110627.html

(5) By curmudgeon on 2021-02-01 10:05:43 in reply to 3 [link]

No, it wasn't that one. Having went through all my posts on that forum I'm starting to think I'm havering. I remain convinced however that I either saw a post from drh or an explanation in the documentation about getting an approximate count.

I think it was something to do with counting the lower level nodes and estimating the number of nodes leading from each of them via a sample. To be honest I've little knowledge of b-trees so I don't even know if that makes sense.

(8) By TripeHound on 2021-02-01 21:02:39 in reply to 5

There's [this thread](http://sqlite.1065341.n5.nabble.com/A-CTE-to-count-the-records-rows-for-each-table-td94615.html) which ends with a post from Richard stating that (at the time of writing, 2017) ANALYZE does an exact count, but he has some code on a branch that does an approximate count, which _might_ make it to trunk (no idea whether it did).

There's also an [older thread](http://sqlite.1065341.n5.nabble.com/Counting-rows-td79573.html) (2014) that talks about counting rows in large tables, and includes [this post from Richard](http://sqlite.1065341.n5.nabble.com/Counting-rows-tp79573p79604.html) that discusses a potential (but file-format-breaking) technique to do `count(*)` in logarithmic time by being able to store/compute a "rank" for entries in the b-tree table.

(9.1) By curmudgeon on 2021-02-02 12:12:00 edited from 9.0 in reply to 8 [link]

Thanks TripleHound, interesting reading, but it wasn't any of those threads. From what I can determine "est_count" appears in several test files in the trunk but doesn't feature in the amalgamation.

tom@sp4:~/sqlite$ grep -Rl 'est_count' .
./ext/fts5/test/fts5aa.test
./src/test_func.c
./test/fuzz_malloc.test
./test/memleak.test
./test/walcrash3.test
./test/tester.tcl
./test/soak.test
./test/backcompat.test

EDIT: Ignore the above list. I didn't include the grep -w option so it was showing up files containing "test_counter". It returned nothing with -w set.

(4) By antimatter15 on 2021-01-31 20:26:25 in reply to 2.1 [link]

If you're just interested in the equivalent of `SELECT COUNT(*) FROM table1` after `ANALYZE` was run, this should do the trick `select cast(substr(stat, 0, instr(stat, ' ')) as integer) as "COUNT(*)" from sqlite_stat1 where tbl = 'table1'  limit 1`

Looking a bit more into how `sqlite_stat4` and `sqlite_stat1` work, I'm beginning to think that they can't be used to approximate the count of any more sophisticated queries. 

I think it would still be helpful to have some PRAGMA that would automatically interpret such `COUNT(*)` expressions as the corresponding sqlite_stat1 lookup.

(6) By Richard Hipp (drh) on 2021-02-01 13:09:47 in reply to 2.1 [link]

The "[PRAGMA est_count][1]" command was an experimental design to help speed
up ANALYZE for large (tera-byte sized) databases.  The idea was subsequently
folded into trunk.  See the "[PRAGMA analyze_limit][2]" command.  The table
[size estimation logic][3] is in the source tree, but it is not accessible to
application code at this time.


[1]: https://www.sqlite.org/src/timeline?r=est_count_pragma
[2]: https://www.sqlite.org/pragma.html#pragma_analysis_limit
[3]: https://www.sqlite.org/src/info/4da25694985ac8f5?ln=5704-5727

(7) By curmudgeon on 2021-02-01 15:16:47 in reply to 6 [link]

Thanks Richard although I'm sure what I read was more suited to dummies like myself.