FTS index size
(1) By Spaced Cowboy (SpacedCowboy) on 2020-08-06 05:13:24 [link] [source]
Hi :)
I've been playing with making an FTS5 index on an email corpus (about 80k emails), and I was kind of surprised about the size of the generated index. Here's some files:
@tanelorn dbtests % ls -l total*.db
-rw-r--r-- 1 cowboy staff 11120689152 Aug 5 16:45 total_complete.db
-rw-r--r-- 1 cowboy staff 4610375680 Aug 5 17:44 total_content.db
-rw-r--r-- 1 cowboy staff 1976729600 Aug 5 17:44 total_fts.db
-rw-r--r-- 1 cowboy staff 27185152 Aug 5 17:45 total_without_fts.db
total_without_fts.db is the starting database. It's just the metadata without an FTS index on the body, and doesn't contain the actual body content. It's a good representation of what we have at the moment.
total_fts.db is after creating the FTS index with content set to "", detail=none and tokenize='porter unicode61'. The content (the mail body) is stored in the separate database 'total_content.db'
total_complete.db is the default 'detail' setting and the 'body' of the email stored in the same database (it's basically everything all together).
In our setup, the email content is typically stored on a different server to the database (and URLs are generated for clients to fetch stuff directly), so usually we'd only have the equivalent of 'total_without_fts.db' stored. I'm kind of surprised that the size balloons to ~70x the original when adding the FTS index (or almost half the size of the content). Am I being naive or does that mean I'm doing something wrong.
Is there a best-practice for getting FTS functionality (really, I only want the inverted index features) and keeping the size down ?
(2) By Dan Kennedy (dan) on 2020-08-06 14:04:59 in reply to 1 [link] [source]
I would have thought the index would be smaller than this. You could try running:
INSERT INTO ft(ft) VALUES('optimize');
where "ft" is the name of your fts5 table. Then run a VACUUM and see if the db shrinks any.
There might be something funny going on though. Years ago I was using the Enron email corpus for testing. In that case with default settings (largest index) the FTS data was just under 32% of the total database (i.e. of your "total_complete.db" file). With "detail=none" that dropped to around 7.6%.
If you could run [sqlite3_analyzer] on one of the databases and post the number of pages in each database table we might be able to say more.
https://www.sqlite.org/download.html
Dan.
(3) By Spaced Cowboy (SpacedCowboy) on 2020-08-09 00:16:04 in reply to 2 [source]
So to circle back to this (yes, it's taken this long to properly rewrite it ... :) after reading your reply I went back and looked at my email-ingest code. Turns out I'd goofed there, and we were ingesting (considering it to be text) a whole bunch of binary data within the email corpus, which was causing the database size to grow significantly.
After rewriting the ingest-to-the-database code, I got an altogether more reasonable
@atlantis mdir % ls -l
total 558568
-rwxr--r--@ 1 cowboy staff 533 Aug 8 16:38 INGEST.wrap
-rwxr-xr-x 1 cowboy staff 1390208 Aug 8 16:39 ingest
drwxr-xr-x 236 cowboy staff 7552 Aug 6 10:21 src
-rw-r--r-- 1 cowboy staff 283463680 Aug 8 16:45 total_complete.db
Only a saving of roughly 39x :) vs the one in the original post...
-rw-r--r-- 1 cowboy staff 11120689152 Aug 5 16:45 total_complete.db
So this post continues a long and storied history of PEBCAK whenever I've had an issue with SQLite. Keep it up guys, it's brilliant :)