FTS index size

(1) By Spaced Cowboy (SpacedCowboy) on 2020-08-06 05:13:24 [link] [source]

Hi :)

I've been playing with making an FTS5 index on an email corpus (about 80k emails), and I was kind of surprised about the size of the generated index. Here's some files:

 @tanelorn dbtests % ls -l total*.db
 -rw-r--r--  1 cowboy  staff  11120689152 Aug  5 16:45 total_complete.db
 -rw-r--r--  1 cowboy  staff   4610375680 Aug  5 17:44 total_content.db
 -rw-r--r--  1 cowboy  staff   1976729600 Aug  5 17:44 total_fts.db
 -rw-r--r--  1 cowboy  staff     27185152 Aug  5 17:45 total_without_fts.db

total_without_fts.db is the starting database. It's just the metadata without an FTS index on the body, and doesn't contain the actual body content. It's a good representation of what we have at the moment.

total_fts.db is after creating the FTS index with content set to "", detail=none and tokenize='porter unicode61'. The content (the mail body) is stored in the separate database 'total_content.db'

total_complete.db is the default 'detail' setting and the 'body' of the email stored in the same database (it's basically everything all together).

In our setup, the email content is typically stored on a different server to the database (and URLs are generated for clients to fetch stuff directly), so usually we'd only have the equivalent of 'total_without_fts.db' stored. I'm kind of surprised that the size balloons to ~70x the original when adding the FTS index (or almost half the size of the content). Am I being naive or does that mean I'm doing something wrong.

Is there a best-practice for getting FTS functionality (really, I only want the inverted index features) and keeping the size down ?

(2) By Dan Kennedy (dan) on 2020-08-06 14:04:59 in reply to 1 [link] [source]

I would have thought the index would be smaller than this. You could try running:

INSERT INTO ft(ft) VALUES('optimize');

where "ft" is the name of your fts5 table. Then run a VACUUM and see if the db shrinks any.

There might be something funny going on though. Years ago I was using the Enron email corpus for testing. In that case with default settings (largest index) the FTS data was just under 32% of the total database (i.e. of your "total_complete.db" file). With "detail=none" that dropped to around 7.6%.

If you could run [sqlite3_analyzer] on one of the databases and post the number of pages in each database table we might be able to say more.

https://www.sqlite.org/download.html

Dan.

(3) By Spaced Cowboy (SpacedCowboy) on 2020-08-09 00:16:04 in reply to 2 [source]

So to circle back to this (yes, it's taken this long to properly rewrite it ... :) after reading your reply I went back and looked at my email-ingest code. Turns out I'd goofed there, and we were ingesting (considering it to be text) a whole bunch of binary data within the email corpus, which was causing the database size to grow significantly.

After rewriting the ingest-to-the-database code, I got an altogether more reasonable

 @atlantis mdir % ls -l
 total 558568
 -rwxr--r--@   1 cowboy  staff          533 Aug  8 16:38 INGEST.wrap
 -rwxr-xr-x    1 cowboy  staff      1390208 Aug  8 16:39 ingest
 drwxr-xr-x  236 cowboy  staff         7552 Aug  6 10:21 src
 -rw-r--r--    1 cowboy  staff    283463680 Aug  8 16:45 total_complete.db

Only a saving of roughly 39x :) vs the one in the original post...

  -rw-r--r--   1 cowboy  staff  11120689152 Aug  5 16:45 total_complete.db

So this post continues a long and storied history of PEBCAK whenever I've had an issue with SQLite. Keep it up guys, it's brilliant :)