SQLite User Forum

SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows
Login

SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows

(1.1) By Simon Willison (simonw) on 2020-09-07 16:52:59 edited from 1.0 [source]

I'm seeing a weird issue with some of the SQLite databases that I am using with the FTS5 module.

I have a database with a licenses table that contains 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses

I created a licenses_fts table for this using the following SQL:

CREATE VIRTUAL TABLE [licenses_fts] USING FTS5 (
    [name],
    content=[licenses]
);

That table also has 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses_fts

Somehow the accompanying licenses_fts_docsize shadow table now has 9,141 rows in it! https://github-to-sqlite.dogsheep.net/github/licenses_fts_docsize

And licenses_fts_data has 41 rows - should I expect that to have 7 rows? https://github-to-sqlite.dogsheep.net/github/licenses_fts_data

Anyone got any ideas what's going on here? The full database (22MB) can be downloaded from https://github-to-sqlite.dogsheep.net/github.db

I have a hunch that it might be a problem with the triggers. These are the triggers that are updating that FTS table: https://github-to-sqlite.dogsheep.net/github?sql=select+*+from+sqlite_master+where+type+%3D+%27trigger%27+and+tbl_name+%3D+%27licenses%27

type name tbl_name rootpage sql
trigger licenses_ai licenses 0 CREATE TRIGGER [licenses_ai] AFTER INSERT ON [licenses] BEGIN INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END
trigger licenses_ad licenses 0 CREATE TRIGGER [licenses_ad] AFTER DELETE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); END
trigger licenses_au licenses 0 CREATE TRIGGER [licenses_au] AFTER UPDATE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END

(2) By Dan Kennedy (dan) on 2020-09-07 17:40:57 in reply to 1.1 [link] [source]

The FTS5 data structures are internally consistent and there is no corruption at the database level but, as you say, the "licenses_fts" index does not match the contents of the "licenses" table.

Your triggers look correct to me.

Are you using "REPLACE INTO", or "UPDATE OR REPLACE" on the "licenses" table without having first executed "PRAGMA recursive_triggers = 1"? The docs note that delete triggers will not be fired in this case, which would explain things. Second paragraph under "REPLACE" here:

https://www.sqlite.org/lang_conflict.html

Other than that, could there be any code that writes to "licenses_fts" directly - bypassing "licences" altogether.

Dan.

(3) By Simon Willison (simonw) on 2020-09-07 17:59:14 in reply to 2 [link] [source]

Thanks - that's really useful. I didn't know about recursive_triggers.

I just noticed that the triggers are operating on rowid but the license table has a text primary key, so I'm thinking maybe that could be part of the problem:

CREATE TABLE [licenses] (
   [key] TEXT PRIMARY KEY,
   [name] TEXT,
   [spdx_id] TEXT,
   [url] TEXT,
   [node_id] TEXT
);

(4) By Simon Willison (simonw) on 2020-09-07 18:22:47 in reply to 3 [link] [source]

Another possibility: I noticed on https://www.sqlite.org/rowidtable.html this note:

If the rowid is not aliased by INTEGER PRIMARY KEY then it is not persistent and might change. In particular the VACUUM command will change rowids for tables that do not declare an INTEGER PRIMARY KEY. Therefore, applications should not normally access the rowid directly, but instead use an INTEGER PRIMARY KEY.

I ran VACUUM on that database quite often. Is it possible that running VACUUM is rewriting the rowid values for the licenses table but leaving the values in the licenses_fts table unchanged? If so, maybe I get a new set of orphaned duplicate rows every time I vacuum?

(5) By Dan Kennedy (dan) on 2020-09-07 18:22:57 in reply to 3 [link] [source]

the triggers are operating on rowid but the license table has a text primary key

Shouldn't be a problem. The table still has a rowid.

(6) By Dan Kennedy (dan) on 2020-09-07 18:38:01 in reply to 4 [link] [source]

Forgot about that. You should probably fix that.

In practice running VACUUM won't change rowids for any table that has at least one index on it, and your "TEXT PRIMARY KEY" column counts as an index. But maybe if you're doing the VACUUM with a very old shell tool or something.

If you run VACUUM do the rowids in "licenses" change?

(7) By Simon Willison (simonw) on 2020-09-07 19:11:53 in reply to 6 [link] [source]

Not as far as I can tell:

/tmp % sqlite-utils github.db 'select rowid, * from licenses' -t
  rowid  key           name                                            spdx_id       url                                           node_id
-------  ------------  ----------------------------------------------  ------------  --------------------------------------------  ----------------
     58  cc-by-4.0     Creative Commons Attribution 4.0 International  CC-BY-4.0     https://api.github.com/licenses/cc-by-4.0     MDc6TGljZW5zZTI1
    109  unlicense     The Unlicense                                   Unlicense     https://api.github.com/licenses/unlicense     MDc6TGljZW5zZTE1
    112  bsd-3-clause  BSD 3-Clause "New" or "Revised" License         BSD-3-Clause  https://api.github.com/licenses/bsd-3-clause  MDc6TGljZW5zZTU=
   6189  other         Other                                           NOASSERTION                                                 MDc6TGljZW5zZTA=
   8860  gpl-3.0       GNU General Public License v3.0                 GPL-3.0       https://api.github.com/licenses/gpl-3.0       MDc6TGljZW5zZTk=
   8932  mit           MIT License                                     MIT           https://api.github.com/licenses/mit           MDc6TGljZW5zZTEz
   9150  apache-2.0    Apache License 2.0                              Apache-2.0    https://api.github.com/licenses/apache-2.0    MDc6TGljZW5zZTI=
/tmp % sqlite-utils github.db 'vacuum'                          
[{"rows_affected": -1}]
/tmp % sqlite-utils github.db 'select rowid, * from licenses' -t
  rowid  key           name                                            spdx_id       url                                           node_id
-------  ------------  ----------------------------------------------  ------------  --------------------------------------------  ----------------
     58  cc-by-4.0     Creative Commons Attribution 4.0 International  CC-BY-4.0     https://api.github.com/licenses/cc-by-4.0     MDc6TGljZW5zZTI1
    109  unlicense     The Unlicense                                   Unlicense     https://api.github.com/licenses/unlicense     MDc6TGljZW5zZTE1
    112  bsd-3-clause  BSD 3-Clause "New" or "Revised" License         BSD-3-Clause  https://api.github.com/licenses/bsd-3-clause  MDc6TGljZW5zZTU=
   6189  other         Other                                           NOASSERTION                                                 MDc6TGljZW5zZTA=
   8860  gpl-3.0       GNU General Public License v3.0                 GPL-3.0       https://api.github.com/licenses/gpl-3.0       MDc6TGljZW5zZTk=
   8932  mit           MIT License                                     MIT           https://api.github.com/licenses/mit           MDc6TGljZW5zZTEz
   9150  apache-2.0    Apache License 2.0                              Apache-2.0    https://api.github.com/licenses/apache-2.0    MDc6TGljZW5zZTI=

I'm going to do some deeper digging into what happens when I run the FTS-related code (this is actually all happening when I run my https://github.com/dogsheep/github-to-sqlite script). I'll report back if I find anything useful.

(8.2) By Simon Willison (simonw) on 2020-09-07 20:48:11 edited from 8.1 in reply to 7 [link] [source]

I've managed to replicate the issue!

(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 35},
 {"table": "licenses_fts_idx", "count": 16},
 {"table": "licenses_fts_docsize", "count": 9151},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
(github-to-sqlite) /tmp % github-to-sqlite repos github.db dogsheep       
(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 45},
 {"table": "licenses_fts_idx", "count": 26},
 {"table": "licenses_fts_docsize", "count": 9161},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
Note how the number of rows in licenses_fts_docsize goes from 9151 to 9161.

I dug into this and it turns out that github-to-sqlite repos command does indeed run the following SQL 10 times:

INSERT OR REPLACE INTO [licenses] ([key], [name], [node_id], [spdx_id], [url]) VALUES 
   (?, ?, ?, ?, ?);

So I like your PRAGMA recursive_triggers theory. I'm going to give that a try.

(9) By Simon Willison (simonw) on 2020-09-07 20:00:56 in reply to 8.1 [link] [source]

I can confirm that using PRAGMA recursive_triggers=on; appears to fix this issue - running the script no longer adds ten new records to the licenses_fts_docsize table.

This leaves me with a smaller problem: I now have a whole bunch of tables with enormous amounts of obsolete data in their *_fts_docsize tables. I'll write myself a utility for cleaning those up, probably by running something like this:

delete from licenses_fts_docsize
where rowid not in (select rowid from licenses_fts)

(10) By Simon Willison (simonw) on 2020-09-07 20:18:28 in reply to 9 [link] [source]

I'm not sure if I should delete data from any of the other tables - https://github-to-sqlite.dogsheep.net/github/licenses_fts_data for example has 41 rows in (when licenses_fts has only 7) but those rows don't seem to have IDs that correspond to the rowid column in licenses_fts.

(11) By Simon Willison (simonw) on 2020-09-07 23:16:26 in reply to 10 [link] [source]

Thanks for your help Dan - it turned out to be the INSERT OR REPLACE issue you suggested. I shipped a new release of my software with a fix: https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-17

(12) By Dan Kennedy (dan) on 2020-09-08 08:20:09 in reply to 9 [link] [source]

Easiest thing would be to run a 'rebuild' to rebuild the FTS index from scratch based on the contents of the content table. i.e.

    INSERT INTO licenses_fts(licenses_fts) VALUES('rebuild');

https://www.sqlite.org/fts5.html#the_rebuild_command

Dan.

(13) By Simon Willison (simonw) on 2020-09-08 23:41:12 in reply to 12 [link] [source]

That's a much better fix than the hack I had where I would delete spare rows from that table.

I just released sqlite-utils 2.18 which adds a sqlite-utils rebuild-fts data.db command-line command and a db["my_table"].rebuild_fts() Python API method: https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-18