SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows
(1.1) By Simon Willison (simonw) on 2020-09-07 16:52:59 edited from 1.0 [source]
I'm seeing a weird issue with some of the SQLite databases that I am using with the FTS5 module.
I have a database with a licenses
table that contains 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses
I created a licenses_fts
table for this using the following SQL:
CREATE VIRTUAL TABLE [licenses_fts] USING FTS5 (
[name],
content=[licenses]
);
That table also has 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses_fts
Somehow the accompanying licenses_fts_docsize
shadow table now has 9,141 rows in it! https://github-to-sqlite.dogsheep.net/github/licenses_fts_docsize
And licenses_fts_data
has 41 rows - should I expect that to have 7 rows? https://github-to-sqlite.dogsheep.net/github/licenses_fts_data
Anyone got any ideas what's going on here? The full database (22MB) can be downloaded from https://github-to-sqlite.dogsheep.net/github.db
I have a hunch that it might be a problem with the triggers. These are the triggers that are updating that FTS table: https://github-to-sqlite.dogsheep.net/github?sql=select+*+from+sqlite_master+where+type+%3D+%27trigger%27+and+tbl_name+%3D+%27licenses%27
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
trigger | licenses_ai | licenses | 0 | CREATE TRIGGER [licenses_ai] AFTER INSERT ON [licenses] BEGIN INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END |
trigger | licenses_ad | licenses | 0 | CREATE TRIGGER [licenses_ad] AFTER DELETE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); END |
trigger | licenses_au | licenses | 0 | CREATE TRIGGER [licenses_au] AFTER UPDATE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END |
(2) By Dan Kennedy (dan) on 2020-09-07 17:40:57 in reply to 1.1 [link] [source]
The FTS5 data structures are internally consistent and there is no corruption at the database level but, as you say, the "licenses_fts" index does not match the contents of the "licenses" table.
Your triggers look correct to me.
Are you using "REPLACE INTO", or "UPDATE OR REPLACE" on the "licenses" table without having first executed "PRAGMA recursive_triggers = 1"? The docs note that delete triggers will not be fired in this case, which would explain things. Second paragraph under "REPLACE" here:
https://www.sqlite.org/lang_conflict.html
Other than that, could there be any code that writes to "licenses_fts" directly - bypassing "licences" altogether.
Dan.
(3) By Simon Willison (simonw) on 2020-09-07 17:59:14 in reply to 2 [link] [source]
Thanks - that's really useful. I didn't know about recursive_triggers
.
I just noticed that the triggers are operating on rowid
but the license
table has a text primary key, so I'm thinking maybe that could be part of the problem:
CREATE TABLE [licenses] (
[key] TEXT PRIMARY KEY,
[name] TEXT,
[spdx_id] TEXT,
[url] TEXT,
[node_id] TEXT
);
(4) By Simon Willison (simonw) on 2020-09-07 18:22:47 in reply to 3 [link] [source]
Another possibility: I noticed on https://www.sqlite.org/rowidtable.html this note:
If the rowid is not aliased by INTEGER PRIMARY KEY then it is not persistent and might change. In particular the VACUUM command will change rowids for tables that do not declare an INTEGER PRIMARY KEY. Therefore, applications should not normally access the rowid directly, but instead use an INTEGER PRIMARY KEY.
I ran VACUUM
on that database quite often. Is it possible that running VACUUM
is rewriting the rowid
values for the licenses
table but leaving the values in the licenses_fts
table unchanged? If so, maybe I get a new set of orphaned duplicate rows every time I vacuum?
(6) By Dan Kennedy (dan) on 2020-09-07 18:38:01 in reply to 4 [link] [source]
Forgot about that. You should probably fix that.
In practice running VACUUM won't change rowids for any table that has at least one index on it, and your "TEXT PRIMARY KEY" column counts as an index. But maybe if you're doing the VACUUM with a very old shell tool or something.
If you run VACUUM do the rowids in "licenses" change?
(7) By Simon Willison (simonw) on 2020-09-07 19:11:53 in reply to 6 [link] [source]
Not as far as I can tell:
/tmp % sqlite-utils github.db 'select rowid, * from licenses' -t
rowid key name spdx_id url node_id
------- ------------ ---------------------------------------------- ------------ -------------------------------------------- ----------------
58 cc-by-4.0 Creative Commons Attribution 4.0 International CC-BY-4.0 https://api.github.com/licenses/cc-by-4.0 MDc6TGljZW5zZTI1
109 unlicense The Unlicense Unlicense https://api.github.com/licenses/unlicense MDc6TGljZW5zZTE1
112 bsd-3-clause BSD 3-Clause "New" or "Revised" License BSD-3-Clause https://api.github.com/licenses/bsd-3-clause MDc6TGljZW5zZTU=
6189 other Other NOASSERTION MDc6TGljZW5zZTA=
8860 gpl-3.0 GNU General Public License v3.0 GPL-3.0 https://api.github.com/licenses/gpl-3.0 MDc6TGljZW5zZTk=
8932 mit MIT License MIT https://api.github.com/licenses/mit MDc6TGljZW5zZTEz
9150 apache-2.0 Apache License 2.0 Apache-2.0 https://api.github.com/licenses/apache-2.0 MDc6TGljZW5zZTI=
/tmp % sqlite-utils github.db 'vacuum'
[{"rows_affected": -1}]
/tmp % sqlite-utils github.db 'select rowid, * from licenses' -t
rowid key name spdx_id url node_id
------- ------------ ---------------------------------------------- ------------ -------------------------------------------- ----------------
58 cc-by-4.0 Creative Commons Attribution 4.0 International CC-BY-4.0 https://api.github.com/licenses/cc-by-4.0 MDc6TGljZW5zZTI1
109 unlicense The Unlicense Unlicense https://api.github.com/licenses/unlicense MDc6TGljZW5zZTE1
112 bsd-3-clause BSD 3-Clause "New" or "Revised" License BSD-3-Clause https://api.github.com/licenses/bsd-3-clause MDc6TGljZW5zZTU=
6189 other Other NOASSERTION MDc6TGljZW5zZTA=
8860 gpl-3.0 GNU General Public License v3.0 GPL-3.0 https://api.github.com/licenses/gpl-3.0 MDc6TGljZW5zZTk=
8932 mit MIT License MIT https://api.github.com/licenses/mit MDc6TGljZW5zZTEz
9150 apache-2.0 Apache License 2.0 Apache-2.0 https://api.github.com/licenses/apache-2.0 MDc6TGljZW5zZTI=
I'm going to do some deeper digging into what happens when I run the FTS-related code (this is actually all happening when I run my https://github.com/dogsheep/github-to-sqlite script). I'll report back if I find anything useful.
(8.2) By Simon Willison (simonw) on 2020-09-07 20:48:11 edited from 8.1 in reply to 7 [link] [source]
I've managed to replicate the issue!
(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
{"table": "licenses", "count": 7},
{"table": "licenses_fts_data", "count": 35},
{"table": "licenses_fts_idx", "count": 16},
{"table": "licenses_fts_docsize", "count": 9151},
{"table": "licenses_fts_config", "count": 1},
{"table": "licenses_fts", "count": 7},
(github-to-sqlite) /tmp % github-to-sqlite repos github.db dogsheep
(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
{"table": "licenses", "count": 7},
{"table": "licenses_fts_data", "count": 45},
{"table": "licenses_fts_idx", "count": 26},
{"table": "licenses_fts_docsize", "count": 9161},
{"table": "licenses_fts_config", "count": 1},
{"table": "licenses_fts", "count": 7},
Note how the number of rows in licenses_fts_docsize
goes from 9151 to 9161.
I dug into this and it turns out that github-to-sqlite repos
command does indeed run the following SQL 10 times:
INSERT OR REPLACE INTO [licenses] ([key], [name], [node_id], [spdx_id], [url]) VALUES
(?, ?, ?, ?, ?);
So I like your PRAGMA recursive_triggers
theory. I'm going to give that a try.
(9) By Simon Willison (simonw) on 2020-09-07 20:00:56 in reply to 8.1 [link] [source]
I can confirm that using PRAGMA recursive_triggers=on;
appears to fix this issue - running the script no longer adds ten new records to the licenses_fts_docsize
table.
This leaves me with a smaller problem: I now have a whole bunch of tables with enormous amounts of obsolete data in their *_fts_docsize
tables. I'll write myself a utility for cleaning those up, probably by running something like this:
delete from licenses_fts_docsize
where rowid not in (select rowid from licenses_fts)
(10) By Simon Willison (simonw) on 2020-09-07 20:18:28 in reply to 9 [link] [source]
I'm not sure if I should delete data from any of the other tables - https://github-to-sqlite.dogsheep.net/github/licenses_fts_data for example has 41 rows in (when licenses_fts
has only 7) but those rows don't seem to have IDs that correspond to the rowid
column in licenses_fts
.
(11) By Simon Willison (simonw) on 2020-09-07 23:16:26 in reply to 10 [link] [source]
Thanks for your help Dan - it turned out to be the INSERT OR REPLACE
issue you suggested. I shipped a new release of my software with a fix: https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-17
(12) By Dan Kennedy (dan) on 2020-09-08 08:20:09 in reply to 9 [link] [source]
Easiest thing would be to run a 'rebuild' to rebuild the FTS index from scratch based on the contents of the content table. i.e.
INSERT INTO licenses_fts(licenses_fts) VALUES('rebuild');
https://www.sqlite.org/fts5.html#the_rebuild_command
Dan.
(13) By Simon Willison (simonw) on 2020-09-08 23:41:12 in reply to 12 [link] [source]
That's a much better fix than the hack I had where I would delete spare rows from that table.
I just released sqlite-utils 2.18 which adds a sqlite-utils rebuild-fts data.db
command-line command and a db["my_table"].rebuild_fts()
Python API method: https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-18
(5) By Dan Kennedy (dan) on 2020-09-07 18:22:57 in reply to 3 [link] [source]
the triggers are operating on rowid but the license table has a text primary key
Shouldn't be a problem. The table still has a rowid.