SQLite Forum

Timeline

47 forum posts by user simonw occurring on or before 2020-09-28 01:45:01.

2020-09-28
01:45 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: 6952c2332c user: simonw)

That's really useful, thank you for the example code!

2020-09-27
20:43 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: b445365ed0 user: simonw)
20:29 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: b4179cb0b6 user: simonw)

I think I've figured out the floating point equivalent:

select
  cast(cast(:value AS REAL) AS TEXT) in (:value, :value || '.0')

I tried this first:

select
  cast(cast(:value AS REAL) AS TEXT) = :value

But that failed on values like 1, because the cast turned them into 1.0.

Adding that comparison against :value || '.0' seems to catch that case nicely. Demo:

select
  value,
  cast(cast(value AS REAL) AS TEXT) in (value, value || '.0') as is_valid_float
from
  (
    select
      '1' as value
    union
    select
      '1.1' as value
    union
    select
      'dog' as value
    union
    select
      null as value
  )

https://latest.datasette.io/fixtures?sql=select%0D%0A++value%2C%0D%0A++cast%28cast%28value+AS+REAL%29+AS+TEXT%29+in+%28value%2C+value+%7C%7C+%27.0%27%29+as+is_valid_float%0D%0Afrom%0D%0A++%28%0D%0A++++select%0D%0A++++++%271%27+as+value%0D%0A++++union%0D%0A++++select%0D%0A++++++%271.1%27+as+value%0D%0A++++union%0D%0A++++select%0D%0A++++++%27dog%27+as+value%0D%0A++++union%0D%0A++++select%0D%0A++++++null+as+value%0D%0A++%29

Results:

value     is_valid_float
(null)    (null)
1         1
1.1       1
dog       0
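
For anyone who wants to poke at this expression outside Datasette, here's a minimal sketch using Python's sqlite3 module (the sample values are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
sql = "select cast(cast(:value AS REAL) AS TEXT) in (:value, :value || '.0')"
for value in ("1", "1.1", "dog", "1 dog"):
    # Prints: 1 1, then 1.1 1, then dog 0, then 1 dog 0
    is_valid_float = conn.execute(sql, {"value": value}).fetchone()[0]
    print(value, is_valid_float)
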
20:22 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: 054532e0ce user: simonw)

Unfortunately it looks like abs() gets confused by strings that start with an integer but then contain non-integer text:

select abs('101'), abs('3.2'), abs('dog'), abs('1 dog');

Returns:

abs('101')  abs('3.2')  abs('dog')  abs('1 dog')
101.0       3.2         0.0         1.0

https://latest.datasette.io/fixtures?sql=select+abs%28%27101%27%29%2C+abs%28%273.2%27%29%2C+abs%28%27dog%27%29%2C+abs%28%271+dog%27%29%3B

20:20 Edit reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: efea14aa4f user: simonw)

I don't want to retrieve all of the data into my program's memory if I can at all avoid it, because it could be millions of rows. That's why I'm looking for a solution that executes in SQLite as much as possible.

My integer-detecting query seems to work fine in that respect, but I'm checking to see if there are even more efficient queries I can use.

20:17 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: 5a669534c3 user: simonw)

I don't want to retrieve all of the data into my program's memory, because it could be millions of rows. That's why I'm looking for a solution that executes in SQLite as much as possible.

My integer-detecting query seems to work fine in that respect, but I'm checking to see if there are even more efficient queries I can use.

05:02 Edit reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: 20bdcda226 user: simonw)

Yup, I understand column affinities.

The problem I'm trying to solve is: given a CSV file produced by someone else entirely, how can I efficiently detect that "every single value in the category column consists of a string of integer characters" - so I can automatically create a new table with the correct affinities and then execute that conversion via an insert (using the transform tool I wrote about here: https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/ )

So for example say I have a table called people that looks like this, where every column is of TEXT affinity:

name category score
Joan 1 3.5
Rahul 2 4.5
Amanda 1 1.1

I plan to run SQL queries that detect that the category column contains only strings that are valid integers, and the score column contains only strings that are floating point numbers - so I can then automatically create a new table with the correct affinities and copy the data across.

This query is the best I've come up with for efficiently answering the question "does the category column contain any strings that do NOT work as integers" - but I'm not sure if it's the best approach, and I'm also interested in figuring out an equivalent mechanism that deals with floating point numbers.

select
  'contains_non_integer' as result
from
  people
where
  cast(cast(category AS INTEGER) AS TEXT) != category
limit
  1
05:02 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: d9ab9a47e7 user: simonw)

Yup, I understand column affinities.

The problem I'm trying to solve is: given a CSV file produced by someone else entirely, how can I efficiently detect that "every single value in the category column consists of a string of integer characters" - so I can automatically create a new table with the correct affinities and then execute that conversion via an insert (using the transform tool I wrote about here: https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-table/ )

So for example say I have a table called people that looks like this, where every column is of TEXT affinity:

name category score
Joan 1 3.5
Rahul 2 4.5
Amanda 1 1.1

I plan to run SQL queries that detect that the category column contains only strings that are valid integers, and the score column contains only strings that are floating point numbers - so I can then automatically create a new table with the correct affinities and copy the data across.

This query is the best I've come up with for efficiently answering the question "does the category column contain any strings that do NOT work as integers" - but I'm not sure if it's the best approach, and I'm also interested in figuring out an equivalent mechanism that deals with floating point numbers.

select
  'contains_non_integer' as result
from
  people
where
  cast(cast(category AS INTEGER) AS TEXT) != category
limit
  1
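
Here's roughly how I expect to wire those two checks together in Python - just a sketch with a hypothetical suggest_type() helper, not the final sqlite-utils implementation:

import sqlite3

def suggest_type(conn, table, column):
    # Hypothetical helper: returns "INTEGER", "REAL" or "TEXT".
    # Both checks use the limit-1 trick from this thread, so they stop
    # at the first failing value instead of scanning the whole column.
    non_integer = conn.execute(
        f"select 1 from [{table}] where "
        f"cast(cast([{column}] AS INTEGER) AS TEXT) != [{column}] limit 1"
    ).fetchone()
    if non_integer is None:
        return "INTEGER"
    non_float = conn.execute(
        f"select 1 from [{table}] where cast(cast([{column}] AS REAL) AS TEXT) "
        f"not in ([{column}], [{column}] || '.0') limit 1"
    ).fetchone()
    if non_float is None:
        return "REAL"
    return "TEXT"

conn = sqlite3.connect(":memory:")
conn.execute("create table people (name text, category text, score text)")
conn.executemany(
    "insert into people values (?, ?, ?)",
    [("Joan", "1", "3.5"), ("Rahul", "2", "4.5"), ("Amanda", "1", "1.1")],
)
print(suggest_type(conn, "people", "category"))  # INTEGER
print(suggest_type(conn, "people", "score"))     # REAL
print(suggest_type(conn, "people", "name"))      # TEXT
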
2020-09-26
17:43 Post: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: ab0dcd66ef user: simonw)

I often import data from CSV files (usually from government open data portal websites).

This results in a table where every column is of type TEXT - but I'd like to convert columns that appear to exclusively contain integers or floating point numbers to INTEGER or REAL columns.

So I'm writing code to detect the suggested type for a column, and I want to do it as efficiently as possible.

I've worked out the following recipe for detecting a text column which is full of integers:

select
  'contains_non_integer' as result
from
  mytable
where
  cast(cast(mycolumn AS INTEGER) AS TEXT) != mycolumn
limit
  1

This query will scan through the table in (I think) primary key order, and will terminate and return a single row the moment it runs across the first text value that doesn't represent an integer.

If the column does contain exclusively integers, it will scan the entire column and eventually return zero rows at the end.

So two questions:

  1. Is this the most efficient way to do this? Is there another query I can use that will be more effective?
  2. What's an equivalent recipe for floating point numbers? I've not yet figured one out.
2020-09-24
17:52 Reply: sqlite-utils transform - command-line tool implementing the advanced ALTER TABLE pattern (artifact: e65caafb51 user: simonw)

Running extract against a table with 680,000 rows was taking 12 minutes to run. I did some work on optimizing it and came up with an approach that knocked it down from 12 minutes to just 4 seconds!

Details on the optimization here: https://github.com/simonw/sqlite-utils/issues/172

2020-09-23
17:48 Reply: sqlite-utils transform - command-line tool implementing the advanced ALTER TABLE pattern (artifact: 125001b7aa user: simonw)

Also in sqlite-utils 2.20 is the new sqlite-utils extract command, which can be used to extract columns into a separate table.

For example, if your table looks like this:

CREATE TABLE "salaries" (
   [rowid] INTEGER PRIMARY KEY,
   [Year Type] TEXT,
   [Year] TEXT,
   [Organization Group Code] TEXT,
   [Organization Group] TEXT,
   [Department Code] TEXT,
   [Department] TEXT,
   [Union Code] TEXT,
   [Union] TEXT,
   [Job Family Code] TEXT,
   [Job Family] TEXT,
   [Job Code] TEXT,
   [Job] TEXT,
   [Employee Identifier] INTEGER,
   [Salaries] FLOAT,
   [Overtime] FLOAT,
   [Other Salaries] FLOAT,
   [Total Salary] FLOAT,
   [Retirement] FLOAT,
   [Health and Dental] FLOAT,
   [Other Benefits] FLOAT,
   [Total Benefits] FLOAT,
   [Total Compensation] FLOAT
);

You can run commands that will transform it into this:

CREATE TABLE [organization_groups] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE [departments] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE [unions] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE [job_families] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE [jobs] (
   [id] INTEGER PRIMARY KEY,
   [code] TEXT,
   [name] TEXT
);
CREATE TABLE IF NOT EXISTS "salaries" (
   [rowid] INTEGER PRIMARY KEY,
   [Year Type] TEXT,
   [Year] TEXT,
   [organization_group_id] INTEGER REFERENCES [organization_groups]([id]),
   [department_id] INTEGER REFERENCES [departments]([id]),
   [union_id] INTEGER REFERENCES [unions]([id]),
   [job_family_id] INTEGER REFERENCES [job_families]([id]),
   [job_id] INTEGER REFERENCES jobs(id),
   [Employee Identifier] INTEGER,
   [Salaries] FLOAT,
   [Overtime] FLOAT,
   [Other Salaries] FLOAT,
   [Total Salary] FLOAT,
   [Retirement] FLOAT,
   [Health and Dental] FLOAT,
   [Other Benefits] FLOAT,
   [Total Benefits] FLOAT,
   [Total Compensation] FLOAT
);

I wrote about that in detail here: https://simonwillison.net/2020/Sep/23/sqlite-utils-extract/
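
Under the hood this follows a pattern you could run by hand. Here's a rough sqlite3 sketch of that pattern for just the department columns - the real command also handles the foreign key setup and the final transform step, so treat this as an illustration rather than what extract literally executes:

import sqlite3

conn = sqlite3.connect("salaries.db")  # hypothetical database file
with conn:
    # Build a lookup table from the distinct values
    conn.execute(
        "CREATE TABLE [departments] ("
        "[id] INTEGER PRIMARY KEY, [code] TEXT, [name] TEXT)"
    )
    conn.execute(
        "INSERT INTO [departments] ([code], [name]) "
        "SELECT DISTINCT [Department Code], [Department] FROM [salaries]"
    )
    # Point each salaries row at its lookup row
    conn.execute("ALTER TABLE [salaries] ADD COLUMN [department_id] INTEGER")
    conn.execute(
        "UPDATE [salaries] SET [department_id] = ("
        "SELECT [id] FROM [departments] WHERE "
        "[code] = [salaries].[Department Code] "
        "AND [name] = [salaries].[Department])"
    )
    # Dropping the original two columns then uses the same
    # create-copy-drop-rename trick that sqlite-utils transform implements.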

03:04 Edit: sqlite-utils transform - command-line tool implementing the advanced ALTER TABLE pattern (artifact: 16412b1c5f user: simonw)

The SQLite ALTER TABLE documentation describes a pattern for running advanced alter table operations (dropping columns, changing column types etc). The recommendation is to create a new table with the desired shape, copy the data across from the old table, then drop the old table and rename the new one to replace it.

Today I added support for this pattern to my sqlite-utils package, which now offers this capability as both a command-line utility and a Python library method.

Here's an example of the command-line utility in action:

% wget https://latest.datasette.io/fixtures.db
% sqlite3 fixtures.db '.schema roadside_attractions'
CREATE TABLE roadside_attractions (
    pk integer primary key,
    name text,
    address text,
    latitude real,
    longitude real
);
% sqlite-utils transform fixtures.db roadside_attractions \
    --rename pk id \
    --default name Untitled \
    --drop address
% sqlite3 fixtures.db '.schema roadside_attractions'       
CREATE TABLE IF NOT EXISTS "roadside_attractions" (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT DEFAULT 'Untitled',
   [latitude] FLOAT,
   [longitude] FLOAT
);
You can use the --sql option to output the SQL that would be executed, which is useful for understanding how it works:
% sqlite-utils transform fixtures.db roadside_attractions \
    --rename pk id \
    --default name Untitled \
    --drop address \
    --sql
CREATE TABLE [roadside_attractions_new_d98d349ab698] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT DEFAULT 'Untitled',
   [latitude] FLOAT,
   [longitude] FLOAT
);
INSERT INTO [roadside_attractions_new_d98d349ab698] ([id], [name], [latitude], [longitude])
   SELECT [pk], [name], [latitude], [longitude] FROM [roadside_attractions];
DROP TABLE [roadside_attractions];
ALTER TABLE [roadside_attractions_new_d98d349ab698] RENAME TO [roadside_attractions];

Here's how to do the same thing using the Python library:

import sqlite_utils

db = sqlite_utils.Database("fixtures.db")
db["roadside_attractions"].transform(
  rename={"pk": "id"},
  defaults={"name": "Untitled"},
  drop={"address"}
)

Take a look at the documentation for the CLI tool and for the Python method for more details.

01:54 Post: sqlite-utils transform - command-line tool implementing the advanced ALTER TABLE pattern (artifact: 447f11d6fe user: simonw)

The SQLite ALTER TABLE documentation describes a pattern for running advanced alter table operations (renaming and dropping columns, changing column types etc). The recommendation is to create a new table with the desired shape, copy the data across from the old table, then drop the old table and rename the new one to replace it.

Today I added support for this pattern to my sqlite-utils package, which now offers this capability as both a command-line utility and a Python library method.

Here's an example of the command-line utility in action:

% wget https://latest.datasette.io/fixtures.db
% sqlite3 fixtures.db '.schema roadside_attractions'
CREATE TABLE roadside_attractions (
    pk integer primary key,
    name text,
    address text,
    latitude real,
    longitude real
);
% sqlite-utils transform fixtures.db roadside_attractions \
    --rename pk id \
    --default name Untitled \
    --drop address
% sqlite3 fixtures.db '.schema roadside_attractions'       
CREATE TABLE IF NOT EXISTS "roadside_attractions" (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT DEFAULT 'Untitled',
   [latitude] FLOAT,
   [longitude] FLOAT
);
You can use the --sql option to output the SQL that would be executed, which is useful for understanding how it works:
% sqlite-utils transform fixtures.db roadside_attractions \
    --rename pk id \
    --default name Untitled \
    --drop address \
    --sql
CREATE TABLE [roadside_attractions_new_d98d349ab698] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT DEFAULT 'Untitled',
   [latitude] FLOAT,
   [longitude] FLOAT
);
INSERT INTO [roadside_attractions_new_d98d349ab698] ([id], [name], [latitude], [longitude])
   SELECT [pk], [name], [latitude], [longitude] FROM [roadside_attractions];
DROP TABLE [roadside_attractions];
ALTER TABLE [roadside_attractions_new_d98d349ab698] RENAME TO [roadside_attractions];

Here's how to do the same thing using the Python library:

import sqlite_utils

db = sqlite_utils.Database("fixtures.db")
db["roadside_attractions"].transform(
  rename={"pk": "id"},
  defaults={"name": "Untitled"},
  drop={"address"}
)

Take a look at the documentation for the CLI tool and for the Python method for more details.

2020-09-18
23:19 Reply: Concurrent Multiple Write Transactions (artifact: 92a3c9808a user: simonw)

This feature is currently in a separate branch. There's a little bit of information about the branch and how to use it on this page: https://www3.sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md

20:58 Reply: Trigram indexes for SQLite (artifact: 69731fbe35 user: simonw)

Do you mind if I create a GitHub repo for this, crediting you as the author of the code and adding some documentation on how to build it? Is the same license as SQLite OK?

20:56 Reply: Trigram indexes for SQLite (artifact: c9134df64e user: simonw)

Whoa. I just got that working on my Mac - I had to copy ./sqlite/ext/fts5/fts5.h into the same directory as that ftstri.c file, but once I did that, the following:

gcc -I. -g -fPIC -shared ftstri.c -o ftstri.so

Did indeed produce a ftstri.so extension which I successfully loaded into SQLite using Python!

ftstri % python3
Python 3.8.5 (default, Jul 21 2020, 10:48:26) 
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> c = sqlite3.connect(":memory:")
>>> c
<sqlite3.Connection object at 0x107e137b0>
>>> c.enable_load_extension(True)
>>> c.load_extension("ftstri.so")
>>> c
<sqlite3.Connection object at 0x107e137b0>
>>> c.execute("CREATE VIRTUAL TABLE dict USING fts5(word, tokenize=tri);")
<sqlite3.Cursor object at 0x107e9f880>
>>> c.execute('INSERT INTO dict values ("simon")')
<sqlite3.Cursor object at 0x107e9f8f0>
>>> c.execute('INSERT INTO dict values ("cleo")')
<sqlite3.Cursor object at 0x107e9f880>
>>> c.execute('INSERT INTO dict values ("natalie")')
<sqlite3.Cursor object at 0x107e9f8f0>
>>> c.execute('select * from dict(?)', ['simon']).fetchall()
[('simon',)]
>>> c.execute('select * from dict(?)', ['sim']).fetchall()
[('simon',)]
>>> c.execute('select * from dict(?)', ['imo']).fetchall()
[('simon',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['sim']).fetchall()
[('(sim)on',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['simon']).fetchall()
[('(simon)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['mon']).fetchall()
[('si(mon)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cl']).fetchall()
[]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cleo']).fetchall()
[('(cleo)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cle']).fetchall()
[('(cle)o',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cleop']).fetchall()
[]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['nat']).fetchall()
[('(nat)alie',)]
2020-09-17
14:59 Reply: Trigram indexes for SQLite (artifact: ca5e2aa8dd user: simonw)

That's really clever, and seems like it should be pretty easy to implement. Thanks!

01:52 Reply: Trigram indexes for SQLite (artifact: 0909d9f483 user: simonw)

... though now I'm wondering if one could use a custom FTS tokenizer to create a trigram index.

01:49 Reply: Trigram indexes for SQLite (artifact: 36e0d67867 user: simonw)

Yes exactly - FTS5 is amazing but it doesn't solve internal substrings in the same way that trigrams do.

2020-09-16
22:07 Edit: Trigram indexes for SQLite (artifact: c230760fdf user: simonw)

One of my favourite little-known PostgreSQL features is trigram indexes. They let you speed up queries that need to match substrings within a text column - queries like this:

select * from articles where title like '%smil%'

There's a great explanation of how these work in PostgreSQL here: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/#trigram-indexes

I would love to have a version of this functionality in SQLite. There's an experimental module for that here, but it hasn't seen any work in seven years: https://github.com/jonasfj/trilite

Has anyone seen a SQLite-oriented solution for this?

22:06 Post: Trigram indexes for SQLite (artifact: 0387aa9b10 user: simonw)

One of my favourite little-known PostgreSQL features is trigram indexes. They let you speed up queries that need to match substrings within a text column - queries like this:

select * from articles where title like '%smil%'

There's a great explanation of how these work in PostgreSQL here: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/#trigram-indexes

I would love to have a version of this functionality in SQLite. There's an experimental module for that here, but it hasn't seen any work in seven years: https://github.com/jonasfj/trilite

Has anyone seen a SQLite-oriented solution for this?
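
For anyone unfamiliar with the trick: a trigram index stores every three-character window of each value, so a query like LIKE '%smil%' can be answered by intersecting the rows whose trigram sets contain 'smi' and 'mil'. A toy Python illustration of the decomposition (not an index implementation - PostgreSQL's pg_trgm also pads values with spaces):

def trigrams(text):
    # Every overlapping three-character window in the string
    return {text[i:i + 3] for i in range(len(text) - 2)}

print(trigrams("smile"))  # {'smi', 'mil', 'ile'}
# A substring's trigrams are always a subset of the full string's:
print(trigrams("smile") >= trigrams("smil"))  # True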

2020-09-08
23:41 Reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: d9ed879935 user: simonw)

That's a much better fix than the hack I had where I would delete spare rows from that table.

I just released sqlite-utils 2.18 which adds a sqlite-utils rebuild-fts data.db command-line command and a db["my_table"].rebuild_fts() Python API method: https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-18

17:08 Reply: sqlite-utils: a command-line tool and Python utility library (artifact: 76cc460c6c user: simonw)

That's by me too! I'm evolving sqlite-utils to replace it - csvs-to-sqlite works by loading the entire file into memory, which gets a bit slow with 100MB+ files. sqlite-utils streams the file into SQLite instead, but doesn't yet have as many useful options as csvs-to-sqlite in terms of things like extracting columns into separate tables.

2020-09-07
23:16 Reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: c6b2477889 user: simonw)

Thanks for your help Dan - it turned out to be the INSERT OR REPLACE issue you suggested. I shipped a new release of my software with a fix: https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v2-17

22:52 Post: sqlite-utils: a command-line tool and Python utility library (artifact: ff5b6198eb user: simonw)

People on this forum may find this project useful: https://sqlite-utils.readthedocs.io/

It's a combination command-line tool and Python library that I've been working on for the last two years to help me work with SQLite databases.

As a command-line tool

Once installed (pip install sqlite-utils if you have a working Python 3 environment) you can use the sqlite-utils command-line tool to perform all kinds of useful actions.

Here are some examples running against my example database of data from GitHub: https://github-to-sqlite.dogsheep.net/github.db

The sqlite-utils tables my.db command shows you the tables in the database. Add --counts to get their row counts:

/tmp % sqlite-utils tables github.db --counts
[{"table": "users", "count": 4286},
 {"table": "repos", "count": 253},
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 34},
 # ...

Default output is JSON, but you can use -t to get a readable table, or --csv for CSV or TSV:

/tmp % sqlite-utils tables github.db --counts --csv 
table,count
users,4286
repos,253
licenses,7

There are a ton of other command-line options listed in the documentation, but my favourite is the ability to pipe JSON into the tool and have it automatically create a SQLite table with the correct columns to match the incoming JSON. For example:

echo '[
    {
        "id": 1,
        "name": "Cleo",
        "age": 4
    },
    {
        "id": 2,
        "name": "Pancakes",
        "age": 2
    },
    {
        "id": 3,
        "name": "Toby",
        "age": 6
    }
]' | sqlite-utils insert dogs.db dogs - --pk=id

This creates a dogs.db database with the following schema:

/tmp % sqlite3 dogs.db .schema 
CREATE TABLE [dogs] (
    [id] INTEGER PRIMARY KEY,
    [name] TEXT,
    [age] INTEGER
);

A few other useful commands:

  • sqlite-utils enable-fts can configure FTS for a table against specified columns, with a --create-triggers option to create triggers needed to keep the FTS table up-to-date with changes (there's a Python sketch of this after the list).
  • sqlite-utils add-foreign-key can add a foreign key to an existing table. SQLite doesn't support this natively but the command achieves it through careful use of the writable schema feature.
  • sqlite-utils insert-files can read files from disk and write them into a SQLite table as BLOB columns. More about this in my blog entry Fun with binary data and SQLite.
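
Here's what that enable-fts bullet looks like via the Python API - a sketch against a hypothetical documents table (check the documentation for the exact options):

import sqlite_utils

db = sqlite_utils.Database("content.db")  # hypothetical database
db["documents"].insert({"title": "Hello", "body": "A first document"})
# Configure FTS on the named columns, plus triggers to keep it in sync
db["documents"].enable_fts(["title", "body"], create_triggers=True)
print(list(db["documents"].search("first")))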

As a Python library

sqlite-utils started life as a Python library - I added the CLI part later on. It was initially designed to be used in a Jupyter notebook environment for quickly building SQLite databases.

I've since used it to build a whole family of tools for ingesting data from different sources (such as the GitHub or Twitter APIs, or Apple HealthKit data from my Apple Watch) - most of those are in my https://github.com/dogsheep GitHub collection.

A quick example script:

from sqlite_utils import Database

db = Database("books.db")
db["authors"].insert_all([
    {"id": 1, "name": "Sally"},
    {"id": 2, "name": "Asheesh"}
], pk="id")
db["books"].insert_all([
    {"title": "Hedgehogs of the world", "author_id": 1},
    {"title": "How to train your wolf", "author_id": 2},
], foreign_keys=[
    ("author_id", "authors")
])

This creates two tables with the following schema, populated with that example data:

CREATE TABLE [authors] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);
CREATE TABLE [books] (
   [id] INTEGER PRIMARY KEY,
   [title] TEXT,
   [author_id] INTEGER REFERENCES [authors]([id])
);
The full Python API documentation is here: https://sqlite-utils.readthedocs.io/en/stable/python-api.html - it includes all kinds of methods for introspecting existing databases, modifying foreign keys, configuring full-text search and more.

I continue to actively develop this library, so I'm always keen to hear feature requests for things that could be useful additions to it.

20:48 Edit reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: 58364d06b8 user: simonw)

I've managed to replicate the issue!

(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 35},
 {"table": "licenses_fts_idx", "count": 16},
 {"table": "licenses_fts_docsize", "count": 9151},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
(github-to-sqlite) /tmp % github-to-sqlite repos github.db dogsheep       
(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 45},
 {"table": "licenses_fts_idx", "count": 26},
 {"table": "licenses_fts_docsize", "count": 9161},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
Note how the number of rows in licenses_fts_docsize goes from 9151 to 9161.

I dug into this and it turns out that the github-to-sqlite repos command does indeed run the following SQL 10 times:

INSERT OR REPLACE INTO [licenses] ([key], [name], [node_id], [spdx_id], [url]) VALUES 
   (?, ?, ?, ?, ?);

So I like your PRAGMA recursive_triggers theory. I'm going to give that a try.

20:18 Reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: 61ef49c3aa user: simonw)

I'm not sure if I should delete data from any of the other tables - https://github-to-sqlite.dogsheep.net/github/licenses_fts_data for example has 41 rows in it (while licenses_fts has only 7) but those rows don't seem to have IDs that correspond to the rowid column in licenses_fts.

20:00 Reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: 857cedf396 user: simonw)

I can confirm that using PRAGMA recursive_triggers=on; appears to fix this issue - running the script no longer adds ten new records to the licenses_fts_docsize table.

This leaves me with a smaller problem: I now have a whole bunch of tables with enormous amounts of obsolete data in their *_fts_docsize tables. I'll write myself a utility for cleaning those up, probably by running something like this:

delete from licenses_fts_docsize
where rowid not in (select rowid from licenses_fts)
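
In case it helps anyone else who hits this, here's the shape of the fix from Python - a sketch assuming a github.db file laid out like mine:

import sqlite3

conn = sqlite3.connect("github.db")
# Without this, the implicit DELETE performed by INSERT OR REPLACE
# doesn't fire the licenses_ad trigger, so stale rows accumulate in
# the FTS shadow tables
conn.execute("PRAGMA recursive_triggers=on;")
with conn:
    # One-off cleanup of the rows that already leaked
    conn.execute(
        "delete from licenses_fts_docsize "
        "where rowid not in (select rowid from licenses_fts)"
    )
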
19:53 Edit reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: d7f0ea4e66 user: simonw)

I've managed to replicate the issue!

(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 35},
 {"table": "licenses_fts_idx", "count": 16},
 {"table": "licenses_fts_docsize", "count": 9151},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
(github-to-sqlite) /tmp % github-to-sqlite repos github.db dogsheep       
(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 45},
 {"table": "licenses_fts_idx", "count": 26},
 {"table": "licenses_fts_docsize", "count": 9161},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
Note how the number of rows in licenses_fts_docsize goes from 9151 to 9161.

I dug into this and it turns out that the github-to-sqlite repos command does indeed run the following SQL 10 times:

INSERT OR REPLACE INTO [licenses] ([key], [name], [node_id], [spdx_id], [url]) VALUES 
   (?, ?, ?, ?, ?);

So I like your PRAGMA recursive_triggers theory. I'm going to give that a try.

19:53 Reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: 927889ee3b user: simonw)

I've managed to replicate the issue!

(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 35},
 {"table": "licenses_fts_idx", "count": 16},
 {"table": "licenses_fts_docsize", "count": 9151},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
(github-to-sqlite) /tmp % github-to-sqlite repos github.db dogsheep       
(github-to-sqlite) /tmp % sqlite-utils tables --counts github.db | grep licenses
 {"table": "licenses", "count": 7},
 {"table": "licenses_fts_data", "count": 45},
 {"table": "licenses_fts_idx", "count": 26},
 {"table": "licenses_fts_docsize", "count": 9161},
 {"table": "licenses_fts_config", "count": 1},
 {"table": "licenses_fts", "count": 7},
I dug into this and it turns out that the github-to-sqlite repos command does indeed run the following SQL 10 times:

INSERT OR REPLACE INTO [licenses] ([key], [name], [node_id], [spdx_id], [url]) VALUES 
   (?, ?, ?, ?, ?);

So I like your PRAGMA recursive_triggers theory. I'm going to give that a try.

19:11 Reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: 619b428c37 user: simonw)

Not as far as I can tell:

/tmp % sqlite-utils github.db 'select rowid, * from licenses' -t
  rowid  key           name                                            spdx_id       url                                           node_id
-------  ------------  ----------------------------------------------  ------------  --------------------------------------------  ----------------
     58  cc-by-4.0     Creative Commons Attribution 4.0 International  CC-BY-4.0     https://api.github.com/licenses/cc-by-4.0     MDc6TGljZW5zZTI1
    109  unlicense     The Unlicense                                   Unlicense     https://api.github.com/licenses/unlicense     MDc6TGljZW5zZTE1
    112  bsd-3-clause  BSD 3-Clause "New" or "Revised" License         BSD-3-Clause  https://api.github.com/licenses/bsd-3-clause  MDc6TGljZW5zZTU=
   6189  other         Other                                           NOASSERTION                                                 MDc6TGljZW5zZTA=
   8860  gpl-3.0       GNU General Public License v3.0                 GPL-3.0       https://api.github.com/licenses/gpl-3.0       MDc6TGljZW5zZTk=
   8932  mit           MIT License                                     MIT           https://api.github.com/licenses/mit           MDc6TGljZW5zZTEz
   9150  apache-2.0    Apache License 2.0                              Apache-2.0    https://api.github.com/licenses/apache-2.0    MDc6TGljZW5zZTI=
/tmp % sqlite-utils github.db 'vacuum'                          
[{"rows_affected": -1}]
/tmp % sqlite-utils github.db 'select rowid, * from licenses' -t
  rowid  key           name                                            spdx_id       url                                           node_id
-------  ------------  ----------------------------------------------  ------------  --------------------------------------------  ----------------
     58  cc-by-4.0     Creative Commons Attribution 4.0 International  CC-BY-4.0     https://api.github.com/licenses/cc-by-4.0     MDc6TGljZW5zZTI1
    109  unlicense     The Unlicense                                   Unlicense     https://api.github.com/licenses/unlicense     MDc6TGljZW5zZTE1
    112  bsd-3-clause  BSD 3-Clause "New" or "Revised" License         BSD-3-Clause  https://api.github.com/licenses/bsd-3-clause  MDc6TGljZW5zZTU=
   6189  other         Other                                           NOASSERTION                                                 MDc6TGljZW5zZTA=
   8860  gpl-3.0       GNU General Public License v3.0                 GPL-3.0       https://api.github.com/licenses/gpl-3.0       MDc6TGljZW5zZTk=
   8932  mit           MIT License                                     MIT           https://api.github.com/licenses/mit           MDc6TGljZW5zZTEz
   9150  apache-2.0    Apache License 2.0                              Apache-2.0    https://api.github.com/licenses/apache-2.0    MDc6TGljZW5zZTI=

I'm going to do some deeper digging into what happens when I run the FTS-related code (this is actually all happening when I run my https://github.com/dogsheep/github-to-sqlite script). I'll report back if I find anything useful.

18:22 Reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: c867fe740c user: simonw)

Another possibility: I noticed on https://www.sqlite.org/rowidtable.html this note:

If the rowid is not aliased by INTEGER PRIMARY KEY then it is not persistent and might change. In particular the VACUUM command will change rowids for tables that do not declare an INTEGER PRIMARY KEY. Therefore, applications should not normally access the rowid directly, but instead use an INTEGER PRIMARY KEY.

I ran VACUUM on that database quite often. Is it possible that running VACUUM is rewriting the rowid values for the licenses table but leaving the values in the licenses_fts table unchanged? If so, maybe I get a new set of orphaned duplicate rows every time I vacuum?
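
That note is easy to demonstrate - here's a quick sketch of VACUUM renumbering rowids on a table with no INTEGER PRIMARY KEY (made-up table; the exact renumbering is an implementation detail, so the outputs shown are what I'd expect rather than a guarantee):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (name)")  # rowid is NOT aliased here
conn.executemany("insert into t values (?)", [("a",), ("b",), ("c",)])
conn.execute("delete from t where name = 'a'")  # leave a gap at rowid 1
conn.commit()
print(conn.execute("select rowid, name from t").fetchall())
# [(2, 'b'), (3, 'c')]
conn.execute("VACUUM")
print(conn.execute("select rowid, name from t").fetchall())
# likely [(1, 'b'), (2, 'c')] - the rowids have been rewritten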

17:59 Reply: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: 51aada1b45 user: simonw)

Thanks - that's really useful. I didn't know about recursive_triggers.

I just noticed that the triggers are operating on rowid but the licenses table has a text primary key, so I'm thinking maybe that could be part of the problem:

CREATE TABLE [licenses] (
   [key] TEXT PRIMARY KEY,
   [name] TEXT,
   [spdx_id] TEXT,
   [url] TEXT,
   [node_id] TEXT
);
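
One more wrinkle with that combination: on a table with a TEXT primary key, INSERT OR REPLACE is effectively a delete-plus-insert, so a replaced row can come back with a brand new rowid - and rowid is exactly the value those triggers feed into the FTS table. A quick sketch:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table licenses ([key] TEXT PRIMARY KEY, [name] TEXT)")
conn.execute("insert into licenses values ('mit', 'MIT License')")
conn.execute("insert into licenses values ('apache-2.0', 'Apache License 2.0')")
print(conn.execute("select rowid from licenses where [key] = 'mit'").fetchone())
# (1,)
# The REPLACE deletes the old row and inserts a fresh one...
conn.execute("insert or replace into licenses values ('mit', 'MIT License')")
print(conn.execute("select rowid from licenses where [key] = 'mit'").fetchone())
# (3,) - same key, new rowid
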
16:52 Edit: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: e05a261524 user: simonw)

I'm seeing a weird issue with some of the SQLite databases that I am using with the FTS5 module.

I have a database with a licenses table that contains 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses

I created a licenses_fts table for this using the following SQL:

CREATE VIRTUAL TABLE [licenses_fts] USING FTS5 (
    [name],
    content=[licenses]
);

That table also has 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses_fts

Somehow the accompanying licenses_fts_docsize shadow table now has 9,141 rows in it! https://github-to-sqlite.dogsheep.net/github/licenses_fts_docsize

And licenses_fts_data has 41 rows - should I expect that to have 7 rows? https://github-to-sqlite.dogsheep.net/github/licenses_fts_data

Anyone got any ideas what's going on here? The full database (22MB) can be downloaded from https://github-to-sqlite.dogsheep.net/github.db

I have a hunch that it might be a problem with the triggers. These are the triggers that are updating that FTS table: https://github-to-sqlite.dogsheep.net/github?sql=select+*+from+sqlite_master+where+type+%3D+%27trigger%27+and+tbl_name+%3D+%27licenses%27

type name tbl_name rootpage sql
trigger licenses_ai licenses 0 CREATE TRIGGER [licenses_ai] AFTER INSERT ON [licenses] BEGIN INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END
trigger licenses_ad licenses 0 CREATE TRIGGER [licenses_ad] AFTER DELETE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); END
trigger licenses_au licenses 0 CREATE TRIGGER [licenses_au] AFTER UPDATE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END
16:47 Post: SQLite FTS5 table with 7 rows has _fts_docsize table with 9,141 rows (artifact: 74916cdf46 user: simonw)

I'm seeing a weird issue with some of the SQLite databases that I am using with the FTS5 module.

I have a database with a licenses table that contains 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses

I created a licenses_fts table for this using the following SQL:

CREATE VIRTUAL TABLE [licenses_fts] USING FTS5 (
    [name],
    content=[licenses]
);

That table also has 7 rows: https://github-to-sqlite.dogsheep.net/github/licenses_fts

Somehow the accompanying licenses_fts_docsize shadow table now has 9,141 rows in it! https://github-to-sqlite.dogsheep.net/github/licenses_fts_docsize

Anyone got any ideas what's going on here? The full database (22MB) can be downloaded from https://github-to-sqlite.dogsheep.net/github.db

I have a hunch that it might be a problem with the triggers. These are the triggers that are updating that FTS table: https://github-to-sqlite.dogsheep.net/github?sql=select+*+from+sqlite_master+where+type+%3D+%27trigger%27+and+tbl_name+%3D+%27licenses%27

type name tbl_name rootpage sql
trigger licenses_ai licenses 0 CREATE TRIGGER [licenses_ai] AFTER INSERT ON [licenses] BEGIN INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END
trigger licenses_ad licenses 0 CREATE TRIGGER [licenses_ad] AFTER DELETE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); END
trigger licenses_au licenses 0 CREATE TRIGGER [licenses_au] AFTER UPDATE ON [licenses] BEGIN INSERT INTO [licenses_fts] ([licenses_fts], rowid, [name]) VALUES('delete', old.rowid, old.[name]); INSERT INTO [licenses_fts] (rowid, [name]) VALUES (new.rowid, new.[name]); END
2020-08-28
06:23 Reply: Word search in FTS5 (artifact: 52e8fa533f user: simonw)

SQLite FTS supports trailing * wildcards (prefix queries), so running WHERE body MATCH 'st*' should do what you want.
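
A quick illustration in Python, with some made-up sample words (assuming your SQLite build includes FTS5):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany("INSERT INTO docs VALUES (?)", [("start",), ("stop",), ("best",)])
# The trailing * matches any token that begins with 'st'
print(conn.execute("SELECT body FROM docs WHERE body MATCH 'st*'").fetchall())
# [('start',), ('stop',)] - 'best' doesn't match, since FTS only sees token prefixes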

2020-08-12
05:44 Reply: Anyone know how to compile RTree and Geopoly as shared libraries on macOS? (artifact: 893ef9f02b user: simonw)

Thanks! You've saved me a lot of time. I've managed to compile them using the amalgamation so I guess I'll stick with that.

2020-08-11
14:52 Reply: Anyone know how to compile RTree and Geopoly as shared libraries on macOS? (artifact: 1e36f19eb7 user: simonw)

Unfortunately it doesn't. I read that and tried to figure out how to apply it to compiling the rtree and geopoly modules for macOS, but I couldn't work out how to do it - hence my question here! I'm hoping someone who's better at working with C and macOS can help me work out exactly what I need to do.

2020-08-10
03:24 Reply: Using WAL mode with multiple processes (artifact: 72e838bc3d user: simonw)

Thanks, that's exactly what I needed to know! I wrote about this here: Enabling WAL mode for SQLite database files

03:02 Edit reply: Using WAL mode with multiple processes (artifact: e6c238e854 user: simonw)

That's the bit that confused me - it sounded like it meant that it would change for connections that are currently open, but I wasn't sure if it would affect connections that open later on.

If WAL is actually a persistent property of the database file itself then I guess the answer is that any connection can "turn on" WAL for a database file and it will stay as a WAL database file for the rest of its life (unless another connection turns it off again).

03:01 Post: Anyone know how to compile RTree and Geopoly as shared libraries on macOS? (artifact: 283f030213 user: simonw)

I'd love to be able to easily use the RTree and Geopoly modules on systems that already have SQLite installed but didn't include those in their installation.

Ideally I want to build a .so module that I can load like this (in Python):

import sqlite3
conn = sqlite3.connect("places.db")
conn.enable_load_extension(True)
conn.load_extension("rtree.so")
conn.load_extension("geopoly.so")

Does anyone know the right incantations on macOS (and ideally on Linux too) to compile the source code for these modules into a library that I can load in this way?

02:30 Reply: Using WAL mode with multiple processes (artifact: 02f22fd538 user: simonw)

That's the bit that confused me - it sounded like it meant that it would change for connections that are currently open, but I wasn't sure if it would affect connections that open later on.

If WAL is actually a persistent property of the database file itself then I guess the answer is that any connection can "turn on" WAL for a database file and it will stay as a WAL database file for the rest of its life (unless another connection turns it off again).
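
That's easy to confirm - a quick sketch using a scratch file:

import sqlite3

# First connection switches the database file into WAL mode
conn1 = sqlite3.connect("/tmp/wal-demo.db")
conn1.execute("PRAGMA journal_mode=WAL;")
conn1.close()

# A brand new connection sees WAL already enabled, because the mode
# is recorded in the database file itself
conn2 = sqlite3.connect("/tmp/wal-demo.db")
print(conn2.execute("PRAGMA journal_mode;").fetchone())  # ('wal',)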

2020-08-09
20:59 Edit: Using WAL mode with multiple processes (artifact: c4dbf6ca17 user: simonw)

The documentation at https://www.sqlite.org/wal.html says that you can start WAL mode using:

PRAGMA journal_mode=WAL;

If I have multiple processes running against the same file - one process that is just reading it, another one (or more) that are writing to it, how should I use this pragma?

If I've run it once in a process, will the other processes automatically pick it up?

Do I need to run the pragma before the other processes open their connections to the database file?

Does it even matter which process first executes the PRAGMA?

If my read-only process executes it, will it affect any other processes that attempt to write to that file?

20:58 Post: Using WAL mode with multiple processes (artifact: c95e4286e5 user: simonw)

The documentation at https://www.sqlite.org/wal.html says that you can start WAL mode using:

PRAGMA journal_mode=WAL;

If I have multiple processes running against the same file - one process that is just reading it, another one (or more) that are writing to it, how should I use this pragma?

If I've run it once in a process, will the other processes automatically pick it up?

Do I need to run the pragma before the other processes open their connections to the database file?

Does it even matter which process first executes the PRAGMA?

If my read-only process executes it, will it affect any other processes that attempt to write to that file?

03:25 Reply: Potential bug? = comparison on columns defined with no type (artifact: 37519de98c user: simonw)

That makes sense, thanks for the explanation! I was confused because I'd been benefiting from type coercion on my other projects, since I was comparing values to a column with a defined type affinity.

2020-08-08
23:54 Reply: SQLite 3.33.0 beta-1 (artifact: 4424872e1c user: simonw)

The changelog mentions an increase in the maximum database size from 140TB to 281TB.

Out of interest: are there SQLite users out there who have run into the 140TB size limit?

23:36 Post: Potential bug? = comparison on columns defined with no type (artifact: 8a388072a1 user: simonw)

I ran into this while working with the parlgov-development.db (<- download link) database from http://www.parlgov.org/

If you create a table where the columns do not have types (which isn't something I would ever do, but I write software that works with databases created by other people) the following SQL queries return different results:

select * from no_column_types where country_id = 44;

Compared to:

select * from no_column_types where country_id = '44';

Here's a full demonstration session:

SQLite version 3.28.0 2019-04-15 14:49:49
sqlite> create table no_column_types (id, name, country_id);
sqlite> insert into no_column_types (id, name, country_id) values (1, 'Bob', 44);
sqlite> insert into no_column_types (id, name, country_id) values (2, 'Karl', 44);
sqlite> select * from no_column_types;
1|Bob|44
2|Karl|44
sqlite> select * from no_column_types where country_id = 44;
1|Bob|44
2|Karl|44
sqlite> select * from no_column_types where country_id = '44';
sqlite> 

Is this a SQLite bug, or is this intended behaviour?

Here's a full issue thread where I explore this bug: Figure out why an {eq: value} fails where where: "x = value" succeeds