SQLite Forum

50 most recent forum posts by user simonw

2021-08-07
19:42 Reply: Old Nabble SQLite mailing list site down? (artifact: 29f3f048be user: simonw)

I always find those kinds of conversations useful, provided they are clearly presented in context. The deeper my relationship with a piece of software gets the more I find myself running into questions that can best be answered by conversations that were had many years previously, against much older versions of the software.

I'd love to see the old mailing list content available online in an archive that carries a prominent warning along the lines of "this conversation from 2009 relates to a 12-year-old version of the software", rather than leaving that content entirely unavailable.

2021-07-20
00:00 Reply: FTS5 external content tables (artifact: b8927c6476 user: simonw)

For such a small collection of data my hunch is that it won't matter either way. I recommend implementing both patterns and running some micro-benchmarks to see for yourself though - that's always a useful exercise for this kind of performance question.
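Roughly the kind of micro-benchmark I mean - just a sketch, assuming a docs table with id and body columns and comparing a regular FTS5 table against an external content one (all of the names here are placeholders):

import sqlite3, timeit

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table docs (id integer primary key, body text);
create virtual table fts_regular using fts5(body);
create virtual table fts_external using fts5(body, content='docs', content_rowid='id');
""")
conn.executemany(
    "insert into docs (body) values (?)",
    [("the quick brown fox %d" % i,) for i in range(10000)],
)
# The regular table stores its own copy of the text; the external content
# table indexes it but reads the text back from docs when it needs it
conn.execute("insert into fts_regular (body) select body from docs")
conn.execute("insert into fts_external (rowid, body) select id, body from docs")

def search(table):
    return conn.execute(
        "select count(*) from %s where %s match ?" % (table, table), ("fox",)
    ).fetchone()

for table in ("fts_regular", "fts_external"):
    print(table, timeit.timeit(lambda: search(table), number=100))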

2021-07-19
23:56 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: efe43d3f6c user: simonw)

Thanks very much, this is really useful.

23:54 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 2eebabbbb1 user: simonw)

Do you have a blog post on how you've managed this?

Mainly it's just "use Cloud Run" - to use it you need to package up your (stateless) web application in a Docker container which Cloud Run will then start running (and charging for) when requests come in and switch back off when the traffic stops.

I made some notes on how to use it here: https://til.simonwillison.net/cloudrun/ship-dockerfile-to-cloud-run

I usually deploy to it with my datasette publish cloudrun tool though, which is described here: https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run

23:54 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 6c5ac6ef78 user: simonw)

Do you have a blog post on how you've managed this?

Mainly it's just "use Cloud Run" - to use it you need to package up your (stateless) web application in a Docker container which Cloud Run will then start running (and charging for) when requests come in and switch back off when the traffic stops.

I made some notes on how to use it here: https://til.simonwillison.net/cloudrun/ship-dockerfile-to-cloud-run

I usually deploy to it with my "datasette publish cloudrun" tool though, which is described here: https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run

02:32 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 76a9badc6c user: simonw)

Cloud Run stays "warm" in between requests, so in practice I see a few seconds delay the first time I visit one of my applications but then everything loads quickly (without the cold start) for subsequent requests.

This is a great fit for my stuff that works on top of SQLite, since I'm building interfaces for people to explore a database - so they're likely to continue running different queries against the same tables for a period of several minutes, ideally while those stay in RAM.

On "serverless" - completely agree, it's a terrible name. The definition I've chosen to embrace for it is "scale to zero", which means I don't have to spend any money on hosting my applications unless they are actively serving traffic. Since I have over 100 side-projects running at any given time this characteristic is extremely useful!

2021-07-18
03:20 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 55f54b7927 user: simonw)

I want arbitrary SQL SELECT queries against a ~600MB database file to take as short a time as possible (measured in ms) to execute. My hope is that more RAM = more data queried in memory rather than being read from disk = faster executing queries.

03:18 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 59afc95a3d user: simonw)

This is really useful information, thanks!

I'm confident I'm using immutable and ro correctly in this case - I use ro if there might be other processes updating the SQL database file (often cron scripts). I use immutable when I can absolutely guarantee nothing will update the database - often that's because I'm hosting it on Google Cloud Run which doesn't even provide a persistent writable file-system.

03:17 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 79061399b1 user: simonw)

This is really useful information, thanks!

I'm confident I'm using immutable and ro correctly in this case - I use ro if there might be other processes updating the SQL database file (often cron scripts). I use immutable when I can absolutely guarantee nothing will update the database - often that's because I'm hosting it on Google Cloud Run which doesn't even provide a writable file-system.
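For reference, this is roughly how I open those two modes from Python - a minimal sketch, with /path/to/data.db standing in for the real file:

import sqlite3

# mode=ro: read-only, but SQLite still does locking and change detection,
# so it stays safe even if another process (e.g. a cron script) writes to the file
ro_conn = sqlite3.connect("file:/path/to/data.db?mode=ro", uri=True)

# immutable=1: SQLite assumes the file can never change, so it skips locking
# and change detection entirely - only safe when nothing else can write to it
immutable_conn = sqlite3.connect("file:/path/to/data.db?immutable=1", uri=True)

print(ro_conn.execute("select count(*) from sqlite_master").fetchone())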

03:15 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 4393a42b3b user: simonw)

I want arbitrary SQL queries against a ~600MB database file to take as short a time as possible (measured in ms) to execute. My hope is that more RAM = more data queried in memory rather than being read from disk = faster executing queries.

2021-07-17
17:44 Reply: Should SQLite be able to optimize this query? (artifact: ff57c45af9 user: simonw)

This is fantastic. Really like that set of rules for when this applies. Thank you for this!

17:35 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 618a2de207 user: simonw)

Thanks, I'll take a look at mmap too: https://www.sqlite.org/pragma.html#pragma_mmap_size and https://www.sqlite.org/mmap.html

Interestingly the Python sqlite3 library (which is written in C) releases the GIL when executing queries, which makes me hopeful that SQLite in Python can take advantage of multiple cores even within a single Python process. I haven't figured out a good way to confirm that this is happening yet though!

17:33 Edit reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 64ba53e03c user: simonw)

Thanks, I'll take a look at mmap too: https://www.sqlite.org/pragma.html#pragma_mmap_size

Interestingly the Python sqlite3 library (which is written in C) releases the GIL when executing queries, which makes me hopeful that SQLite in Python can take advantage of multiple cores even within a single Python process. I haven't figured out a good way to confirm that this is happening yet though!

17:33 Reply: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 2a1e5cc6a7 user: simonw)

Thanks, I'll take a look at mmap too: https://www.sqlite.org/pragma.html#pragma_mmap_size

Interestingly the Python sqlite3 library (which is written in C) releases the GIL when executing queries, which makes me hopeful that SQLite in Python can take advantage of multiple cores even within a single Python process. I haven't figured out a good way to confirm that this is happening yet though!
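One way I'm considering to confirm it - a rough, untested sketch that assumes a test.db file with a table big enough that the query takes a measurable amount of time (both names are placeholders):

import sqlite3, threading, time

DB = "test.db"  # placeholder database file
QUERY = "select count(*) from some_big_table"  # placeholder query that does real work

def worker():
    # Each thread needs its own connection
    conn = sqlite3.connect(DB)
    conn.execute(QUERY).fetchall()

def timed(n_threads):
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# If the GIL really is released while the query runs, 4 threads on a 4+ core
# machine should take roughly as long as 1 thread, not ~4x as long
print("1 thread:", timed(1))
print("4 threads:", timed(4))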

2021-07-16
18:41 Edit: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 9a6b9ae8e2 user: simonw)

I'm interested in getting better performance out of SQLite on machines with more RAM, using the Python sqlite3 module from the standard library.

(Tips for more CPUs would be useful too - I'm interested in understanding what the benefits of paying for larger EC2 instances are, in addition to maximizing performance against currently available memory.)

I've seen PRAGMA cache_size in the documentation - is this the main lever I should be looking at or are there other options that might help here?

My current plan is to set up a simple load test that exercises my Python + SQLite code across a number of different queries, then modify the cache_size and see what impact that has on requests/second and request durations.

I'd thoroughly appreciate any advice on ways to scale SQLite up like this.

Most of my workloads are read-only (I open the SQLite file with the ?mode=ro or ?immutable=1 flags) so I'm mainly interested in improving performance of SELECTs against read-only databases.
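The shape of the benchmark I have in mind - just a sketch, with mydb.db and the query list as placeholders that I'd swap out for a representative workload:

import sqlite3, time

QUERIES = [
    "select count(*) from sqlite_master",  # placeholder - use real queries here
]

def benchmark(cache_kib, repeats=100):
    conn = sqlite3.connect("file:mydb.db?mode=ro", uri=True)
    # A negative cache_size value is interpreted as a size in KiB rather than pages
    conn.execute("PRAGMA cache_size = -%d" % cache_kib)
    start = time.perf_counter()
    for _ in range(repeats):
        for sql in QUERIES:
            conn.execute(sql).fetchall()
    return time.perf_counter() - start

for cache_kib in (2000, 64000, 256000, 1024000):
    print("cache_size -%d: %.3fs" % (cache_kib, benchmark(cache_kib)))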

18:40 Edit: Tips for getting SQLite under Python to perform better with more RAM? (artifact: f6bdd6b503 user: simonw)

I'm interested in getting better performance out of SQLite on machines with more RAM, using the Python sqlite3 module from the standard library.

I've seen PRAGMA cache_size in the documentation - is this the main lever I should be looking at or are there other options that might help here?

My current plan is to set up a simple load test that exercises my Python + SQLite code across a number of different queries, then modify the cache_size and see what impact that has on requests/second and request durations.

I'd thoroughly appreciate any advice on ways to scale SQLite up like this.

Most of my workloads are read-only (I open the SQLite file with the ?mode=ro or ?immutable=1 flags) so I'm mainly interested in improving performance of SELECTs against read-only databases.

18:26 Reply: Should SQLite be able to optimize this query? (artifact: faf982aced user: simonw)

As a really safe alternative, how about if the ignore-order-by only took place for a specific list of aggregations (or even just count)?

18:23 Post: Tips for getting SQLite under Python to perform better with more RAM? (artifact: 9e15765874 user: simonw)

I'm interested in getting better performance out of SQLite on machines with more RAM, using the Python sqlite3 module from the standard library.

I've seen PRAGMA cache_size in the documentation - is this the main lever I should be looking at or are there other options that might help here?

My current plan is to set up a simple load test that exercises my Python + SQLite code across a number of different queries, then modify the cache_size and see what impact that has on requests/second and request durations.

I'd thoroughly appreciate any advice on ways to scale SQLite up like this.

01:50 Reply: Should SQLite be able to optimize this query? (artifact: 6ecf3cab23 user: simonw)

I would definitely expect the order by on the view there to be respected, whether or not that's in the SQL standard.

2021-07-15
19:45 Reply: Should SQLite be able to optimize this query? (artifact: 8ada395a25 user: simonw)

Breaking that group_concat() example certainly looks like an unacceptable change - I wouldn't be at all surprised to see real-world code that relies on that.

In my particular case queries that I was running extremely frequently (often several times for a single HTTP page load) got nearly a 10x speed-up - but I've now changed my code to remove the inner order by (see https://github.com/simonw/datasette/issues/1394) so my project would no longer benefit from this optimization.

17:52 Reply: Should SQLite be able to optimize this query? (artifact: 0212439555 user: simonw)

I'm glad to have posed something that turns out to be an interesting problem!

16:40 Reply: Should SQLite be able to optimize this query? (artifact: 5d14ed56db user: simonw)
explain query plan select country_long, count(*)
from (
  select * from [global-power-plants] order by rowid
)
where country_long is not null
group by country_long
order by count(*) desc

Output:

id,parent,notused,detail
2,0,0,CO-ROUTINE 1
5,2,0,SCAN TABLE global-power-plants
52,0,0,SCAN SUBQUERY 1
57,0,0,USE TEMP B-TREE FOR GROUP BY
92,0,0,USE TEMP B-TREE FOR ORDER BY

And for the query without the order by rowid:

explain query plan select country_long, count(*)
from (
  select * from [global-power-plants]
)
where country_long is not null
group by country_long
order by count(*) desc

Output:

id,parent,notused,detail
8,0,0,"SCAN TABLE global-power-plants USING INDEX ""global-power-plants_country_long"""
40,0,0,USE TEMP B-TREE FOR ORDER BY

16:32 Reply: Should SQLite be able to optimize this query? (artifact: 466c8fec3b user: simonw)

It doesn't look like that speeds things up - I would expect SQLite to spot that optimization already.

16:31 Reply: Should SQLite be able to optimize this query? (artifact: 91c56240d2 user: simonw)
CREATE TABLE "global-power-plants" (
"country" TEXT,
  "country_long" TEXT,
  "name" TEXT,
  "gppd_idnr" TEXT,
  "capacity_mw" REAL,
  "latitude" REAL,
  "longitude" REAL,
  "primary_fuel" TEXT,
  "other_fuel1" TEXT,
  "other_fuel2" TEXT,
  "other_fuel3" TEXT,
  "commissioning_year" REAL,
  "owner" TEXT,
  "source" TEXT,
  "url" TEXT,
  "geolocation_source" TEXT,
  "wepp_id" TEXT,
  "year_of_capacity_data" INTEGER,
  "generation_gwh_2013" REAL,
  "generation_gwh_2014" REAL,
  "generation_gwh_2015" REAL,
  "generation_gwh_2016" REAL,
  "generation_gwh_2017" REAL,
  "generation_data_source" TEXT,
  "estimated_generation_gwh" REAL
);
CREATE INDEX ["global-power-plants_country_long"] ON [global-power-plants]("country_long");
CREATE INDEX ["global-power-plants_owner"] ON [global-power-plants]("owner");

You can download the full database from https://global-power-plants.datasettes.com/global-power-plants.db (11.1 MB)

16:07 Post: Should SQLite be able to optimize this query? (artifact: 2d76f2bcf6 user: simonw)

I spotted something interesting about this query. The following executes in about 47ms (against a table containing 33,000 rows):

select country_long, count(*)
from (
  select * from [global-power-plants] order by rowid
)
where country_long is not null
group by country_long
order by count(*) desc

Note that there's an order by rowid in that inner query which is unnecessary - the outer group by query doesn't need to care about the order of the rows in that inner query.

That same query again, without the unnecessary order by rowid:

select country_long, count(*)
from (
  select * from [global-power-plants]
)
where country_long is not null
group by country_long
order by count(*) desc

This executes in 7ms against the same table, returning the same results.

Would it be possible for SQLite to notice that the inner order by is unnecessary and ignore it when executing this kind of query?

(The reason I'm using a nested query here is that my software is designed to be able to calculate "facet counts" against the results of any arbitrary SQL query, which this pattern lets me do.)

2021-04-11
15:05 Reply: deploying 1_000_000 sqlitedb on AWS? (artifact: e529ede831 user: simonw)

Have you investigated EFS at all for this?

2021-04-05
03:39 Edit reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: e20f3d0861 user: simonw)

I have enough of a proof of concept implementation in https://github.com/simonw/datasette/issues/1293#issuecomment-813162622 that I'm pretty confident I can get this to work. I'm ready to take my chances with regards to stability - this feature is a nice-to-have, so if future changes break it I can work around that.

Thanks for the help!

03:39 Reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 5b6f7e7ee5 user: simonw)

I have enough of a proof of concept implementation in https://github.com/simonw/datasette/issues/1293#issuecomment-813162622 that I'm pretty confident I can get this to work. I'm ready to take my chances with regards to stability - this feature is a nice-to-have, so if future changes break it I can work around that.

Thanks for the help!

03:38 Reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 04033829a0 user: simonw)

For those cases I'm fine with it not being able to identify the column that is being used in the expression. I only want to be able to detect columns that are used as-is, without being part of a formula.

01:17 Reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 2258454563 user: simonw)

That's really useful, thanks! It looks like it might be possible for me to reconstruct where each column came from using the explain select output.

Here's a complex example: https://calands.datasettes.com/calands?sql=explain+select%0D%0A++AsGeoJSON%28geometry%29%2C+*%0D%0Afrom%0D%0A++CPAD_2020a_SuperUnits%0D%0Awhere%0D%0A++PARK_NAME+like+%27%25mini%25%27+and%0D%0A++Intersects%28GeomFromGeoJSON%28%3Afreedraw%29%2C+geometry%29+%3D+1%0D%0A++and+CPAD_2020a_SuperUnits.rowid+in+%28%0D%0A++++select%0D%0A++++++rowid%0D%0A++++from%0D%0A++++++SpatialIndex%0D%0A++++where%0D%0A++++++f_table_name+%3D+%27CPAD_2020a_SuperUnits%27%0D%0A++++++and+search_frame+%3D+GeomFromGeoJSON%28%3Afreedraw%29%0D%0A++%29&freedraw=%7B%22type%22%3A%22MultiPolygon%22%2C%22coordinates%22%3A%5B%5B%5B%5B-122.42202758789064%2C37.82280243352759%5D%2C%5B-122.39868164062501%2C37.823887203271454%5D%2C%5B-122.38220214843751%2C37.81846319511331%5D%2C%5B-122.35061645507814%2C37.77071473849611%5D%2C%5B-122.34924316406251%2C37.74465712069939%5D%2C%5B-122.37258911132814%2C37.703380457832374%5D%2C%5B-122.39044189453125%2C37.690340943717715%5D%2C%5B-122.41241455078126%2C37.680559803205135%5D%2C%5B-122.44262695312501%2C37.67295135774715%5D%2C%5B-122.47283935546876%2C37.67295135774715%5D%2C%5B-122.52502441406251%2C37.68382032669382%5D%2C%5B-122.53463745117189%2C37.6892542140253%5D%2C%5B-122.54699707031251%2C37.690340943717715%5D%2C%5B-122.55798339843751%2C37.72945260537781%5D%2C%5B-122.54287719726564%2C37.77831314799672%5D%2C%5B-122.49893188476564%2C37.81303878836991%5D%2C%5B-122.46185302734376%2C37.82822612280363%5D%2C%5B-122.42889404296876%2C37.82822612280363%5D%2C%5B-122.42202758789064%2C37.82280243352759%5D%5D%5D%5D%7D

It looks like the opcodes I need to inspect are OpenRead, Column and ResultRow.

OpenRead tells me which tables are being opened - the p2 value (in this case 51) corresponds to the rootpage column in sqlite_master here: https://calands.datasettes.com/calands?sql=select+*+from+sqlite_master - and the table gets opened as the cursor numbered in p1.

The Column opcodes tell me which columns are being read - p1 is the cursor for that table, and p2 is the cid of the column within that table.

The ResultRow opcode then tells me which values end up in the results - 15 16 means the result row is the 16 registers starting at register 15, which I can trace back to the Column opcodes that populated them.

I think this might work!
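A quick sketch of pulling those opcodes out from Python - here against a tiny throwaway table rather than the real database, and assuming the standard explain output columns (addr, opcode, p1, p2, p3, p4, p5, comment):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table plants (id integer primary key, name text, owner text)")

sql = "select name, owner from plants where owner is not null"
rows = conn.execute("explain " + sql).fetchall()

table_rootpage_by_cursor = {}
columns_read = []
result_registers = None

for addr, opcode, p1, p2, p3, p4, p5, comment in rows:
    if opcode == "OpenRead":
        # p1 = cursor number, p2 = rootpage of the table being opened
        table_rootpage_by_cursor[p1] = p2
    elif opcode == "Column":
        # p1 = cursor, p2 = column index within that table
        columns_read.append((table_rootpage_by_cursor.get(p1), p2))
    elif opcode == "ResultRow":
        # p1 = first register of the result row, p2 = number of registers
        result_registers = (p1, p2)

print(table_rootpage_by_cursor, columns_read, result_registers)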

2021-04-04
23:40 Edit: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 81976ebc35 user: simonw)

Given any SQL query - such as:

select a.*, b.* from a join b on a.b_id = b.id

I would like to know the columns that would be returned by the query. Not just their names, but also which table those columns live in.

For the above example this might be:

a.id
a.name
a.b_id
b.id
b.name

But you could imagine this getting significantly more complex for joins across many different tables.

I want this so my software, Datasette, can display columns from specific tables in different ways, even when those columns are returned as part of an arbitrary query.

My biggest constraint is that I am using the sqlite3 module from the Python standard library. It looks like this family of C functions: https://www.sqlite.org/c3ref/column_database_name.html can solve my problem, but those are not exposed in the Python library I am using.
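To illustrate the gap, cursor.description in the standard library module only gives me the output column names - the other six items of each 7-tuple are always None, so there's no way to tell which table each column came from (a quick sketch):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table a (id integer primary key, name text, b_id integer);
create table b (id integer primary key, name text);
""")

cursor = conn.execute("select a.*, b.* from a join b on a.b_id = b.id")
for column in cursor.description:
    print(column)
# ('id', None, None, None, None, None, None)
# ('name', None, None, None, None, None, None)
# ('b_id', None, None, None, None, None, None)
# ('id', None, None, None, None, None, None)
# ('name', None, None, None, None, None, None)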

I wrote a bit more about my research into this here: https://github.com/simonw/datasette/issues/1293

So three questions:

  1. Is there a mechanism for this that I might be missing?

  2. Would it be feasible to achieve this using a custom SQLite extension? One that exposes the column_database_name() C API functions in a way that I can use them to answer my question? I was thinking something like this:

    select database, table, column from analyze_query(
        'select foo, bar from baz'
    )
    
  3. Feature request: could SQLite itself add some kind of SQL-level mechanism for accessing this information, such that even if the Python sqlite3 module doesn't add support, this information would become available via a PRAGMA or similar?

23:37 Reply: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 061b8bf9f9 user: simonw)

Here's a shorter, better way to describe the problem I want to solve.

Given an arbitrary SQL query, I would like to know:

  1. The name of the columns that will be returned if I execute the query
  2. For each of those returned columns, did it come from a specific column in a specific table? If so, which one?
  3. If a column was generated by an expression (a group_concat() for example) I'm OK not knowing which columns from which table were involved in that expression

Ideally I'd like to do this without executing the query (or while executing the query with a limit 0) but I'd be OK with a solution that did execute the query too.

23:19 Post: For a given SQL query, mechanisms for figuring out which physical columns would be returned? (artifact: 0180277fb7 user: simonw)

Given any SQL query - such as:

select a.*, b.* from a join b on a.b_id = b.id

I would like to know the columns that would be returned by the query. Not just their names, but also which table those columns live in.

For the above example this might be:

a.id
a.name
a.b_id
b.id
b.name

But you could imagine this getting significantly more complex for joins across many different tables.

I want this so my software, Datasette, can display columns from specific tables in different ways, even when those columns are returned as part of an arbitrary query.

My biggest constraint is that I am using the sqlite3 module from the Python standard library. It looks like this family of C functions: https://www.sqlite.org/c3ref/column_database_name.html can solve my problem, but those are not exposed in the Python library I am using.

I wrote a bit more about my research into this here: https://github.com/simonw/datasette/issues/1293

So three questions:

  1. Is there a mechanism for this that I might be missing?

  2. Would it be feasible to achieve this using a custom SQLite extension? One that exposes the column_database_name() C API functions in a way that I can use them to answer my question? I was thinking something like this:

    select database, table, column from analyze_query(:sql_query)
    
  3. Feature request: could SQLite itself add some kind of SQL-level mechanism for accessing this information, such that even if the Python sqlite3 module doesn't add support, this information would become available via a PRAGMA or similar?

2021-03-12
23:52 Reply: Inserting the image as BLOB (artifact: b396eeccd2 user: simonw)

My sqlite-utils command-line tool has a command for doing this:

https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-binary-data-from-files

sqlite-utils insert-files gifs.db images horse.png

I wrote more about that here: https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/

23:26 Edit: "SQL Features That SQLite Does Not Implement" needs updating (artifact: 6690bfb7d3 user: simonw)

www.sqlite.org/omitted.html still says "Other kinds of ALTER TABLE operations such as DROP COLUMN, ALTER COLUMN, ADD CONSTRAINT, and so forth are omitted." - that's no longer accurate as of SQLite 3.35.

23:26 Post: "SQL Features That SQLite Does Not Implement" needs updating (artifact: d6dff461f9 user: simonw)

https://www.sqlite.org/omitted.html still says "Other kinds of ALTER TABLE operations such as DROP COLUMN, ALTER COLUMN, ADD CONSTRAINT, and so forth are omitted." - that's no longer accurate as of SQLite 3.35.
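For example, this now works against a library built with SQLite 3.35.0 or later (a quick sketch - it raises an OperationalError on older versions):

import sqlite3

print(sqlite3.sqlite_version)  # needs to be 3.35.0 or higher for DROP COLUMN

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, name text, obsolete text)")
conn.execute("alter table t drop column obsolete")
print([c[0] for c in conn.execute("select * from t").description])
# ['id', 'name'] - the obsolete column is gone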

2021-02-19
22:30 Edit: SQLite database in space? (artifact: d623ddc90e user: simonw)

Just a fun question that popped into my head today: does anyone know if there are any SQLite databases running in space?

On the ISS, in satellites or even somewhere further afield?

Apparently NASA have shipped PostgreSQL into low Earth orbit (according to a tip on Twitter).

22:29 Post: SQLite database in space? (artifact: b61c0395ba user: simonw)

Just a fun question that popped into my head today: does anyone know if there are any SQLite databases running in space?

On the ISS, in satellites or even somewhere further afield?

Apparently NASA have shipped PostgreSQL into low Earth orbit (according to a tip on Twitter).

2020-11-17
17:12 Reply: FYI: Personal Data Warehouses: Reclaiming Your Data (artifact: eb151f3a30 user: simonw)

Yeah I see Perkeep as being a little bit more aimed at the personal archiving problem, especially around larger files such as photos and videos. It feels a bit like an open source equivalent of Dropbox in that regard.

Dogsheep is much more aimed at smaller, highly structured data which you want to run arbitrary queries against, hence the focus on SQLite.

I think the two projects could complement each other - having Perkeep archive my personal SQLite database files seems particularly appropriate for example.

2020-11-08
18:14 Edit: sqlite-utils 3.0: CLI tool can now execute FTS searches against a .db file (artifact: 3712b1f43d user: simonw)

I just released version 3.0 of my sqlite-utils combination command-line tool and Python library for manipulating SQLite databases.

https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v3-0

The big new feature in this release is the new sqlite-utils search command, which can be used on the command-line to run FTS searches against correctly configured tables and return results ordered by relevance.

Here's a demo, using a CSV file of film locations in San Francisco from https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am

First, to create a database from that CSV file:

curl 'https://data.sfgov.org/api/views/yitu-d5am/rows.csv?accessType=DOWNLOAD' \
  | sqlite-utils insert sf.db locations - --csv

This creates a sf.db SQLite database containing a single locations table.

Next, enable full-text search on that table (for the Title and Locations columns):

sqlite-utils enable-fts sf.db locations 'Title' 'Locations'

Now we can run searches!

sqlite-utils search sf.db locations 'cable'                                          
[{"rowid": 1215, "Title": "Play it Again, Sam", "Release Year": "1972", "Locations": "Hyde Street Cable Car", "Fun Facts": "", "Production Company": "Paramount Pictures", "Distributor": "Paramount Pictures", "Director": "Herbert Ross", "Writer": "Woody Allen", "Actor 1": "Woody Allen", "Actor 2": "Diane Keaton", "Actor 3": "Tony Roberts"},
 {"rowid": 2288, "Title": "Always Be My Maybe", "Release Year": "2019", "Locations": "California Cable Car Line", "Fun Facts": "", "Production Company": "Isla Productions, LLC", "Distributor": "Netflix", "Director": "Nahnatchka Khan", "Writer": "Michael Golamco, Randall Park, Ali Wong", "Actor 1": "Ali Wong", "Actor 2": "Randall Park", "Actor 3": "Keanu Reeves"}

Search results are output as JSON by default, but we can request a more readable table format using the -t option. We can also specify the columns we want to see using -c.

sqlite-utils search sf.db locations 'cable' -t -c Title -c Locations
Title                          Locations
-----------------------------  ---------------------------------------------------------
Play it Again, Sam             Hyde Street Cable Car
Always Be My Maybe             California Cable Car Line
Play it Again, Sam             Hyde Street Cable Car
Attack of the Killer Tomatoes  Hyde Street Cable Car
Attack of the Killer Tomatoes  Hyde Street Cable Car
The Bachelor                   Roof of Cable Car (California at Front Street)
The Bachelor                   Roof of Cable Car (California at Front Street)
Woman on the Run               Cable Car Signal Box (California Street at Powell Street)
Woman on the Run               Cable Car Signal Box (California Street at Powell Street)

Full documentation for the new sqlite-utils search command can be found here: https://sqlite-utils.readthedocs.io/en/stable/cli.html#cli-search

One detail I'm particularly proud of: it works with both FTS5 and FTS4 tables, including sort by relevance! This was tricky because FTS5 has relevance sort built-in, but FTS4 requires you to provide your own scoring function. I'm using a scoring function I wrote in Python a couple of years ago which I described on my blog in Exploring search relevance algorithms with SQLite.

17:50 Post: sqlite-utils 3.0: CLI tool can now execute FTS searches against a .db file (artifact: 8467b73d34 user: simonw)

I just released version 3.0 of my sqlite-utils combination command-line tool and Python library for manipulating SQLite databases.

https://sqlite-utils.readthedocs.io/en/stable/changelog.html#v3-0

The big new feature in this release is the new sqlite-utils search command, which can be used on the command-line to run FTS searches against correctly configured tables and return results ordered by relevance.

Here's a demo, using a CSV file of film locations in San Francisco from https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am

First, to create a database from that CSV file:

curl 'https://data.sfgov.org/api/views/yitu-d5am/rows.csv?accessType=DOWNLOAD' \
  | sqlite-utils insert sf.db locations - --csv

This creates a sf.db SQLite database containing a single locations table.

Next, enable full-text search on that table (for the Title and Locations columns):

sqlite-utils enable-fts sf.db locations 'Title' 'Locations'

Now we can run searches!

sqlite-utils search sf.db locations 'cable'                                          
[{"rowid": 1215, "Title": "Play it Again, Sam", "Release Year": "1972", "Locations": "Hyde Street Cable Car", "Fun Facts": "", "Production Company": "Paramount Pictures", "Distributor": "Paramount Pictures", "Director": "Herbert Ross", "Writer": "Woody Allen", "Actor 1": "Woody Allen", "Actor 2": "Diane Keaton", "Actor 3": "Tony Roberts"},
 {"rowid": 2288, "Title": "Always Be My Maybe", "Release Year": "2019", "Locations": "California Cable Car Line", "Fun Facts": "", "Production Company": "Isla Productions, LLC", "Distributor": "Netflix", "Director": "Nahnatchka Khan", "Writer": "Michael Golamco, Randall Park, Ali Wong", "Actor 1": "Ali Wong", "Actor 2": "Randall Park", "Actor 3": "Keanu Reeves"}

Search results are output as JSON by default, but we can request a more readable table format using the -t option. We can also specify the columns we want to see using -c.

sqlite-utils search sf.db locations 'cable' -t -c Title -c Locations
Title                          Locations
-----------------------------  ---------------------------------------------------------
Play it Again, Sam             Hyde Street Cable Car
Always Be My Maybe             California Cable Car Line
Play it Again, Sam             Hyde Street Cable Car
Attack of the Killer Tomatoes  Hyde Street Cable Car
Attack of the Killer Tomatoes  Hyde Street Cable Car
The Bachelor                   Roof of Cable Car (California at Front Street)
The Bachelor                   Roof of Cable Car (California at Front Street)
Woman on the Run               Cable Car Signal Box (California Street at Powell Street)
Woman on the Run               Cable Car Signal Box (California Street at Powell Street)

Full documentation for the new sqlite-utils search command can be found here: https://sqlite-utils.readthedocs.io/en/stable/cli.html#cli-search

One detail I'm particularly proud of: it works with both FTS5 and FTS4 tables, including sort by relevance! This was tricky because FTS5 has relevance sort built-in, but FTS4 requires you to provide your own scoring function. I'm using a scoring function I wrote in Python a couple of years ago which I described on my blog in Exploring search relevance algorithms with SQLite.

2020-10-12
22:17 Reply: Most efficient way to tell if any database content has changed (artifact: 6cbd42a6bf user: simonw)

Not sure how I missed that, that looks perfect! Thanks very much.

18:55 Edit: Most efficient way to tell if any database content has changed (artifact: 34b89f2ff1 user: simonw)

I'm interested in answering the question "has the contents of this database changed in any way" as efficiently as possible.

I want to do this for caching purposes - I'd like to be able to cache various things relating to an infrequently changed database but invalidate that cache if even a single row has been inserted or updated.

I've successfully used PRAGMA schema_version to invalidate a cache when the database schema changes, but now I'm looking to do it based on the file contents instead.

Options I'm considering:

  • Calculating the MD5 hash of the file contents on disk, which runs pretty fast (but not fast enough to do it often)
  • Adding triggers to every single database table which increment a counter in a dedicated content_version table any time those tables have content inserted/updated/deleted

I'm pretty sure that second option would work correctly (I'm not at all worried about the additional overhead on writes), but I'm wondering if there's a solution I haven't considered that would be even better.
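Roughly what I have in mind for that second option - a sketch against a single example table, where docs, content_version and the trigger names are all placeholders and I'd generate equivalent triggers for every table in the database:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table docs (id integer primary key, body text);

create table content_version (id integer primary key check (id = 1), version integer);
insert into content_version (id, version) values (1, 0);

create trigger docs_insert_version after insert on docs
begin
  update content_version set version = version + 1 where id = 1;
end;
create trigger docs_update_version after update on docs
begin
  update content_version set version = version + 1 where id = 1;
end;
create trigger docs_delete_version after delete on docs
begin
  update content_version set version = version + 1 where id = 1;
end;
""")

conn.execute("insert into docs (body) values ('hello')")
conn.execute("update docs set body = 'world'")
print(conn.execute("select version from content_version").fetchone())  # (2,)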

18:55 Edit: Most efficient way to tell if any database content has changed (artifact: e879702045 user: simonw)

I'm interested in answering the question "has the contents of this database changed in any way" as efficiently as possible.

I want to do this for caching purposes - I'd like to be able to cache various things relating to an infrequently changed database but invalidate that cache if even a single row has been inserted or updated.

I've successfully used PRAGMA schema_version to invalidate a cache when the database schema changes, but now I'm looking to do it based on the file contents instead.

Options I'm considering:

  • Calculating the MD5 hash of the file contents on disk, which runs pretty fast (but not fast enough to do it often)
  • Adding triggers to every single database table which increment a counter any time those tables have content inserted/updated/deleted

I'm pretty sure that second option would work correctly (I'm not at all worried about the additional overhead on writes), but I'm wondering if there's a solution I haven't considered that would be even better.

18:54 Post: Most efficient way to tell if any database content has changed (artifact: 337bf94e26 user: simonw)

I'm interested in answering the question "has the contents of this database file changed in any way" as efficiently as possible.

I want to do this for caching purposes - I'd like to be able to cache various things relating to an infrequently changed database but invalidate that cache if even a single row has been inserted or updated.

I've successfully used PRAGMA schema_version to invalidate a cache when the database schema changes, but now I'm looking to do it based on the file contents instead.

Options I'm considering:

  • Calculating the MD5 hash of the file contents on disk, which runs pretty fast (but not fast enough to do it often)
  • Adding triggers to every single database table which increment a counter any time those tables have content inserted/updated/deleted

I'm pretty sure that second option would work correctly (I'm not at all worried about the additional overhead on writes), but I'm wondering if there's a solution I haven't considered that would be even better.

2020-10-08
03:06 Reply: Insert ("slurp" ?) JSON documents (artifact: 3733efdcb6 user: simonw)

I did not know about that readfile function - this is a really clever solution.

03:06 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: a7624e9747 user: simonw)

That's brilliant!

2020-10-06
01:10 Reply: Insert ("slurp" ?) JSON documents (artifact: f20e8b409b user: simonw)

My sqlite-utils CLI tool can do exactly this:

echo '[
    {
        "id": 1,
        "name": "Cleo",
        "age": 4
    },
    {
        "id": 2,
        "name": "Pancakes",
        "age": 2
    },
    {
        "id": 3,
        "name": "Toby",
        "age": 6
    }
]' | sqlite-utils insert dogs.db dogs - --pk=id

Having run this a new database file called dogs.db will be created with the following schema:

$ sqlite3 dogs.db .schema
CREATE TABLE [dogs] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT,
   [age] INTEGER
);

2020-09-28
16:33 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: 67b08938fe user: simonw)

Yes, it has MANY limitations with things like that syntax, NaN etc.

01:45 Reply: Efficiently checking if a TEXT column exclusively contains integers or floating point numbers (as strings) (artifact: 6952c2332c user: simonw)

That's really useful, thank you for the example code!
