SQLite Forum

JOINs with FTS5 virtual tables are very slow



(1.2) By punkish on 2020-03-13 11:07:32 edited from 1.1 [link] [source]

I submitted the following question a few days ago, and Dan Kennedy very kindly replied. But the problem wasn't really resolved, so I am resubmitting it. This time I am trying to be less clever and am not submitting pseudo-code. Instead, these are actual queries that I ran about 5 mins ago. Here goes.

Note: the total row counts are as follows:

- treatments: 308498
- materialsCitations: 269546

The following query took 170s and returned 10 rows. Let's call this Query 1

## Query 1

```sql
SELECT 
	collectionCode, 
	Count(collectionCode) AS c 
FROM 
	materialsCitations
	JOIN treatments ON materialsCitations.treatmentId = treatments.treatmentId
	JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId
WHERE 
	collectionCode != ''
	AND materialsCitations.deleted = 0 
	AND treatments.deleted = 0
	AND vtreatments MATCH 'carabus'
GROUP BY 
	collectionCode;

-- QUERY PLAN
|--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
|--SEARCH TABLE materialsCitations USING INDEX ix_materialsCitations_collectionCode (deleted=? AND deleted=?)
|--SEARCH TABLE treatments USING COVERING INDEX ix_treatments_treatmentId (deleted=? AND treatmentId=?)
`--USE TEMP B-TREE FOR GROUP BY
```

Let's break it down into sub-queries. The following takes 668ms

## Query 2

```sql
SELECT 
	collectionCode, 
	Count(collectionCode) AS c 
FROM 
	materialsCitations
	JOIN treatments ON materialsCitations.treatmentId = treatments.treatmentId
WHERE 
	collectionCode != ''
	AND materialsCitations.deleted = 0 
	AND treatments.deleted = 0
GROUP BY 
	collectionCode;

-- QUERY PLAN
|--SEARCH TABLE materialsCitations USING INDEX ix_materialsCitations_collectionCode (deleted=? AND deleted=?)
`--SEARCH TABLE treatments USING COVERING INDEX ix_treatments_treatmentId (deleted=? AND treatmentId=?)
```

The following takes 37ms

## Query 3

```sql
SELECT treatmentId FROM vtreatments WHERE vtreatments MATCH 'carabus';

-- QUERY PLAN
SCAN TABLE vtreatments VIRTUAL TABLE INDEX 131073:
```

Now, let's put them together. The following query takes 439ms and also returns 10 rows. As one can see, **Query 1** is more than 380 times slower than **Query 4** even though it produces identical results.

## Query 4

```sql
SELECT 
	collectionCode, 
	Count(collectionCode) AS c 
FROM 
	materialsCitations
	JOIN (SELECT * FROM treatments WHERE treatmentId IN 
		(SELECT treatmentId FROM vtreatments WHERE vtreatments MATCH 'carabus')) t
		ON materialsCitations.treatmentId = t.treatmentId
WHERE 
	collectionCode != ''
	AND materialsCitations.deleted = 0 
	AND t.deleted = 0
GROUP BY 
	collectionCode;

-- QUERY PLAN
|--SEARCH TABLE materialsCitations USING INDEX ix_materialsCitations_collectionCode (deleted=? AND deleted=?)
|--LIST SUBQUERY 1
|  `--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
|--SEARCH TABLE treatments USING INDEX sqlite_autoindex_treatments_1 (treatmentId=?)
`--LIST SUBQUERY 1
   `--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
```

Consistently, my experience is that when I JOIN a virtual table to a normal table, the query is no longer performant. Is that expected, or can I do something to make such a JOIN efficient? If so, I will try to change my program so it generates queries of the **Query 4** type, although given the unpredictability of what params the users might submit, that is going to be very tricky.

A related observation is that querying a virtual table is usually very fast *unless* the query finds a lot of results. I don't have a cut-off number, but a query finding fewer than 100 rows is blindingly fast, while a query finding 80K rows is very slow. And I don't really mean "returning", I really mean "finding". That is, just returning the `Count(*)` of matching rows is very slow when the count is very high. I would expect such a count to come from some kind of term-frequency index.
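For anyone who wants to reproduce the shape of the problem, here is a minimal sketch using a toy schema and invented data (the table and column names follow the thread, but the rows are made up; it needs an SQLite build with FTS5, which Python's bundled sqlite3 usually has). It only shows that the two query shapes return identical results, not the timing gap, which needs full-size tables:

```python
import sqlite3

# Toy schema loosely mirroring the post; the data itself is made up.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE treatments (treatmentId TEXT PRIMARY KEY, deleted INTEGER);
CREATE TABLE materialsCitations (
    treatmentId TEXT, collectionCode TEXT, deleted INTEGER);
CREATE VIRTUAL TABLE vtreatments USING fts5(treatmentId, fullText);

INSERT INTO treatments VALUES ('t1', 0), ('t2', 0), ('t3', 0);
INSERT INTO materialsCitations VALUES
    ('t1', 'NMP', 0), ('t1', 'NMP', 0), ('t2', 'ZISP', 0), ('t3', 'NMP', 0);
INSERT INTO vtreatments VALUES
    ('t1', 'carabus description'), ('t2', 'carabus notes'), ('t3', 'other taxon');
""")

# Query 1 style: three-way JOIN with the FTS5 table.
q_join = """
SELECT collectionCode, Count(collectionCode) AS c
FROM materialsCitations
JOIN treatments  ON materialsCitations.treatmentId = treatments.treatmentId
JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId
WHERE collectionCode != ''
  AND materialsCitations.deleted = 0 AND treatments.deleted = 0
  AND vtreatments MATCH 'carabus'
GROUP BY collectionCode;
"""

# Query 4 style: the FTS5 MATCH pushed down into an IN subquery.
q_sub = """
SELECT collectionCode, Count(collectionCode) AS c
FROM materialsCitations
JOIN (SELECT * FROM treatments WHERE treatmentId IN
        (SELECT treatmentId FROM vtreatments WHERE vtreatments MATCH 'carabus')) t
     ON materialsCitations.treatmentId = t.treatmentId
WHERE collectionCode != ''
  AND materialsCitations.deleted = 0 AND t.deleted = 0
GROUP BY collectionCode;
"""

print(db.execute(q_join).fetchall())  # same rows either way
print(db.execute(q_sub).fetchall())
```

On a toy dataset both forms are instant; the 380x gap reported above comes from the planner's join-order choice on the full-size tables.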

(2) By Dan Kennedy (dan) on 2020-03-13 13:56:04 in reply to 1.2 [link] [source]

If you run "ANALYZE" so that SQLite can see that the (deleted=0) condition matches a large percentage of the table, does it come up with a better plan for query 1?
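To illustrate what ANALYZE records, here is a hedged miniature of the situation (hypothetical table and index names): when every row has `deleted = 0`, `sqlite_stat1` shows that the leading column of the index narrows the search to the whole table, which is exactly the information the planner is otherwise missing:

```python
import sqlite3

# Miniature of an index whose leading column (deleted) matches every row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE materialsCitations (treatmentId TEXT, deleted INTEGER)")
db.execute("CREATE INDEX ix_mc_deleted ON materialsCitations(deleted, treatmentId)")
db.executemany("INSERT INTO materialsCitations VALUES (?, 0)",
               [(f"t{i}",) for i in range(1000)])

db.execute("ANALYZE")
for tbl, idx, stat in db.execute("SELECT * FROM sqlite_stat1"):
    # stat reads "<total rows> <avg rows per deleted value> <avg rows per pair>",
    # so "1000 1000 ..." tells the planner that deleted=? filters out nothing.
    print(tbl, idx, stat)
```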

(3) By punkish on 2020-03-13 14:15:42 in reply to 2 [link] [source]

jeebus! that was it! I ran ANALYZE on both the treatments and materialsCitations tables. The QUERY PLAN changed ever so slightly (the order in which the treatments and materialsCitations tables are searched was flipped)

```
QUERY PLAN
|--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
|--SEARCH TABLE treatments USING INDEX sqlite_autoindex_treatments_1 (treatmentId=?)
|--SEARCH TABLE materialsCitations USING INDEX ix_materialsCitations_treatmentId (deleted=? AND treatmentId=?)
`--USE TEMP B-TREE FOR GROUP BY
```

But the query time was like night and day

```
sqlite> SELECT
   ...>     collectionCode,
   ...>     Count(collectionCode) AS c
   ...> FROM
   ...>     materialsCitations
   ...>     JOIN treatments ON materialsCitations.treatmentId = treatments.treatmentId
   ...>     JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId
   ...> WHERE
   ...>     collectionCode != ''
   ...>     AND materialsCitations.deleted = 0
   ...>     AND treatments.deleted = 0
   ...>     AND vtreatments MATCH "carabus"
   ...> GROUP
   ...>     BY collectionCode;
collectionCode  c
--------------  ----------
CFDD            1
CFDD, CTAY      3
CFDD, CTAY, CZ  23
CFDD, CTAY, CZ  2
IZAS, MPU, BMN  1
NMNHS           14
NMP             6
ZISP            2
ZISP, CAK, CBK  1
ZISP, CFDD, CT  1
Run Time: real 0.259 user 0.014076 sys 0.064197
sqlite>
```

boom! thanks Dan! you saved me hours of headache trying to rewrite my query-making routines.

A question – if ANALYZE makes such a difference, why doesn't SQLite just do it anyway? I mean, is it *not* advised to `ANALYZE` under certain circumstances? Keep in mind, I didn't first `INDEX` the tables and *then* change the `deleted` column (thereby possibly necessitating `ANALYZE`). The `deleted` column was set to 0 from the get-go.

(4) By Simon Slavin (slavin) on 2020-03-13 14:23:24 in reply to 3 [link] [source]

ANALYZE not only looks at the tables and the indexes, but also the data in the tables, to work out how 'chunky' the data in each column is. So you can set up your schema perfectly, do ANALYZE with no data in the tables, and SQLite will still not choose the best strategies for search.

Also, ANALYZE can take a long time since it has to iterate through all data in the database. So SQLite never does it automatically because it doesn't know when a delay would be convenient to your users.
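A toy demonstration of that point (hypothetical table names): ANALYZE measures the rows actually present, so running it before the data is loaded records little of value, while re-running it afterwards captures the real distribution:

```python
import sqlite3

# ANALYZE inspects actual data: on an empty table it has nothing to measure.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (a INTEGER, b TEXT)")
db.execute("CREATE INDEX ix_t_a ON t(a)")

db.execute("ANALYZE")
print(db.execute("SELECT * FROM sqlite_stat1").fetchall())
# ^ typically no useful row counts yet for the empty table

# Load 300 rows where column a has only 3 distinct values, then re-ANALYZE.
db.executemany("INSERT INTO t VALUES (?, ?)", [(i % 3, "x") for i in range(300)])
db.execute("ANALYZE")
print(db.execute("SELECT stat FROM sqlite_stat1 WHERE idx = 'ix_t_a'").fetchall())
# ^ now records 300 rows, ~100 rows per distinct value of a ("chunky" column)
```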

(5) By punkish on 2020-03-13 14:34:17 in reply to 4 [link] [source]

Thanks Simon, good to know this. My db is primarily read only. Once I fill it with data, it is used to power an API. So I can fill the db, create the indexes, run ANALYZE on all the tables, and be ready.

(6) By Richard Hipp (drh) on 2020-03-13 14:44:59 in reply to 4 updated by 6.1 [source]

> ANALYZE can take a long time

I just ran ANALYZE on the 93.7MB SQLite Fossil repository on my desktop
and it took 161 milliseconds.

But, yeah, depending on the database and the schema and the hardware on which it is running, ANALYZE can sometimes be expensive, as it does have to do a full scan of every index.  Finding ways to make this work better is an 
area of active research.  The new [approximate-analyze][approx] branch,
for example, is intended to be a place where I can experiment with ideas of
computing a reasonable analysis using sampling of each index rather than
reading each index from end to end.  (I had hoped to get to work on that some
today, but I might end up spending the whole day working on this forum...)

See also the [PRAGMA optimize][opt] command, which is intended to be run by
applications as they are shutting down.  The PRAGMA optimize command might
run an ANALYZE on one or two indexes, on an as needed basis, but it does not
typically run a complete ANALYZE on all indexes, and so it tends to be faster.
And it knows how to rerun ANALYZE when the content of the database changes
significantly. Fossil has been [using PRAGMA optimize][fossil-use] for about
3 years now and that has worked out quite well.

[approx]: https://sqlite.org/src/timeline?r=approximate-analyze
[opt]: https://sqlite.org/pragma.html#pragma_optimize
[fossil-use]: https://fossil-scm.org/fossil/artifact/d75f4c685956c943?ln=1975
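A minimal sketch of that shutdown pattern in application code (Python's sqlite3; the table names are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (a INTEGER)")
db.execute("CREATE INDEX ix_t_a ON t(a)")
db.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(500)])

# Normal application queries run here; SQLite notes which indexes were used.
rows = db.execute("SELECT count(*) FROM t WHERE a < 10").fetchone()[0]
print(rows)

# At shutdown: cheap and incremental - may run ANALYZE only where it would help,
# rather than a full ANALYZE of every index.
db.execute("PRAGMA optimize")
db.close()
```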

(7) By punkish on 2020-03-13 15:29:33 in reply to 6.0 [link] [source]

Yes, my experience was similar. Running ANALYZE on the two culprit tables in a 5 GB database took a couple of seconds, maybe less. That said, I will follow Simon's advice and run ANALYZE consciously, not automatically via a program. That should do the trick for me. Thanks all, especially to Dan Kennedy, for helping solve this.
