Some Queries are 40 times slower after v3.39.1

(1.3) By Suat Hakki (suathd) on 2024-02-13 00:27:09 edited from 1.2 [source]

After upgrading to newer versions of sqlite3.dll some queries became 40 times slower.

sample db is at the following link.

https://dmksoft.com/test/TestDb.zip

sample SQL is below (it takes more than 7 seconds while it takes 0.2 seconds in earlier versions)

Note: eliminating just 1 table join, makes query time normal.

select a.SongId ItemId, a.SongName, c.GenreName, d.LanguageName, h.SubgenreName, e.ArtistName, n.CutAndFadeName from SONGS a, GENRES c, LANGUAGES d, OTHER_ARTISTS e, SUBGENRES h, ( select 0 CutAndFadeId, '<.?.>' CutAndFadeName UNION ALL select 1, 'CC' UNION ALL select 2, 'CM' UNION ALL select 3, 'CF' UNION ALL select 4, 'MC' UNION ALL select 5, 'MM' UNION ALL select 6, 'MF' UNION ALL select 7, 'FC' UNION ALL select 8, 'FM' UNION ALL select 9, 'FF' ) n

where e.ArtistId=a.ComposerId and c.GenreId=a.GenreId and d.LanguageId=a.LanguageId and n.CutAndFadeId=a.CutAndFade and h.SubgenreId=a.SubgenreId and a.SongName like 'Hata%'

EDIT: I've justed tested 3.45.1 Problem is solved. It takes 156 msec. probably it is the best performance of all versions. Thank you very much for all efforts.

(2) By Richard Hipp (drh) on 2023-12-29 00:23:31 in reply to 1.0 [link] [source]

A quick work-around for your specific database is to run this:

PRAGMA analysis_limit=100; ANALYZE;

The fact that this was working efficiently earlier was dumb luck. I will see if I can improve the query planner to deal better with database schemas similar to yours for a future release.

(3) By Richard Hipp (drh) on 2023-12-29 00:37:46 in reply to 2 [link] [source]

Another approach would be to rewrite your query like this:

WITH n(CutAndFadeId,CutAndFadeName) AS (
  VALUES(0,'<.?.>'),(1,'CC'),(2,'CM'),(3,'CF'),
        (4,'MC'),(5,'MM'),(6,'MF'),(7,'FC'),(8,'FM'),(9,'FF'))
SELECT a.SongId ItemId,
       a.SongName,
       c.GenreName,
       d.LanguageName,
       h.SubgenreName,
       e.ArtistName,
       n.CutAndFadeName 
  FROM SONGS a, GENRES c, LANGUAGES d, OTHER_ARTISTS e, SUBGENRES h, n
 WHERE e.ArtistId=a.ComposerId
   AND c.GenreId=a.GenreId 
   AND d.LanguageId=a.LanguageId 
   AND n.CutAndFadeId=a.CutAndFade 
   AND h.SubgenreId=a.SubgenreId 
   AND a.SongName like 'Hata%';

In other words, factor our the subquery into a ordinary common table expression implemented using a VALUES clause rather than a big UNION ALL. The query planner is able to deal better with that case.

(5.1) Originally by Suat Hakki (suathd) with edits by Richard Hipp (drh) on 2023-12-29 11:48:43 from 5.0 in reply to 3 [link] [source]

Unfortunatelly using more than one inline set, made the query 160 times slower (32 seconds). I think some enhancements should be applied in future versions.

WITH 
n(CutAndFadeId,CutAndFadeName) AS (
  VALUES(0,'<.?.>'),(1,'CC'),(2,'CM'),(3,'CF'),
        (4,'MC'),(5,'MM'),(6,'MF'),(7,'FC'),(8,'FM'),(9,'FF')),
g(TempoId, TempoName) AS (
  VALUES(0,'<.?.>'),(1,'S'),(2,'M'),(3,'F'))
SELECT a.SongId ItemId,
       a.SongName,
       c.GenreName,
       d.LanguageName,
       h.SubgenreName,
       e.ArtistName,
       n.CutAndFadeName,
       g.TempoName
  FROM SONGS a, GENRES c, LANGUAGES d, OTHER_ARTISTS e, SUBGENRES h, n, g
 WHERE e.ArtistId=a.ComposerId
   AND c.GenreId=a.GenreId 
   AND d.LanguageId=a.LanguageId 
   AND n.CutAndFadeId=a.CutAndFade 
   AND g.TempoId=a.TempoMid 
   AND h.SubgenreId=a.SubgenreId 
   AND a.SongName like 'Hata%';

(9) By Richard Hipp (drh) on 2023-12-29 12:16:35 in reply to 5.1 [link] [source]

The _SONG16 index looks like this:

CREATE INDEX _SONGS16 on SONGS (GenreId);

This is not a good index because the GenreId has a very lopsided distribution of values. 75% of the entries in the SONGS table are in the top 4 genres, and your query is hitting all of the most popular genres. This turns the index lookup into something closer to a full table scan, making the index ineffective at speeding up the query.

SQLite does not currently do a very good job of detecting lopsided indexes like this. Perhaps I can improve that in a future release.

Meanwhile, I find that you can get excellent performance on all of the queries you have shown me, on all recent versions of SQLite if you will just run:

DROP INDEX _SONGS16;

(4.5) By Suat Hakki (suathd) on 2023-12-29 03:33:08 edited from 4.4 in reply to 2 [link] [source]

I used "case when" while retrieving names from both Tempo and CutAndFades tables instead of using inline tables.

select a.SongId, a.SongName, case when a.CutAndFade=0 then '<.?.>' when a.CutAndFade=1 then 'CC' ... ... end CutAndFadeName, case when a.TempoMid=0 then '<.?.>' when a.TempoMid=1 then 'S' when a.TempoMid=2 then 'M' when a.TempoMid=1 then 'F' end TempoName, ... from SONGS a, ...

it is very fast (0.2 seconds) in all versions.

this issue is not urgent for me any more. But it may be helpful for finding and fixing some performance problems.

thank you very much...

(6) By Stephan Beal (stephan) on 2023-12-29 03:50:36 in reply to 4.5 [link] [source]

case
  when a.TempoMid=0 then '<.?.>'
  when a.TempoMid=1 then 'S'
  when a.TempoMid=2 then 'M'
  when a.TempoMid=1 then 'F'
 end TempoName,

Note that you have a duplicated WHEN condition, so you may not be getting the results you want. The first match in the WHEN list will be used:

sqlite> select case when 1=1 then 'hi' when 1=1 then 'bye' end;
hi

(7.3) By Suat Hakki (suathd) on 2023-12-29 11:45:33 edited from 7.2 in reply to 6 [link] [source]

yes. sorry. I added Tempo case quickly for test. In my original SQL, everything was ok (when a.TempoMid=3 then 'F').

P.S: In my 1st Post, eliminating/excluding 1 table (GENRES or SUBGENRES or OTHER_ARTISTS etc.) causes SQL works in normal speed (0.2 secs) although inline table (CutAndFades) is used. I think the performance issue is not directly related to inline table(s).

P.S.2: SQL in my first post works in normal speed in some other customer databases having same amount data. Problem may be related to some records in that database.

many thanks again.

(8) By TripeHound on 2023-12-29 12:01:09 in reply to 7.3 [link] [source]

I've not explored to see if there is a difference, so just thinking aloud... but have you tried putting the CutAndFadeId/Name values in a real table, rather than defining them "on the fly" in the query?

(10.2) By Suat Hakki (suathd) on 2023-12-29 12:28:30 edited from 10.1 in reply to 8 [link] [source]

Now, I've created CUT_AND_FADES table (with no index for test) and inserted 10 records;

create table CUT_AND_FADES (CutAndFadeId Integer, CutAndFadeName nVarChar(5), CutAndFadeNameCiAi nVarChar(5));

insert into CUT_AND_FADES values (0, '<.?.>', '<.?.>'); insert into CUT_AND_FADES values (1, 'CC', 'cc'); insert into CUT_AND_FADES values (2, 'CM', 'cm'); insert into CUT_AND_FADES values (3, 'CF', 'cf'); insert into CUT_AND_FADES values (4, 'MC', 'mc'); insert into CUT_AND_FADES values (5, 'MM', 'mm'); insert into CUT_AND_FADES values (6, 'MF', 'mf'); insert into CUT_AND_FADES values (7, 'FC', 'fc'); insert into CUT_AND_FADES values (8, 'FM', 'fm'); insert into CUT_AND_FADES values (9, 'FF', 'ff');

select a.SongId ItemId, a.SongName, c.GenreName, d.LanguageName, h.SubgenreName, e.ArtistName, n.CutAndFadeName from SONGS a, GENRES c, LANGUAGES d, OTHER_ARTISTS e, SUBGENRES h, CUT_AND_FADES n where e.ArtistId=a.ComposerId and c.GenreId=a.GenreId and d.LanguageId=a.LanguageId and n.CutAndFadeId=a.CutAndFade and h.SubgenreId=a.SubgenreId and a.SongName like 'Hata%'

executed in 156 MSec.

(12) By Dan Kennedy (dan) on 2023-12-30 15:16:28 in reply to 10.2 [link] [source]

SQLite expert suggests:

CREATE INDEX SONGS_idx_f03d6948 ON SONGS(CutAndFade, SongName COLLATE NOCASE);

https://www.sqlite.org/cli.html#index_recommendations_sqlite_expert_

Dan.

(13.1) By Suat Hakki (suathd) on 2023-12-31 13:04:28 edited from 13.0 in reply to 12 [link] [source]

create index _SONGS49 on SONGS (CutAndFade)

is already used. problem is not related to index.

thanks...

(15) By Richard Hipp (drh) on 2023-12-31 13:23:25 in reply to 13.0 [link] [source]

There is a very big difference between single-column indexes and multi-column indexes. Yes, you do already have a single-column index on SONGS(CutAndFade). But you do not have a multi-column index on SONGS(CutAndFade, SongName). It is the latter multi-column index that will make your query run efficiently.

Somewhere, you got the idea that if you go through and index every single column in your table individually, all queries will run fast. This is not true. On the contrary, such a strategy usually leads to performance problems.

The performance problems you are having are definitely related to your indexing strategy. You would do well to DROP most of your indexes and only retain a select few of them that actually help, and to add a few more multi-column indexes, such as the one that Dan suggested above. Doing so will make all of your queries faster, and dramatically reduce the size of your database.

You would do well to follow Dan's advice and create the two-column index he suggests. I think that will completely resolve your present issue.

Meanwhile, we are striving to make SQLite resistant to questionable index designs, so that it generates fast solutions even in cases like yours that have suboptimal indexes. Application developers are not expected to understand all of the subtle implications of indexes and how to best select appropriate indexes. We are attempting to improve SQLite so that it generates fast answers even when the indexes are not the best. This is an enhancement effort, however, not a bug fix.

If you will put forth just a tiny bit of effort toward improving your choice of indexes, and if we on our side work to make SQLite more resistant to questionable indexing practices, our combined efforts will be a win for everyone.

(17) By Suat Hakki (suathd) on 2024-01-02 01:20:03 in reply to 15 [link] [source]

sorry, I don't think so.

I've just created the indexes for all possible fields where the user can make filtering (depends on application's user-interface). There are many un-indexed fields on SONGS table, because the user will not be allowed to query by those fields.

I think creating multi-column indexes for almost each fields (e.g CutAndFade + SongName, GenreId + SongName, BmpMid + SongName) is not logical.

In the sample SQL (which is simplifed version of main one), query is filtered by just SongName. But it can be filtered by many fields.

I'm agree with you that using many indexes causes bigger databases.

I'm using "case ... when" instead of inline tables (query on fly). I'm satisfied.

many thanks for your kindly replies and efforts.

(18) By Anton Dyachenko (ahtoh_) on 2024-01-02 20:25:44 in reply to 17 [link] [source]

I work in the same area - track metadata library - so we have a similar set of one column indexes with exact same motivation users can have multi column search/filter. This is the second version of our schema, in the first version we separate all the searchable columns in 2 tables one that contains textual values and the other that contains integer values. So we had exactly 2 indexes and join searchable columns on each query. It seems denormalized version with multiple indexes works better(we have changed schema 3 years ago so no numbers exists any more to compare) and it is easier to generate query from user input. The main reason for better perf is much less joins. However I don't claim that our normalized version was properly implemented even more I am going to give it a second chance with completely different/new schema.

Question to experts what kind of schema would be the best to allow multi column search but note that the set of columns can be adjusted by user and there are 2 types of metadata text and numbers.

(19) By Richard Hipp (drh) on 2024-01-02 20:39:17 in reply to 18 [link] [source]

In this case, simply drop indexes on columns that never have more than a single value. We found that in Suat's database, simply doing

DROP INDEX _SONGS49;

Was sufficient to resolve his performance issues. There are a bunch of other indexes in Suat's database that are on columns that never hold more than a single value. But they were not causing problems for the queries he showed us.

I suppose the question is now moot, as we have now enhanced SQLite to recognize when an index is useless (like _SONGS49) and to avoid using it. So it should now "just work".

(20) By Anton Dyachenko (ahtoh_) on 2024-01-02 23:29:01 in reply to 19 [link] [source]

Thanks, got it, you have fixed SQLite to avoid problems with such indexes so the traditional table with all columns and a set of one-column indexes for each searchable column should work great now even if an indexed column has just one value.

What about getting the best performance for such task? Currently, I am working on a schema where all textual metadata columns reference an id from an immutable dictionary. The use case I try to optimize is user has typed some text which usually is a substr of a textual metadata and db needs to search it over all visible searchable columns via LIKE function. Every time user types a new character another search is done. So my goal is instant UX over ~100K tracks table on an embedded device. So in general my experimental schema design and the way I get data from it is like:

    CREATE TABLE text (
        id INTEGER PRIMARY KEY,
        txt TEXT NOT NULL UNIQUE CHECK(LENGTH(txt) > 0)
    );
    CREATE TABLE track (
        id INTEGER PRIMARY KEY,
        album INT REFERENCES text,
        artist INT REFERENCES text,
        title INT REFERENCES text
    );
    CREATE INDEX track_album ON track(album);
    CREATE INDEX track_artist ON track(artist);
    CREATE INDEX track_title ON track(title);

    CREATE TEMP TABLE found_track AS
    WITH
        xxx AS (
            SELECT id FROM text
            WHERE txt LIKE ?1
        )
    SELECT id FROM track
    WHERE album IN (SELECT id FROM xxx)
    UNION
    SELECT id FROM track
    WHERE artist IN (SELECT id FROM xxx)
    UNION
    SELECT id FROM track
    WHERE title IN (SELECT id FROM xxx);

SELECT track.id AS id,
	album_text.txt AS album,
	artist_text.txt AS artist,
	title_text.txt AS title
FROM track
	LEFT JOIN text AS album_text ON album_text.id = track.album
	LEFT JOIN text AS artist_text ON artist_text.id = track.artist
	LEFT JOIN text AS title_text ON title_text.id = track.title
WHERE id IN (SELECT id FROM found_track);

In addition to this, sometime later, I am going to check if FTS MATCH over the text table will improve anything. I think just a UNIQUE constraint or an index can't be used to improve the perf of the LIKE search so FTS should help here. Also the traditional approach (that my app is using now) where artist/title/album are ordinary text columns with indexes (which again doesn't help with like) may only helps to avoid reading whole table row which is quite big as other non-searchable and big columns are in the same table.

I understand you can't fully answer the question and I need to check perf for all my ideas. I want to know if the approach above is bad/wrong/suboptimal/contradicts best practices so any problem I don't see. Or you might advise a better design or high-level principles I need to follow to avoid problems/achieve the best perf.

(21) By ddevienne on 2024-01-03 08:17:15 in reply to 20 [link] [source]

Two remarks come to my mind when reading your post.

First, the schema is denormalized, album and artist would normally be separate tables, and track would have an FK to its album. Would also allow modeling the fact an album (or possibly individual tracks on an album) have different artists, something you current model cannot represent w/o duplications (again denormalized).

Second, factoring common strings is also something I've played with, in my case for enum-like closed-set values, while your use-case is more open-ended and geared toward searching. I don't think it's very practical.

Did you try a more traditional relational (3rd normal form) schema before this one? Would perform well I suspect, plus you could perhaps do your LIKE (or FTS) searches in parallel on the separate track.title, artist.name, and album.title tables/indexes (using separate connections of course), possibly be faster (although even single-threaded / connection will likely be fast enough perhaps).

(23) By Anton Dyachenko (ahtoh_) on 2024-01-03 22:34:00 in reply to 21 [link] [source]

thanks for your reply,

First, the schema is denormalized, album and artist would normally be separate tables, and track would have an FK to its album. Would also allow modeling the fact an album (or possibly individual tracks on an album) have different artists, something you current model cannot represent w/o duplications (again denormalized).

my model is tightly bound to id3 tags and other similar sources of metadata where all the data is denormalized and yes all data is represented with duplication. This is not a problem by itself and duplication is taken into account in experimental schema by having the dictionary. So instead of having separate album/genre/artist/etc tables, there will be exactly one table text. Note that user can arbitrarily change any of these data but only for one track not for all that shares the same value, so just denormalized tables aren't a solution, they should be either EAV (allows modification and thus duplicates) or immutable and therefore open-ended.

Second, factoring common strings is also something I've played with, in my case for enum-like closed-set values, while your use-case is more open-ended and geared toward searching. I don't think it's very practical.

tracks library by nature has an open-set domain for each column and I don't see any practical advantage even to try to make it manually manageable closed-set as the same advantage of substitution a string by int comes from using the dictionary.

Did you try a more traditional relational (3rd normal form) schema before this one?

kind of yes, EAV schema was used in the very first version but it turned to be slow due to lots of joins, however, I am not sure that the slow perf was due to EAV and not lacking proper indexes I don't want to spend lot of hours digging git history to check this. Anyway experemental schema with the dictionary is denormalized only text columns and immutable with all necessary indexes but not numeric ones, so it is both semi-normalized.

you could perhaps do your LIKE (or FTS) searches in parallel on the separate track.title, artist.name, and album.title tables/indexes (using separate connections of course), possibly be faster (although even single-threaded / connection will likely be fast enough perhaps).

not sure it will perform faster in worst-case scenario, as all connections will have to read pages from disk concurrently (cold cache) which isn't great in case of poor usb flash sticks or HDD, and of course it makes code much more complicated due to end sync users usually see search result ordered by some columns not just filtered. So search perf in a single connection is never a problem on desktop, it is only a concern on embedded devices with arbitrary quality of disks usually poor and on first search.

(11) By Richard Hipp (drh) on 2023-12-29 21:59:03 in reply to 1.0 [link] [source]

The underlying problem here is that your TestDb.db database schema contains a lot of very bad indexes - indexes that do not help with queries but which do a great job of getting the query planner confused.

I have an experimental branch that attempts to address this problem by recognizing bad indexes and marking them to be ignored by the query planner. Are you able to compile the latest check-in on that break from sources and try it out in your system? That would be a big help.

After installing the new SQLite, you'll need to run ANALYZE, either with or without "PRAGMA analysis_limit" - whatever you were doing before. The branch works great on your database for the queries that you showed us, but I imagine that you have a great many more queries that we have not seen yet. If able, please give the new code a try and let us know if it (1) resolves your issue and (2) does not cause any new issues.

Please note: you must run ANALYZE using the newer SQLite in order for the changes to take effect.

Third parties are also welcomed and encouraged to try out this branch and report any issues.

To get the source code from the branch timeline, first click on the check-in hash just to the right of the "check-in:" label for the check-in you want to try. On the new page, click on "Tarball" or "ZIP archive" links just to the right of "Downloads:" near the top of the page.

(14.1) By Suat Hakki (suathd) on 2023-12-31 13:03:02 edited from 14.0 in reply to 11 [link] [source]

hello Richard, you are right. there should be a problem on that database. because the same SQL works well on hundreds of similar databases (containing nearly same amount data).

I'm just using sqlite3.dll. It is not possible to build it from source code currently (I have no C++ compiler).

I will wait for sqlite3.dll's 3.44.3 and make test with same SQL. If you have beta version for sqlite3.dll's 3.44.3 I can test.

thank you very much.

(16) By Richard Hipp (drh) on 2024-01-01 19:23:17 in reply to 1.0 [link] [source]

The latest trunk check-in now computes answers quickly, even without having to fix the indexes in the schema.

(22) By EricG (EriccG) on 2024-01-03 09:05:59 in reply to 1.0 [link] [source]

Not directly answering your query, but have you tried using an FTS index for pre-filtering?

IME with metadata, especially "approximate" metadata such as for songs, this works very well.

You build a text like keyword + value

SongName Happy Memories
ArtistName Foobar Fun
...

and just FTS index that. When doing a search, you query matches for keyword + value, rank and post-filter the values.

This allows to perform arbitrary searches with a single index, if you want exact match, the post-filtering guarantees that, if you want fuzzy matches, it's available as well.

The FTS search index is quite compact and efficient IME, and you won't have need to re-index or re-analyze (which can sometimes get lengthy)