SQLite Forum

Help speeding up a query

(1) By jose isaias cabrera (jicman) on 2020-04-16 18:13:31

Greetings!

Pardon the long message. First, I hope you and all yours are well, and as soon as you read this post, please go and wash your hands. :-)

Long story short: we have a project-reporting tool backed by an Access DB, and it is slow, so I created an SQLite tool to make my own reports. All is working great, but a few queries are getting slower and slower. I run ANALYZE every Sunday night, and they are still just as slow.

A little history...
We export data from a global project tool to Excel and import only the rows that belong to our department into SQLite, which we use to report on various criteria. Since things were running well, more reports were requested, and we now import daily, sometimes two or three times a day. Imagine this scenario...

create table t0 (a, b, c, d, e, idate, PRIMARY KEY (a, idate));
insert into t0 (a, b, c, d, e, idate) values ('p001', 'Project I', 1000.0, 'a', 3, '2018-02-11');
insert into t0 (a, b, c, d, e, idate) values ('p002', 'Project II', 2039., 'a', 3, '2018-02-11');
insert into t0 (a, b, c, d, e, idate) values ('p003', 'Project III', 30023.0, 'b', 4, '2018-02-11');
insert into t0 (a, b, c, d, e, idate) values ('p001', 'Project I', 1010.0, 'b', 4, '2018-03-01');
insert into t0 (a, b, c, d, e, idate) values ('p002', 'Project II', 2039.0, 'c', 5, '2018-03-01');
insert into t0 (a, b, c, d, e, idate) values ('p003', 'Project III', 30023.0, 'd', 5, '2018-03-01');
...
...
insert into t0 (a, b, c, d, e, idate) values ('p001', 'Project I', 900.0, 'e', 5, '2020-03-20');
insert into t0 (a, b, c, d, e, idate) values ('p002', 'Project II', 3423.0, 'd', 6, '2020-03-20');
insert into t0 (a, b, c, d, e, idate) values ('p003', 'Project III', 40000.0, 'a', 6, '2020-03-20');
insert into t0 (a, b, c, d, e, idate) values ('p001', 'Project I', 900.0, 'f', 7, '2020-03-21');
insert into t0 (a, b, c, d, e, idate) values ('p002', 'Project II', 3500.0, 'g', 6, '2020-03-21');
insert into t0 (a, b, c, d, e, idate) values ('p003', 'Project III', 40000, 'h', 6, '2020-03-21');

As you can see, these imported values change a few times over the life of a project, but not daily. There are also about 56 fields. I have thought about normalizing the data, but that would take lots of code changes and extra effort; it may need to be done in the future, but for right now I just want to minimize the query response time. I have also created an index on a and idate:

CREATE INDEX IF NOT EXISTS a_idate_idx on t0 (a, idate);

The queries are related to searches on values other than a or idate. For example:
SELECT * FROM t0 as rec
WHERE rec.a IN 
(
  SELECT a FROM t0 WHERE 1=1 
  AND lower(rec.b) LIKE '%ioc-net%' 
)
AND
  rec.idate = 
    (SELECT MAX(idate) FROM t0 WHERE a = rec.a)
GROUP BY rec.a;

When I run this query on the real data, this is the result of .timer on:

Run Time: real 243.291 user 240.125000 sys 0.765625

I have tried .expert, but it tells me that no new indexes are needed. Right now the total number of records in t0 is 56984, and we do about 140 projects per year, so there is a lot of duplicated data for a and idate. Any thoughts on how to make this scenario faster? Thanks for your help.
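In case someone wants to reproduce this, here is a minimal Python script with synthetic rows of the same shape (the real table has ~56 fields and ~57k rows, so the timings obviously won't match):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t0 (a, b, c, d, e, idate, PRIMARY KEY (a, idate));
INSERT INTO t0 (a, b, c, d, e, idate) VALUES
  ('p001', 'Project I ioc-net', 1000.0, 'a', 3, '2018-02-11'),
  ('p001', 'Project I ioc-net',  900.0, 'f', 7, '2020-03-21'),
  ('p002', 'Project II',        2039.0, 'a', 3, '2018-02-11');
""")

# The slow query, exactly as in the post: for each matching project,
# keep only the row with the latest import date.
rows = con.execute("""
    SELECT * FROM t0 AS rec
    WHERE rec.a IN (SELECT a FROM t0 WHERE 1=1
                    AND lower(rec.b) LIKE '%ioc-net%')
      AND rec.idate = (SELECT MAX(idate) FROM t0 WHERE a = rec.a)
    GROUP BY rec.a
""").fetchall()
print(rows)
```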

josé

(2) By Keith Medcalf (kmedcalf) on 2020-04-16 19:18:00 in reply to 1

Firstly, (a, idate) is unique: you have defined it as the PRIMARY KEY. Your index a_idate_idx is therefore redundant. Even though you forgot to declare it UNIQUE, it is not required anyway; defining it is just a waste of time and space, since (a, idate) is already a unique key on the table.

As I understand your query, you want the latest entry for each project "a" that has a row with b being like '%ioc-net%'.

There is nothing you can do about having to scan the entire table to find the list of projects, since your LIKE pattern starts with a wildcard. Also, unless you have changed it, LIKE is case-insensitive, so there is no need to call lower() on each row. You can probably take advantage of SQLite's min/max optimization to get rid of the correlated subquery and retrieve the results directly, rather than traversing the entire table again; I have no idea how much difference that will make.

Is the following any faster?

  SELECT a, b, c, d, e, max(idate) AS idate
    FROM t0 AS rec
   WHERE a IN (SELECT DISTINCT a FROM t0 WHERE b LIKE '%ioc-net%')
GROUP BY a;
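A quick sanity check of this on synthetic rows (Python's sqlite3; with a single max() aggregate, SQLite takes the bare columns b..e from the same row that supplied the maximum):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t0 (a, b, c, d, e, idate, PRIMARY KEY (a, idate));
INSERT INTO t0 (a, b, c, d, e, idate) VALUES
  ('p001', 'ioc-net rollout', 1000.0, 'a', 3, '2018-02-11'),
  ('p001', 'Project I',        900.0, 'f', 7, '2020-03-21'),
  ('p002', 'Project II',      2039.0, 'a', 3, '2018-02-11');
""")

# Latest row per project, for every project that EVER had a matching b.
rows = con.execute("""
      SELECT a, b, c, d, e, max(idate) AS idate
        FROM t0 AS rec
       WHERE a IN (SELECT DISTINCT a FROM t0 WHERE b LIKE '%ioc-net%')
    GROUP BY a
""").fetchall()
print(rows)  # p001's latest row, even though its b no longer matches
```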

(3) By jose isaias cabrera (jicman) on 2020-04-16 19:43:11 in reply to 2

Yes, indeed. Here is the actual code:

sqlite> SELECT *, max(InsertDate)
   ...>     FROM Project_List
   ...>    WHERE ProjID IN (SELECT DISTINCT ProjID FROM Project_List WHERE Project_Name LIKE '%agr-2020%')
   ...> GROUP BY ProjID;
PR0000015542|6687299|..
..
..LICATIONS|e608313|2020-04-15_08-47-09|Waterfall|2020-04-15_08-47-09
Run Time: real 0.671 user 0.109375 sys 0.359375
sqlite>

This is perfect, Keith. Once again. Thanks.

josé

PS: I am glad you're here... Thanks.

(4) By Keith Medcalf (kmedcalf) on 2020-04-16 20:16:18 in reply to 3

Thanks Jose.

I thought this might make a big difference. If you look at the query plan for the original query, it basically has to do the following:

  1. Generate the list of "a" by doing a table scan of t0
  2. Scan each row in t0 matching the above
    a) for each row do a subquery to check if it is the max(idate)

The alternate query does the following:

  1. Generate the list of "a" by doing a table scan of t0
  2. For each "a" in the above list, find and output the last record in the index for that group (a single index seek, basically)

So the revised query avoids an extra scan of each group of "a", and a correlated subquery for each row in the group to see whether it is the one with the max(idate).
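You can see the difference with EXPLAIN QUERY PLAN; a sketch against a scratch table (the exact plan text varies between SQLite versions, but the subquery steps show up only for the original form):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t0 (a, b, c, d, e, idate, PRIMARY KEY (a, idate))")

slow = """
    SELECT * FROM t0 AS rec
    WHERE rec.a IN (SELECT a FROM t0 WHERE lower(rec.b) LIKE '%ioc-net%')
      AND rec.idate = (SELECT MAX(idate) FROM t0 WHERE a = rec.a)
    GROUP BY rec.a"""
fast = """
    SELECT a, b, c, d, e, max(idate) AS idate FROM t0
    WHERE a IN (SELECT DISTINCT a FROM t0 WHERE b LIKE '%ioc-net%')
    GROUP BY a"""

# Column 3 of each EXPLAIN QUERY PLAN row is the plan-step description.
slow_plan = " | ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + slow))
fast_plan = " | ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + fast))
print("slow:", slow_plan)
print("fast:", fast_plan)
```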

Note that this would not work in most other SQL databases because most of them do not allow the select list to contain values that are not in the group by list and are not aggregates.
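For what it's worth, a portable way to get the same rows on engines that reject bare columns is to join against the per-project maximum (standard SQL; sketched here with synthetic rows in Python's sqlite3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t0 (a, b, c, d, e, idate, PRIMARY KEY (a, idate));
INSERT INTO t0 (a, b, c, d, e, idate) VALUES
  ('p001', 'ioc-net rollout', 1000.0, 'a', 3, '2018-02-11'),
  ('p001', 'Project I',        900.0, 'f', 7, '2020-03-21'),
  ('p002', 'Project II',      2039.0, 'a', 3, '2018-02-11');
""")

# Join each project to its latest import date, then pull the full row
# for that (a, idate) pair -- no bare columns needed.
rows = con.execute("""
    SELECT rec.*
      FROM t0 AS rec
      JOIN (SELECT a, MAX(idate) AS idate FROM t0 GROUP BY a) latest
        ON rec.a = latest.a AND rec.idate = latest.idate
     WHERE rec.a IN (SELECT DISTINCT a FROM t0 WHERE b LIKE '%ioc-net%')
""").fetchall()
print(rows)
```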

(5) By jose isaias cabrera (jicman) on 2020-04-16 20:58:04 in reply to 4

Thanks, Keith. I am going to take some time to dig a bit deeper into both queries and understand how to make my queries slim, compact, and fast. Thanks for the college course; it is really appreciated.

josé