SQLite User Forum

Minimize the useless INSERTs

(1) By jose isaias cabrera (jicman) on 2022-08-18 20:55:20

Greetings.

I am sure this will be a quickie for some of you, but I have been trying to figure it out and cannot. Imagine the following tables and inserts:

create table tt (pid, pn, a, y, type, d, idate);
insert into tt values ('p001', 'Titin', 1, 2019, 'Closure', 4, '2019-02-11');
insert into tt values ('p002', 'Billy', 2, 2019, 'Control', 4, '2019-02-12');
insert into tt values ('p003', 'Carol',  3, 2019, 'Cloud', 4, '2019-02-13');
insert into tt values ('p004', 'Jazmin',  4, 2019, 'APPS', 4, '2019-02-14');
insert into tt values ('p005', 'Peter',  5, 2019, 'Autodesk', 4, '2019-02-15');
insert into tt values ('p001', 'Sofia',  6, 2020, 'Closure', 4, '2019-02-16');
insert into tt values ('p002', 'Billy',  7, 2019, 'Control', 4, '2019-02-17');
insert into tt values ('p003', 'Carol',  8, 2019, 'Cloud', 4, '2019-02-18');
insert into tt values ('p004', 'Jazmin',  9, 2019, 'APPS', 4, '2019-02-19');
insert into tt values ('p001', 'Titin',  10, 2020, 'Digital', 4, '2019-02-20');
insert into tt values ('p002', 'Billy',  11, 2020, 'Control', 4, '2019-02-21');
insert into tt values ('p003', 'Mari',  12, 2020, 'Signature', 4, '2019-02-22');
insert into tt values ('p004', 'Jazmin',  13, 2020, 'APPS', 4, '2019-02-23');
insert into tt values ('p005', 'Peter',  14, 2020, 'Autodesk', 8, '2019-02-24');
insert into tt values ('p001', 'Titin',  15, 2020, 'Digital', 8, '2019-02-24');

create table t0 (pid, pn, idate, primary key(pid, pn));
create table t1 (pid, type, idate, primary key(pid, type));
insert or replace into t0 select pid, pn, idate from tt;
insert or replace into t1 select pid, type, idate from tt;

These two INSERTs give me what I want because the UNIQUE constraint helps me; however, I have to insert 15 times into each of the t0 and t1 tables. The real-world problem is that the tt table has 114414 records and growing, and although the INSERTs are very fast, it is writing 114414 times. What I would like is to write only the records that actually end up in the t0 and t1 tables, without all those useless writes. I know I can get the answer from this:

insert or ignore into t0 select pid, pn, idate from tt order by idate desc;
insert or ignore into t1 select pid, type, idate from tt order by idate desc;

but that is sorted by the date, and I would like it sorted by the order in which the records were entered into the tt table. I have tried a few variations of this:

select a.pid, a.pn, a.idate from tt a
join tt b on a.pid = b.pid
where a.pn in ( select distinct pn from tt )
and a.idate = 
    ( select max(idate) from tt where pid = a.pid and pn = a.pn)
group by a.pid;

But this is not the right answer. Any help or thoughts would be greatly appreciated. Thanks.
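
The write amplification described above can be reproduced with Python's sqlite3 module and the sample rows, reduced to the relevant columns (the in-memory database and variable names are illustrative assumptions, not part of the thread): all 15 rows get written, while only the 7 distinct (pid, pn) pairs survive in t0.

```python
import sqlite3

rows = [
    ('p001', 'Titin',  '2019-02-11'), ('p002', 'Billy',  '2019-02-12'),
    ('p003', 'Carol',  '2019-02-13'), ('p004', 'Jazmin', '2019-02-14'),
    ('p005', 'Peter',  '2019-02-15'), ('p001', 'Sofia',  '2019-02-16'),
    ('p002', 'Billy',  '2019-02-17'), ('p003', 'Carol',  '2019-02-18'),
    ('p004', 'Jazmin', '2019-02-19'), ('p001', 'Titin',  '2019-02-20'),
    ('p002', 'Billy',  '2019-02-21'), ('p003', 'Mari',   '2019-02-22'),
    ('p004', 'Jazmin', '2019-02-23'), ('p005', 'Peter',  '2019-02-24'),
    ('p001', 'Titin',  '2019-02-24'),
]

con = sqlite3.connect(":memory:")
con.execute("create table tt (pid, pn, idate)")
con.executemany("insert into tt values (?, ?, ?)", rows)
con.execute("create table t0 (pid, pn, idate, primary key(pid, pn))")

# INSERT OR REPLACE touches every source row: each duplicate key still
# costs a write, even though it just overwrites the previous value.
cur = con.execute("insert or replace into t0 select pid, pn, idate from tt")
written = cur.rowcount                                  # rows written (replaces included)
kept = con.execute("select count(*) from t0").fetchone()[0]
print(written, kept)
```

Fifteen writes are performed to keep only seven distinct rows, which is exactly the overhead the post is asking to avoid.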

(2) By Michael A. Cleverly (cleverly) on 2022-08-18 22:38:51 in reply to 1

Thank you for showing your schema, some sample values, and your actual queries.

I think what you want is:

insert or replace into t0 select pid, pn, max(idate) as idate from tt group by pid, pn;
insert or replace into t1 select pid, type, max(idate) as idate from tt group by pid, type;

Though if you never do any other DML on t0 and t1 I would drop them as tables and just recreate them as views:

create view t0 as select pid, pn, max(idate) as idate from tt group by pid, pn;
create view t1 as select pid, type, max(idate) as idate from tt group by pid, type;

As far as ordering goes, I would recommend adding an explicit ORDER BY rather than relying on luck.
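
The GROUP BY form can be sketched end-to-end with a reduced version of the sample data (Python's sqlite3 with an in-memory database; an illustrative assumption rather than part of the thread):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table tt (pid, pn, idate);
insert into tt values ('p001', 'Titin', '2019-02-11');
insert into tt values ('p001', 'Sofia', '2019-02-16');
insert into tt values ('p001', 'Titin', '2019-02-24');
create table t0 (pid, pn, idate, primary key(pid, pn));
""")

# One row per (pid, pn) group, carrying that group's latest idate:
# only the distinct combinations are ever written.
con.execute("""
insert or replace into t0
  select pid, pn, max(idate) as idate from tt group by pid, pn
""")

rows = con.execute("select pid, pn, idate from t0 order by pid, pn").fetchall()
print(rows)
```

'Titin' appears twice in tt, but its group contributes a single row with the later of its two dates.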

(3) By jose isaias cabrera (jicman) on 2022-08-19 12:41:37 in reply to 2

Thank you, Michael. Why do I always want to make things harder? It was just max(idate). Darn it!

The reason it's a table is that this data is used for searching. Thanks for the support.

(4) By jose isaias cabrera (jicman) on 2022-08-19 19:17:15 in reply to 3

Interestingly enough, the original INSERT command, which INSERTs the whole set of records, is much faster than the alternative Michael provided. For example, these are the times and record counts for each table with Michael's INSERT:

BEGIN;
  INSERT OR REPLACE INTO PL_ProjType_Search
    SELECT ProjID, Project_Type, max(InsertDate) AS InsertDate FROM Project_List 
    GROUP by ProjID, Project_Type 
;
END;
...
...
BEGIN;
  INSERT OR REPLACE INTO PL_ITOBS_Search
    SELECT ProjID, IT_OBS, max(InsertDate) AS InsertDate FROM Project_List 
    GROUP by ProjID, IT_OBS 
;
END;

Loading 1741 records in table PL_ProjType_Search took 27.55 secs.
Loading 1570 records in table PL_ProjName_Search took 0.5 secs.
Loading 2087 records in table PL_ClarityPM_Search took 0.53 secs.
Loading 2311 records in table PL_PMOContact_Search took 0.52 secs.
Loading 2168 records in table PL_BRM_Search took 0.54 secs.
Loading 2205 records in table PL_BegDate_Search took 0.53 secs.
Loading 2641 records in table PL_EndDate_Search took 0.52 secs.
Loading 1989 records in table PL_GoLive_Search took 0.46 secs.
Loading 2076 records in table PL_Purpose_Search took 0.44 secs.
Loading 1783 records in table PL_InvestInit_Search took 0.43 secs.
Loading 1899 records in table PL_BusOBS_Search took 0.43 secs.
Loading 1548 records in table PL_ITOBS_Search took 0.56 secs.

Using just the original INSERT command, this is the result:

BEGIN;
  INSERT OR REPLACE INTO PL_ProjType_Search
    SELECT ProjID, Project_Type, InsertDate FROM Project_List 
;
END;
...
...
BEGIN;
  INSERT OR REPLACE INTO PL_ITOBS_Search
    SELECT ProjID, IT_OBS, InsertDate FROM Project_List 
;
END;

Loading 114456 records in table PL_ProjType_Search took 11.61 secs.
Loading 114456 records in table PL_ProjName_Search took 0.48 secs.
Loading 114456 records in table PL_ClarityPM_Search took 0.49 secs.
Loading 114456 records in table PL_PMOContact_Search took 0.54 secs.
Loading 114456 records in table PL_BRM_Search took 0.49 secs.
Loading 114456 records in table PL_BegDate_Search took 0.5 secs.
Loading 114456 records in table PL_EndDate_Search took 0.48 secs.
Loading 114456 records in table PL_GoLive_Search took 0.39 secs.
Loading 114456 records in table PL_Purpose_Search took 0.4 secs.
Loading 114456 records in table PL_InvestInit_Search took 0.35 secs.
Loading 114456 records in table PL_BusOBS_Search took 0.36 secs.
Loading 114456 records in table PL_ITOBS_Search took 0.47 secs.

Notice the number of records the last commands are inserting, relying on the UNIQUE constraint to provide the right result. SQLite is fast as heck. I would have expected Michael's INSERT/query to be much faster, since it inserts (at most) 2.3% of the records vs. 114456 for the last one. But apparently SQLite's internals are faster at INSERTing than at sorting.

(5) By Michael A. Cleverly (cleverly) on 2022-08-19 19:55:30 in reply to 4

Using just:

  INSERT OR REPLACE INTO PL_ITOBS_Search
    SELECT ProjID, IT_OBS, InsertDate FROM Project_List 

you have no guarantee that the InsertDate of the last row replaced into PL_ITOBS_Search is the max(InsertDate) from Project_List. Without an explicit ORDER BY clause you'll get whatever order the query planner came up with, and it isn't guaranteed to be stable.

So it isn't exactly an apples to apples comparison.

I bet you can speed up the select with the group by if you added appropriate indexes.
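
Per the guarantee point above, one way to make the replace-based load deterministic is an explicit ascending ORDER BY, so the newest InsertDate is always replaced last. A sketch under the assumption of an in-memory sqlite3 database with simplified names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table tt (pid, pn, idate);
-- Newest row first on purpose: an unordered scan could visit it first
-- and let the older row replace it.
insert into tt values ('p001', 'Titin', '2019-02-24');
insert into tt values ('p001', 'Titin', '2019-02-11');
create table t0 (pid, pn, idate, primary key(pid, pn));
""")

# Ascending ORDER BY guarantees the newest idate is replaced last,
# so it survives, whatever scan order the planner picks.
con.execute("""
insert or replace into t0
  select pid, pn, idate from tt order by idate
""")

final = con.execute("select idate from t0").fetchall()
print(final)
```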

(6) By jose isaias cabrera (jicman) on 2022-08-19 21:17:22 in reply to 5

Without an explicit order by clause you'll get whatever the query planner came up with and it isn't guaranteed to be stable.

Hmmm...

I bet you can speed up the select with the group by if you added appropriate indexes.

I tried various ones, and they added more time instead of speeding things up.

CREATE INDEX PL_ProjID_Project_Type ON Project_List (ProjID, Project_Type);
CREATE INDEX PL_ProjID_Project_Name ON Project_List (ProjID, Project_Name);
CREATE INDEX PL_ProjID_Project_Manager ON Project_List (ProjID, Project_Manager);
CREATE INDEX PL_ProjID_Manager ON Project_List (ProjID, Manager);
CREATE INDEX PL_ProjID_PPM ON Project_List (ProjID, PPM);
CREATE INDEX PL_ProjID_BRD ON Project_List (ProjID, BRD);
CREATE INDEX PL_ProjID_Business_Owner ON Project_List (ProjID, Business_Owner);
CREATE INDEX PL_ProjID_Start_date ON Project_List (ProjID, Start_date);
CREATE INDEX PL_ProjID_Finish_Date ON Project_List (ProjID, Finish_Date);
CREATE INDEX PL_ProjID_Target_Go_Live_Date ON Project_List (ProjID, Target_Go_Live_Date);
CREATE INDEX PL_ProjID_Purpose ON Project_List (ProjID, Purpose);
CREATE INDEX PL_ProjID_Investment_Initiative ON Project_List (ProjID, Investment_Initiative);
CREATE INDEX PL_ProjID_Business_OBS ON Project_List (ProjID, Business_OBS);
CREATE INDEX PL_ProjID_IT_OBS ON Project_List (ProjID, IT_OBS);

Hmmm, I have to think about this. Thanks.

(7) By Simon Slavin (slavin) on 2022-08-19 23:14:44 in reply to 6

Okay. Each index you add means more writing for SQLite to do every time it writes a new record. What you did above is, as you found out, not The Right Thing. You have to add the right indexes (which may be just one index), not all indexes.

Remove all your CREATE INDEX commands, and, if you're working with an existing database file, drop all the indexes they've created.

With the SELECT commands you need in mind, read this:

https://www.sqlite.org/cli.html#index_recommendations_sqlite_expert_

(8) By jose isaias cabrera (jicman) on 2022-08-20 01:37:44 in reply to 7

Hi Simon. These indexes were the ones suggested by .expert:

Enter ".help" for usage hints.
sqlite> .expert
sqlite>   INSERT OR REPLACE INTO PL_ProjType_Search
   ...>     SELECT ProjID, Project_Type, max(InsertDate) AS InsertDate FROM Project_List
   ...>     GROUP by ProjID, Project_Type
   ...> ;
CREATE INDEX Project_List_idx_c8c4e1ac ON Project_List(ProjID, Project_Type);

SCAN Project_List USING INDEX Project_List_idx_c8c4e1ac

Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite>

So, I am following the expert suggestion, but it slows down the insert instead of speeding it up. On Monday, I will test this more thoroughly and provide more info, if needed. Thanks.

(9) By David Raymond (dvdraymond) on 2022-08-24 14:37:03 in reply to 1

create table t0 (pid, pn, idate, primary key(pid, pn));
create table t1 (pid, type, idate, primary key(pid, type));
insert or replace into t0 select pid, pn, idate from tt;
insert or replace into t1 select pid, type, idate from tt;

Instead of replacing all the values each time there's a duplicate, would an upsert ignore them altogether and save time?

insert into t0 select pid, pn, idate from tt
--order by <something to make the one you really want come first>
on conflict (pid, pn) do nothing;

insert into t1 select pid, type, idate from tt
--order by <something to make the one you really want come first>
on conflict (pid, type) do nothing;
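
One hedged note on this upsert-from-SELECT form: SQLite's grammar can mis-parse ON CONFLICT after a bare FROM clause as a join's ON constraint, and the upsert documentation recommends ending the SELECT with a dummy WHERE clause (even just WHERE true). A minimal runnable sketch of the suggestion with that workaround, using illustrative names and an in-memory database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table tt (pid, pn, idate);
insert into tt values ('p001', 'Titin', '2019-02-24');
insert into tt values ('p001', 'Titin', '2019-02-11');
create table t0 (pid, pn, idate, primary key(pid, pn));
""")

# ORDER BY newest-first makes the row we really want arrive first;
# later duplicates hit the (pid, pn) primary key and are skipped.
con.execute("""
insert into t0
  select pid, pn, idate from tt
  where true               -- avoids the ON-as-join-constraint ambiguity
  order by idate desc
  on conflict (pid, pn) do nothing
""")

kept = con.execute("select idate from t0").fetchall()
print(kept)
```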

(10) By Ryan Smith (cuz) on 2022-08-24 15:17:20 in reply to 9

Instead of replacing all the values each time there's a duplicate, would an upsert ignore them altogether and save time?

Yes, it would, if no field in the row needs rewriting. Even so, a DO UPDATE clause will still be better than REPLACE: the row has to be rewritten either way, but the upsert avoids ON DELETE and ON INSERT triggers, firing only an ON UPDATE trigger if there is one.

Also, to resolve conflicts on multiple columns like (pid, pn) and (pid, type) I believe there must exist unique indexes on those.

i.e. it is my current belief (and I could be wrong) that the CONFLICT part of the upsert can only detect conflicts that are already part of the table's constraint checking mechanism, and will not test an arbitrary specified conflict by itself.

Whether those added indexes are welcomed or superfluous depends on the installation and is for the OP to assess.
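
That belief matches the documented behavior: the conflict target has to match an existing PRIMARY KEY or UNIQUE constraint, or SQLite rejects the statement outright. A small sketch (in-memory database, illustrative names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (pid, pn, idate)")  # no PK, no unique index

# Without a unique index covering (pid, pn), the upsert's conflict
# target cannot be matched, so the statement fails to prepare.
try:
    con.execute("""
        insert into t values ('p001', 'Titin', '2019-02-24')
        on conflict (pid, pn) do nothing
    """)
    outcome = "accepted"
except sqlite3.OperationalError as err:
    outcome = "rejected"
    print(err)

print(outcome)
```

So an upsert on (pid, pn) and (pid, type) does indeed require a unique index (or primary key) on each of those column pairs.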

(12) By jose isaias cabrera (jicman) on 2022-08-24 19:28:02 in reply to 10

Also, to resolve conflicts on multiple columns like (pid, pn) and (pid, type) I believe there must exist unique indexes on those.

I don't think there would be any conflict on these, since they could be used by themselves or in combination. These _Search tables are for users to search for one or more specific projects' data. Using the Project_List table by itself is really slow; these unique _Search tables respond much faster. For example, if the user wants to search for all projects having "CMP" in the project name, AND a PM name containing "Donna", AND a business owner containing "DiMauro", the search would be:

SELECT a.* FROM Project_List AS a 
JOIN PL_ProjName_Search b ON a.ProjID = b.ProjID 
JOIN PL_BusOwner_Search c ON b.ProjID = c.ProjID 
JOIN PL_Contact_Search d ON c.ProjID = d.ProjID 
WHERE a.InsertDate = 
( 
  select max(insertdate) from Project_List where projid = a.projid 
) 
AND b.Project_Name LIKE '%cmp%' 
AND b.InsertDate = 
( 
   select max(insertdate) from PL_ProjName_Search where projid = b.projid 
) 
AND c.Business_Owner LIKE '%dimauro%' 
AND c.InsertDate = 
( 
   select max(insertdate) from PL_BusOwner_Search where projid = c.projid 
) 
AND d.Project_Manager LIKE '%donna%' 
AND d.InsertDate = 
( 
   select max(insertdate) from PL_Contact_Search where projid = d.projid 
) 
ORDER BY a.ProjID,a.InsertDate 
; 

Run Time: real 0.116 user 0.109375 sys 0.000000

This is really fast compared with the Project_List table which has all the data:

SELECT a.* FROM Project_List AS a 
WHERE a.InsertDate =
(
    SELECT max(InsertDate) from Project_List where projid = a.ProjID
)
AND a.Project_Name LIKE '%cmp%'
AND a.Project_Manager LIKE '%donna%'
AND a.Business_Owner LIKE '%dimauro'
GROUP BY a.ProjID
;

Run Time: real 0.448 user 0.109375 sys 0.312500

The more criteria the user enters to dice the search, the longer this last one takes. The first query takes pretty much the same time no matter how many more fields are searched. The reason is that fields contain repeated data, because the data is based on imports from another system, which brings repeated data. I am trying to slowly normalize the db, but there are daily reports that are needed, so...

(11) By jose isaias cabrera (jicman) on 2022-08-24 18:31:02 in reply to 9

Instead of replacing all the values each time there's a duplicate, would an upsert ignore them altogether and save time?

I am going to try it. I have found that INSERTing a value is much faster than calculating whether or not to insert it (see reply #4). I will give it a try and see if it's worth it. Thanks.

(13) By jose isaias cabrera (jicman) on 2022-08-24 20:08:12 in reply to 9

Oh WOW! This is much better! Thanks.

BEGIN;
  INSERT INTO PL_ProjType_Search
    SELECT ProjID, Project_Type, max(InsertDate) AS InsertDate FROM Project_List 
    ORDER BY ProjID, Project_Type 
    ON CONFLICT (ProjID,Project_Type) DO NOTHING 
  ;
END;
...
...
BEGIN;
  INSERT INTO PL_ITOBS_Search
    SELECT ProjID, IT_OBS, max(InsertDate) AS InsertDate FROM Project_List 
    ORDER BY ProjID, IT_OBS 
    ON CONFLICT (ProjID,IT_OBS) DO NOTHING 
  ;
END;

This is the result of these changes:

Loading 0 records in table PL_ProjType_Search took 0.02 secs.
Loading 0 records in table PL_ProjName_Search took 0.02 secs.
Loading 0 records in table PL_Contact_Search took 0.01 secs.
Loading 0 records in table PL_ClarityPM_Search took 0.01 secs.
Loading 0 records in table PL_PMOContact_Search took 0.02 secs.
Loading 0 records in table PL_BRM_Search took 0.02 secs.
Loading 0 records in table PL_BusOwner_Search took 0.01 secs.
Loading 0 records in table PL_BegDate_Search took 0.02 secs.
Loading 0 records in table PL_EndDate_Search took 0.02 secs.
Loading 0 records in table PL_GoLive_Search took 0.01 secs.
Loading 0 records in table PL_Purpose_Search took 0.02 secs.
Loading 0 records in table PL_InvestInit_Search took 0.01 secs.
Loading 0 records in table PL_BusOBS_Search took 0.02 secs.
Loading 0 records in table PL_ITOBS_Search took 0.01 secs.
Total time loading all records in tables took: 0.23 secs.

Before, it would write no matter what. Thanks, David. This is great.

(14) By Keith Medcalf (kmedcalf) on 2022-08-25 04:04:18 in reply to 13

Did you forget a group by clause?

(15) By Keith Medcalf (kmedcalf) on 2022-08-25 04:12:29 in reply to 13

As in something like:

BEGIN;
  INSERT INTO PL_ProjType_Search
    SELECT ProjID, Project_Type, max(InsertDate) AS InsertDate FROM Project_List 
    GROUP BY ProjID, Project_Type 
    ON CONFLICT (ProjID,Project_Type) DO UPDATE SET InsertDate = excluded.InsertDate 
                                         WHERE excluded.InsertDate > InsertDate
  ;
END;
...
...
BEGIN;
  INSERT INTO PL_ITOBS_Search
    SELECT ProjID, IT_OBS, max(InsertDate) AS InsertDate FROM Project_List 
    GROUP BY ProjID, IT_OBS 
    ON CONFLICT (ProjID,IT_OBS) DO UPDATE SET InsertDate = excluded.InsertDate
                                   WHERE excluded.InsertDate > InsertDate
  ;
END;

(16) By jose isaias cabrera (jicman) on 2022-08-25 14:25:02 in reply to 15

As in something like:

Thank you. This is the right formula; otherwise, it will leave InsertDate at the first one it finds. Thanks.
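
For reference, that DO UPDATE ... WHERE pattern can be exercised end-to-end with a reduced illustrative schema (not the real Project_List; in-memory sqlite3 database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table src (pid, type, idate);
create table dst (pid, type, idate, primary key(pid, type));
-- dst already holds an older date for this key.
insert into dst values ('p001', 'Closure', '2019-02-11');
insert into src values ('p001', 'Closure', '2019-02-24');
insert into src values ('p001', 'Closure', '2019-02-01');
""")

# Aggregate per key, then update only when the incoming date is newer,
# so re-running the load never moves idate backwards.
con.execute("""
insert into dst
  select pid, type, max(idate) as idate from src
  where true                -- disambiguates the ON CONFLICT that follows
  group by pid, type
  on conflict (pid, type) do update set idate = excluded.idate
    where excluded.idate > dst.idate
""")

after = con.execute("select idate from dst").fetchall()
print(after)
```

The existing row is updated to the newest source date, and loading the same data again is a no-op because the WHERE guard fails.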