Operation speed

(1) By anonymous on 2021-03-11 02:48:10 [link] [source]

Hello everyone,

I am working on an embedded equipment in Linux C++.

It takes about 4.5 seconds to execute the SQL statement below.
 
DELETE FROM points_11 WHERE uint64_ID NOT IN (SELECT uint64_pointIDX FROM values_11);

There are 1440 records in table points_11, 46049 records in table values_11.
The structure of these 2 talbes is as below.

Is this normal?
Which aspects I cloud check to speed up please? 

Thanks a lot.


points_11
	|_ uint64_ID				INTEGER PRIMARY KEY
	|_ uint_UTC				INTEGER
	|_ uint_OSC				INTEGER
	|_ uint_storeUTC			INTEGER
	|_ uint_storeOSC			INTEGER
	|_ text_timesource			TEXT
	|_ text_status				INTEGER
	|_ uint_job				INTEGER
	|_ blob_raw				BLOB

values_11
	|_ uint64_ID				INTEGER PRIMARY KEY
	|_ uint64_pointIDX			INTEGER
	|_ text_name				TEXT
	|_ text_value				TEXT
	|_ text_unit				TEXT
	|_ uint_UTC				INTEGER
	|_ text_timesource			TEXT
	|_ blob_signature			BLOG
	|_ uint64_obisId			INTEGER
	|_ uint_regValue			INTEGER
	|_ uint_regFormat			INTEGER

(2) By Keith Medcalf (kmedcalf) on 2021-03-11 04:12:41 in reply to 1 [link] [source]

As phrased and assuming no indexes, and depending on the version of SQLite3 in use, then yes. Since you did not mention whether or not there are any indexes or the version of SQLite3 that would be an exercize called "20 Questions".

Create an index if there is not one on values_11 if not already one where the first column is uint64_pointIDX on that column and change the delete to:

delete from points_11 
      where not exists (
                        select * 
                          from values_11 
                         where points_11.uint64_id == values_11.uint64_pointIDX
                       );

so the number of operations is limited to be proportional to the number of rows in points_11 and a table scan of values_11 is not required.

(3) By Richard Damon (RichardDamon) on 2021-03-11 04:14:45 in reply to 1 [link] [source]

Adding an index to value_11 on the column uint64_pointIDX should help. I think it needs to do a complete table scan of values_11 for each row in points_11 to see it the value is there.

(4) By Keith Medcalf (kmedcalf) on 2021-03-11 04:29:44 in reply to 2 [link] [source]

Once the index is created, the optimizer in the upcoming 3.35.0 version will perform the NOT IN -> NOT EXISTS optimization and obtain more efficient code (by 10 bytecode instructions per row in points_11) in a proper circumstance.

(5) By Keith Medcalf (kmedcalf) on 2021-03-11 04:37:04 in reply to 4 [link] [source]

Ooops. Maybe I have that backwards. Anyway, the creation of the missing index is the first step. Then EXPLAIN QUERY PLAN and/or .eqp on are your best friend.

(6) By anonymous on 2021-03-11 08:55:53 in reply to 5 [source]

Hello Keith,

The SQLite version is 3.6.14.2, and there is no index created.

I've tried to create a index on table values_11's uint64_pointIDX column, but the time exhausted didn't decrease.

I'll check the usage of index later. 

Thanks.

Will

(7) By anonymous on 2021-03-11 08:58:38 in reply to 3 [link] [source]

Hi Richard,

I will try it as you suggested.

Thank you very much. 

Will

(8) By Richard Hipp (drh) on 2021-03-11 14:30:40 in reply to 6 [link] [source]

The SQLite version is 3.6.14.2

That version of SQLite is from 2009-05-25. If you upgrade to a newer version (such as the soon-to-be-released 3.35.0) you will find that almost everything will run faster, often much faster. (See the micro-optimization article for more information.) The database files and C/C++ language interface is completely backwards compatible, so upgrading should be as simple as recompiling.

(9.1) By Keith Medcalf (kmedcalf) on 2021-03-11 17:12:23 edited from 9.0 in reply to 6 [link] [source]

Without an index there are two ways to solve the query:

(1) Scan the entire values_11 table and collect all the uint64_pointIDX column into a temporary table/b-tree, then scan the points_11 table and for each row look up into the ephemeral table and see if the reference does not exist to determine whether to delete the row.

(2) Scan the points_11 table and for each row, scan the entire points_11 table to see if the reference does not exist to determine whether to delete the appropriate row.

Clearly (2) requires one scan of the table values_11 for each row in points_11 so (1) is going to "cost less" (unless statistics indicate that values_11 has a very low number of rows relative to points_11) so this method will be chosen (which is, in fact, what SQLite3 chooses).

Once an index exists then another option can be considered:

(3) Scan the points_11 table and for each row perform an index lookup on the values_11 index to see the condition for row deletion is met to determine whether to delete the row.

If the index exists then (3) will be "faster" than (1) by the "cost" of scanning values_11 and building the ephemeral table since the index can be used directly rather than having to scan the table to build the ephemeral table. (This possibility will also be detected by the SQLite3 optimizer and chosen if an appropriate index exists). This is effectively "spreading the cost" of the scan and build of the ephemeral table amongst all the insert/delete/update operations performed on values_11.

If pre-defining the index did not significantly reduce the elapsed time that indicates that the actual cost of scanning the values_11 table and building the ephemeral lookup table is very low -- so there may not be much point in maintaining the index at all (that is, the cost of maintaining the index may outweigh the benefit derived therefrom) and that the bottleneck lies elsewhere (that is, it is inherent in the design and methodology being used and that perhaps foreign key enforcement needs to be performed so that points_11 rows are deleted at the time the values_11 rows are deleted thus spreading the "cost" over a large number of operations).

(10) By Keith Medcalf (kmedcalf) on 2021-03-11 17:18:16 in reply to 6 [link] [source]

Alternately, since you have timestamp data, just keep track somewhere of when was the last time that you did this so that you do not need to "re-process" points_11 rows that you have already processed.

(11.1) By Keith Medcalf (kmedcalf) on 2021-03-11 17:46:15 edited from 11.0 in reply to 6 [link] [source]

Deleted

(12.1) By Keith Medcalf (kmedcalf) on 2021-03-11 17:47:22 edited from 12.0 in reply to 11.1 [link] [source]

Deleted