Slow query

(1.1) Originally by anonymous with edits by Richard Hipp (drh) on 2021-09-14 15:30:33 from 1.0 [source]

I have a table in an SQLite database in the following form (6 columns):

Timestamp	fst_field	...	fifth_field
10001	X	...	...
10003	Y	...	...
10005	Z	...	...
10020	W	...	...

It doesn't have any primary key and all the fields are of type text except for the Timestamp (int)

The following query:

SELECT s.Timestamp, s.fst_field , ..., s.fifth_field, p.Timestamp, p.fst_field, ..., p.fifth_field
FROM table s JOIN table p ON s.Timestamp < p.Timestamp AND p.Timestamp - s.Timestamp < 10

should output every row next to all its following rows within a threshold (here 10).

The output looks like the following (12 columns):

s.Timestamp	s.fst_field	...	s.fifth_field	p.Timestamp	p.fst_field	...
10001	X	...	...	10003	Y	...
10001	X	...	...	10005	Z	...
...	...	...	...

The query works for tables with 4000 rows. However, when I try it with tables with 22 million rows (1,4 GB file) it keeps running for days without ever stopping. I tried adding an index on the attribute 'Timestamp', to make the query faster but that didn't help. I ran the code both on Linux and Windows.

Any idea why this doesn't work?

(2) By anonymous on 2021-09-13 20:45:07 in reply to 1.0 [link] [source]

Did you try using in memory database?

(3) By Ryan Smith (cuz) on 2021-09-14 12:58:15 in reply to 1.0 [link] [source]

That does not look like a function that can be resolved by the Query planner easily.

I don't have your data so cannot easily test it, but this should make a difference, try testing it, especially if you have already added the index on timestamp and assuming timestamp is integer:

SELECT s.Timestamp, s.fst_field , ...
  FROM table s 
  JOIN table p ON p.Timestamp BETWEEN s.Timestamp - 9 AND s.Timestamp - 1

Note that this query on a billion-row table will produce output of easily 5-Billion+ rows (if your example data can be assumed an average indication). That in itself will take significant time to output, let alone query. Probably adding a WHERE clause to limit the Query domain to some specific time-frame or such would be better.

What is it you want to learn form repeating the close data together? There's almost certainly a better way to do the thing you want to do, but for that we need to know what the real information is you wish to glean from the query.

Let us know if that sped things up, and good luck.