Make a time-range query faster when "GROUP BY day"

(1) By huwehas on 2021-01-04 16:45:38 [link] [source]

Hi,

Let's say we have four shops A B C D, each of them having 26 products (A to Z) and 26*26 possible customers AA, AB, ..., ZZ buying these products:

CREATE TABLE transactions(id INTEGER, dt INTEGER, shop TEXT, product TEXT, customer TEXT, PRIMARY KEY(shop, dt, id)) WITHOUT ROWID;

Let's say we have 1 million rows spread in a 1-month datetime range, and we want to find the number of distinct customers who have purchased in the shop A for each day in a date range of 10 days (10 x 24 x 3600 seconds). This query works:

SELECT strftime('%Y-%m-%d', datetime(dt, 'unixepoch', 'localtime')) AS day, COUNT(DISTINCT customer) AS count FROM transactions WHERE shop == 'A' AND dt BETWEEN 1600000000+5*24*3600 AND 1600000000+15*24*3600 GROUP BY day

Interesting fact: the WHERE conditions are fully covered by the PRIMARY KEY / the B-tree.

I used WITHOUT ROWID here, because it's faster and the DB is smaller: there is no need to have one B-tree for an index + another B-tree for rowid.

The problem is: the query is rather slow because:

QUERY EXPLAIN gives USE TEMP B-TREE FOR GROUP BY: is there a way to avoid this temporary B-tree? After all, the table's B-tree is sorted in datetime so it should be pretty straightforward to group consecutive rows when it's the same day?
Here I suspect that strftime(... datetime(...)) is executed for each row, which is slow.

However I can't do a generated column because I need group by day (in localtime), and it datetime(..., 'localtime') is non-deterministic.

What would you do to improve these queries?

Cheers

Fully reproducible Python test code:

import sqlite3, time, random, string
db = sqlite3.connect(':memory:')
db.executescript("""CREATE TABLE transactions(id INTEGER, dt INTEGER, shop TEXT, product TEXT, customer TEXT, PRIMARY KEY(shop, dt, id)) WITHOUT ROWID;""")
for i in range(1*1000*1000):
    db.execute("INSERT INTO transactions(id, dt, shop, product, customer) VALUES (?, ?, ?, ?, ?)", 
                        (i, random.randint(1600000000, 1600000000+31*24*3600), random.choice('ABCD'), random.choice(string.ascii_uppercase), ''.join(random.choices(string.ascii_uppercase, k=2))))
db.commit()    
print('size (MB):', next(db.execute('PRAGMA page_count;'))[0] * next(db.execute('PRAGMA page_size;'))[0] / 1024**2)
db.execute("SELECT strftime('%Y-%m-%d', datetime(dt, 'unixepoch', 'localtime')) AS day, COUNT(DISTINCT customer) AS count FROM transactions WHERE shop == 'A' AND dt BETWEEN 1600000000+5*24*3600 AND 1600000000+15*24*3600 GROUP BY day")

(2.3) By little-brother on 2021-01-04 18:20:27 edited from 2.2 in reply to 1 [link] [source]

I can't do a generated column because I need group by day (in localtime)

You group data with timezones, but fetch them without. Hmmm.
Why don't you use indexes on shop and dt? Without them I have a full table scan.

P.S. Aren't strftime('%Y-%m-%d', datetime(dt, 'unixepoch', 'localtime')) and strftime('%Y-%m-%d', dt, 'unixepoch', 'localtime') the same thing?

(3) By huwehas on 2021-01-04 20:01:59 in reply to 2.3 [link] [source]

You group data with timezones, but fetch them without. Hmmm.

No, both are with timezones (grouping and fetching):

SELECT strftime(...) AS day .... GROUP BY day

Why don't you use indexes on shop and dt? Without them I have a full table scan.

There is an implicit index on (shop, dt) coming from the PRIMARY KEY(...) of the WITHOUT ROWID table:

CREATE TABLE transactions(id INTEGER, dt INTEGER, shop TEXT, product TEXT, customer TEXT, PRIMARY KEY(shop, dt, id)) WITHOUT ROWID;

This is confirmed by EXPLAIN QUERY PLAN:

SEARCH TABLE transactions USING PRIMARY KEY (shop=? AND dt>? AND dt<?)

The main question I had here is: why is the GROUP BY using a temporary B-tree since strftime(...) AS DAY is already sorted?

Does GROUP BY uses a temporary hash table or is the GROUP BY done by sorting in general with Sqlite?

(10) By little-brother on 2021-01-05 07:27:32 in reply to 3 [link] [source]

No, both are with timezones (... fetching): AND dt BETWEEN 1600000000+5243600 AND 1600000000+15243600 GROUP BY day")

If 1600000000 is calculated as a local time then is OK.

There is an implicit index on (shop, dt)

Yeah, I was wrong. I drop shop = 'A'-condition and forgot about it :(

(15) By TripeHound on 2021-01-05 09:51:02 in reply to 3 [source]

The main question I had here is: why is the GROUP BY using a temporary B-tree since strftime(...) AS DAY is already sorted?

One thing I think you're missing is that while it may be "obvious" to a human that if the inputs to strftime are sorted, then its outputs will be sorted, that fact is (almost certainly) not known to the query-planner (QP).

SQLite functions can be marked deterministic, meaning that if you give them the same input(s) they will generate the same output, and the query-planner may use this knowledge to its advantage.

However, as far as I know¹, there is nothing like (what might be called) a monotonic marker, that would assert that if a >= b then f(a) >= f(b) and therefore if a collection of inputs is sorted, the collection of outputs of a function so-marked will remain sorted.

Without that knowledge, the QP cannot assume anything about the outputs from strftime() (=DAY) and must therefore create a temporary b-tree to sort or group them.

¹ I've not actually dug into the sources to check, but in the four or so years I've been following the SQLite mailing list, I've never heard mention of any ability of the QP to "know" that sorted inputs to a function will give sorted outputs. I'm sure someone will correct me if I'm wrong.

(4) By Keith Medcalf (kmedcalf) on 2021-01-04 20:03:22 in reply to 1 [link] [source]

FIrstly, when you incant "day" you appear to mean "date". The first step would be to simplify the convoluted calculation of the date:

strftime('%Y-%m-%d', datetime(dt, 'unixepoch', 'localtime')) is really just overblown twice redundant CPU using expression for date(dt, 'unixepoch', 'localtime')

When I run your test it is "pretty quick".

>t.py
size (MB): 21.13671875
Elapsed 7.119234800338745
('2020-09-18', 674)
('2020-09-19', 676)
('2020-09-20', 676)
('2020-09-21', 676)
('2020-09-22', 676)
('2020-09-23', 676)
('2020-09-24', 676)
('2020-09-25', 676)
('2020-09-26', 676)
('2020-09-27', 676)
('2020-09-28', 671)
Elapsed 0.1614091396331787

After fixing the twice redundant date computation it is, as anticipated, even quicker:

>t.py
size (MB): 21.05078125
Elapsed 7.2174248695373535
('2020-09-18', 676)
('2020-09-19', 676)
('2020-09-20', 676)
('2020-09-21', 676)
('2020-09-22', 676)
('2020-09-23', 676)
('2020-09-24', 676)
('2020-09-25', 676)
('2020-09-26', 676)
('2020-09-27', 676)
('2020-09-28', 645)
Elapsed 0.08199238777160645

81 milliseconds seems "pretty quick" to me, though 161 milliseconds is no slaggard either.

(5) By huwehas on 2021-01-04 20:23:20 in reply to 4 [link] [source]

Oh thanks, I forgot that date(...) was a direct way to have Y-m-d.

More generally, doing:

SELECT func(column) FROM data WHERE ... GROUP BY ...

(where func is here date) is probably bad because it applies func(...) to every row.

What would you do in this case to speed up queries?

Store the unix timestamp and also store the pre-cooked date(dt, 'unixepoch', 'localtime') AS day, so that SELECT ... GROUP BY day will be fast?
Or store only the unix timestamp ? and do date(...) during the SELECT queries?
Or wouldn't you use the unix timestamp at all, and store the date as a ISO8601 string "YYYY-MM-DD HH:MM:SS" instead?

My goal is: keep the date in the DB with second precision, but have very fast GROUP BY day queries.

(8.4) By Keith Medcalf (kmedcalf) on 2021-01-05 09:17:19 edited from 8.3 in reply to 5 [link] [source]

Is it not fast enough? Why are you trying to solve a problem that does not yet exist?

If it were me, and if it were worthwhile (which it probably is not) then I would pre-compute the group table of localtime dates and utc boundaries then dip the data I wanted as a correlated subquery.

Example:

with alldt(dt, ufrom, uto)
  as (
         select date(?1, '+0 days'),
                cast(strftime('%s', ?1, 'start of day', 'utc') as integer),
                cast(strftime('%s', ?1, 'start of day', '+1 days', 'utc') as integer) - 1
      union all
         select date(dt, '+1 days'),
                cast(strftime('%s', dt, '+1 days', 'utc') as integer),
                cast(strftime('%s', dt, '+2 days', 'utc') as integer) -1
          from alldt
         where dt < date(?2, '+0 days')
     )
select dt,
       (
        select count(distinct customer)
          from transactions
         where shop == ?3
           and dt between ufrom and uto
       ) as Customers
  from alldt
;

where you bind the three parameters being the start and end dates (in localtime ISO8601 string format) and the shop.

NB: Edited to correct name of table
NB: I get different results because of timezone difference
NB: Even for your sample data the above is almost twice as fast again coming in at 45 milliseconds
NB: Changed the date(?1) and date(?2) to date(..., '+0 days') which will attempt to make a bad date into a good date -- example, if you use the date 2021-02-29 it will silently change it to the date 2021-03-01 because February 2021 only has 28 days. Also got rid of the constant select for the ending date because current versions of SQLite3 will correctly put the constant expression in a once group without the extra constant select.

(9) By Keith Medcalf (kmedcalf) on 2021-01-05 05:13:53 in reply to 8.3 [link] [source]

This one will get you all the data by shop and date within the range specified by the two date parameters:

with alldt(dt, ufrom, uto)
  as (
         select date(?1),
                cast(strftime('%s', ?1, 'start of day', 'utc') as integer),
                cast(strftime('%s', ?1, 'start of day', '+1 days', 'utc') as integer) - 1
      union all
         select date(dt, '+1 days'),
                cast(strftime('%s', dt, '+1 days', 'utc') as integer),
                cast(strftime('%s', dt, '+2 days', 'utc') as integer) - 1
          from alldt
         where dt < date(?2)
     ),
     allshop(shop)
  as (
      select distinct shop
        from transactions
       where dt between cast(strftime('%s', ?1, 'start of day', 'utc') as integer)
                    and cast(strftime('%s', ?2, 'start of day', '+1 day', 'utc') as integer) -1
     )
select dt,
       shop,
       (
        select count(distinct customer)
          from transactions
         where shop == allshop.shop
           and dt between ufrom and uto
       ) as CustomerCount
  from alldt, allshop
;

(14.1) By huwehas on 2021-01-05 11:21:34 edited from 14.0 in reply to 9 [link] [source]

Thank you, I will try with your code.

In the meantime I think I found a nearly optimal solution: store both the unix timestamp (if one day I need it) and also the "Ymd" date as INTEGER, ex: 20200918, and do the indexing on (shop, day) (with the WITHOUT ROWID PRIMARY KEY's implicit index).

Then, when shop is fixed to A, both WHERE and GROUP BY can use the index.

Then I get a ~ 40ms query, instead of ... ~ 120 ms initially.
(Edit: I initially thought I had a 4 ms query, but this was because I was just doing the query and not fetching the result rows).

Reproducible Python code:

import sqlite3, time, random, string, datetime
db = sqlite3.connect(':memory:')
db.executescript("CREATE TABLE transactions(id INTEGER, dt INTEGER, day INTEGER, shop TEXT, product TEXT, customer TEXT, PRIMARY KEY(shop, day, id)) WITHOUT ROWID;")
for i in range(1*1000*1000):
    dt = random.randint(1600000000, 1600000000+31*24*3600)
    d = int(datetime.datetime.utcfromtimestamp(dt).strftime('%Y%m%d'))
    db.execute("INSERT INTO transactions(id, dt, day, shop, product, customer) VALUES (?, ?, ?, ?, ?, ?)", 
                        (i, dt, d, random.choice('ABCD'), random.choice(string.ascii_uppercase), ''.join(random.choices(string.ascii_uppercase, k=2))))
db.commit()    
print('size (MB):', next(db.execute('PRAGMA page_count;'))[0] * next(db.execute('PRAGMA page_size;'))[0] / 1024**2)
t0 = time.time()
for r in db.execute("SELECT day, COUNT(DISTINCT customer) AS count FROM transactions WHERE shop == 'A' AND day BETWEEN 20200918 AND 20200928 GROUP BY day"):
    pass
print(time.time()-t0)

(16) By Keith Medcalf (kmedcalf) on 2021-01-05 10:44:58 in reply to 14.0 [link] [source]

I suppose it mostly depends what you want in the table and where you want to spend the resources. I get the following:

t1.py

import sqlite3, time, random, string
db = sqlite3.connect(':memory:')
db.executescript("""CREATE TABLE transactions(id INTEGER, dt INTEGER, shop TEXT, product TEXT, customer TEXT, PRIMARY KEY(shop, dt, id)) WITHOUT ROWID;""")
st = time.time()
for i in range(1*1000*1000):
    db.execute("INSERT INTO transactions(id, dt, shop, product, customer) VALUES (?, ?, ?, ?, ?)",
         (i, random.randint(1600000000, 1600000000+31*24*3600), random.choice('ABCD'), random.choice(string.ascii_uppercase), ''.join(random.choices(string.ascii_uppercase, k=2))))
db.commit()
print('size (MB):', next(db.execute('PRAGMA page_count;'))[0] * next(db.execute('PRAGMA page_size;'))[0] / 1024**2, 'Elapsed', time.time() - st)
st = time.time()
for row in db.execute("SELECT date(dt, 'unixepoch', 'localtime') AS day, COUNT(DISTINCT customer) AS count FROM transactions WHERE shop == 'A' AND dt BETWEEN 1600000000+5*24*3600 AND 1600000000+15*24*3600 GROUP BY day"):
    print(row)
print('Elapsed', time.time()-st)
sql = '''
with alldt(dt, ufrom, uto)
  as (
         select date(?1, '+0 days'),
                cast(strftime('%s', ?1, 'start of day', 'utc') as integer),
                cast(strftime('%s', ?1, 'start of day', '+1 days', 'utc') as integer) - 1
      union all
         select date(dt, '+1 days'),
                cast(strftime('%s', dt, '+1 days', 'utc') as integer),
                cast(strftime('%s', dt, '+2 days', 'utc') as integer) -1
          from alldt
         where dt < date(?2, '+0 days')
     )
select dt,
       (
        select count(distinct customer)
          from transactions
         where shop == ?3
           and dt between ufrom and uto
       ) as Customers
  from alldt
;'''
st = time.time()
for row in db.execute(sql, ('2020-09-18', '2020-09-28', 'A')):
    print(row)
print('Elapsed', time.time()-st)

t2.py

import sqlite3, time, random, string, datetime
db = sqlite3.connect(':memory:')
db.executescript("CREATE TABLE transactions(id INTEGER, dt INTEGER, day INTEGER, shop TEXT, product TEXT, customer TEXT, PRIMARY KEY(shop, day, id)) WITHOUT ROWID;")
st = time.time()
for i in range(1*1000*1000):
    dt = random.randint(1600000000, 1600000000+31*24*3600)
    d = int(datetime.datetime.utcfromtimestamp(dt).strftime('%Y%m%d'))
    db.execute("INSERT INTO transactions(id, dt, day, shop, product, customer) VALUES (?, ?, ?, ?, ?, ?)",
                        (i, dt, d, random.choice('ABCD'), random.choice(string.ascii_uppercase), ''.join(random.choices(string.ascii_uppercase, k=2))))
db.commit()
print('size (MB):', next(db.execute('PRAGMA page_count;'))[0] * next(db.execute('PRAGMA page_size;'))[0] / 1024**2, 'Elapsed', time.time() - st)
t0 = time.time()
for row in db.execute("SELECT day, COUNT(DISTINCT customer) AS count FROM transactions WHERE shop == 'A' AND day BETWEEN 20200918 AND 20200928 GROUP BY day"):
    print(row)
print(time.time()-t0)

yielding:

>t1
size (MB): 21.08984375 Elapsed 7.32311224937439
('2020-09-18', 676)
('2020-09-19', 676)
('2020-09-20', 676)
('2020-09-21', 676)
('2020-09-22', 676)
('2020-09-23', 676)
('2020-09-24', 676)
('2020-09-25', 676)
('2020-09-26', 676)
('2020-09-27', 676)
('2020-09-28', 648)
Elapsed 0.07716655731201172
('2020-09-18', 676)
('2020-09-19', 676)
('2020-09-20', 676)
('2020-09-21', 676)
('2020-09-22', 676)
('2020-09-23', 676)
('2020-09-24', 676)
('2020-09-25', 676)
('2020-09-26', 676)
('2020-09-27', 676)
('2020-09-28', 676)
Elapsed 0.029463529586791992

>t2
size (MB): 26.74609375 Elapsed 15.420055627822876
(20200918, 676)
(20200919, 676)
(20200920, 676)
(20200921, 676)
(20200922, 676)
(20200923, 676)
(20200924, 676)
(20200925, 676)
(20200926, 676)
(20200927, 676)
(20200928, 676)
0.0339205265045166

(17) By huwehas on 2021-01-05 11:22:58 in reply to 16 [link] [source]

You're right. My 4ms was wrong because I was only doing:

db.execute(...)

and not really fetching the results. Doing

for r in db.execute(...):
    ...

gives now a timing ~ 40 ms.

(11) By anonymous on 2021-01-05 07:59:22 in reply to 8.3 [link] [source]

Well there is a problem with the date-time function.

For example, DS_DTS_UPDATED TEXT NOT NULL DEFAULT (DATETIME('NOW', 'UTC'))

Will give an error of two hours of in Amsterdam

(12) By Keith Medcalf (kmedcalf) on 2021-01-05 08:42:51 in reply to 11 [link] [source]

The timezone modifiers are as follows:

'localtime' means to modify the datetime value, which is assumed to be in the UTC+00 timezone, to the local timezone by adding the offset that the Operating System believes should be applied to convert that UTC+00 time to the local time zone.
'utc' means to modify the datetime value, which is assumed to be in the local time zone, to the UTC+00 timezone by subtracting the offset that the Operating System believes should be applied to convert that localtime to the UTC+00 timezone.

'now' is the current time in the UTC+00 timezone.

So your request to convert the time, which is already in UTC+00 timezone, by subtracting the Operating Systems offset between UTC+00 and the local time zone is the root cause of the error.

(6.1) By huwehas on 2021-01-06 09:37:15 edited from 6.0 in reply to 4 [link] [source]

Deleted

(7.1) By Richard Hipp (drh) on 2021-01-05 01:52:40 edited from 7.0 in reply to 6.0 [link] [source]

~~What email address did you use?~~

Never mind - I found the entry.

Your email server is rejecting all email from the fossil-scm.org domain because fossil-scm.org is listed in www.spamhaus.org (or so it says). I think this means that you won't be able to sign up for the forum.

Maybe get a gmail or outlook email account instead of "gget.it"?

(13.1) By huwehas on 2021-01-06 09:37:08 edited from 13.0 in reply to 7.1 [link] [source]

Deleted