transient indices vs. LIMIT
My reading of the docs at https://www.sqlite.org/draft/tempfiles.html, in particular section 2.8, tells me that:
If I do a DISTINCT, GROUP BY (for DISTINCT ON), or ORDER BY and combine it with a LIMIT, SQLite will still sort all the rows using a transient index and then apply the LIMIT to the sorted rows.
In principle you could do better by keeping an index of at most k distinct minimum values, where k is the LIMIT amount (OFFSET complicates this, as usual). Maybe a heap.
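To illustrate the idea outside SQLite: Python's heapq.nsmallest does exactly this kind of bounded top-k selection, keeping at most k candidates instead of sorting everything. This is just a sketch of the technique, not anything SQLite actually does.

```python
import heapq
import random

random.seed(1)
rows = [random.random() for _ in range(100_000)]

# Rough equivalent of SELECT b FROM a ORDER BY b LIMIT 10, but instead of a
# full O(n log n) sort, nsmallest maintains a bounded heap of at most k
# candidates: O(n log k) time, O(k) memory.
k = 10
top_k = heapq.nsmallest(k, rows)

# Same answer as the full sort, without materializing it.
assert top_k == sorted(rows)[:k]
```

An OFFSET m can be handled the same way by keeping k + m candidates and discarding the first m at the end, which is why OFFSET "complicates this" only by growing the bounded structure.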
Have I read that correctly? i.e. using LIMIT with ORDER BY doesn't help that much.
I wasn't really expecting any sort of advanced "heap streaming" or anything like that but I was trying to cite the docs for clarity in a talk I'm giving.
OTOH it might be fun to try to add such a thing... There are lots of cases where getting the "most recent" 'k' rows comes up and it would be cool to be able to do that more cheaply and then join from there to get your additional data.
Anyone already looking into that?
A heap could help for ORDER BY, but I don't see how it helps for DISTINCT. Using a smaller index (only holding items that are currently candidates to meet the limit) would change the complexity from O(n log n) to O(n log k), where n is the number of rows in the table.

In Python (3.6.8), I turned journal_mode off and built a table with 20 million rows of floats (from random.random()). The schema was CREATE TABLE a(b); At that point, Windows Task Manager said my process had done a couple of hundred reads and about 80k writes, and the file size was 330 MB.

I did SELECT COUNT(*) FROM a; which added about 80k reads to the total. As expected, it looks like it had to read the entire database to satisfy that query.

I twice did SELECT DISTINCT b FROM a ORDER BY b LIMIT 10; each time that resulted in about 80k reads and about 60k writes. I'm not sure what that says about the implementation. 80k reads is what it takes to read the table once; it isn't enough to do that and also build a full-size additional index. However, it shouldn't take any writes at all to maintain a 10-entry temporary index, so it isn't clear what those 60k writes were actually doing. I never saw the Python process's memory go much above 10 MB, so it certainly wasn't holding the full database (or a full temporary index) in memory.
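For anyone who wants to repeat this, here is a sketch of the setup at a much smaller scale (100k rows instead of 20 million, and in-memory rather than on disk, so the I/O counters won't match the numbers above). EXPLAIN QUERY PLAN is included because on recent SQLite versions it typically reports a temp b-tree for this query, which lines up with the docs' description.

```python
import random
import sqlite3

# Smaller-scale re-creation of the experiment; the original used Python 3.6.8,
# an on-disk database, and 20 million rows.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=OFF")
conn.execute("CREATE TABLE a(b)")
random.seed(42)
conn.executemany("INSERT INTO a VALUES (?)",
                 ((random.random(),) for _ in range(100_000)))

(n,) = conn.execute("SELECT COUNT(*) FROM a").fetchone()
top10 = [b for (b,) in
         conn.execute("SELECT DISTINCT b FROM a ORDER BY b LIMIT 10")]

# The plan output is version-dependent; it usually shows a full scan of `a`
# plus USE TEMP B-TREE lines for the DISTINCT/ORDER BY.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT DISTINCT b FROM a ORDER BY b LIMIT 10"):
    print(row)
print(n, top10[:3])
```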
SELECT DISTINCT with a limit of, say, k means you only need k distinct rows, not necessarily in any order, so a side index helps there too, and you can bail out early. DISTINCT plus ORDER BY and LIMIT makes that a little harder, but you can still do it in one pass in O(n log k), as you say. For sufficiently small k you could avoid disk writes entirely.
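The one-pass DISTINCT + ORDER BY + LIMIT case can be sketched like this: a max-heap (negated values in Python's min-heap) of the k smallest distinct values seen so far, plus a membership set, both bounded by k entries. This is an illustration of the proposed technique, not SQLite's implementation.

```python
import heapq

def distinct_smallest_k(values, k):
    """One-pass sketch of SELECT DISTINCT v ... ORDER BY v LIMIT k.

    Keeps at most k candidates: a negated min-heap (so heap[0] is the
    largest current candidate) plus a set for O(1) duplicate checks.
    O(n log k) time, O(k) memory.
    """
    heap, members = [], set()
    for v in values:
        if v in members:          # duplicate of a current candidate
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:        # beats the worst current candidate
            evicted = -heapq.heapreplace(heap, -v)
            members.remove(evicted)
            members.add(v)
    return sorted(-x for x in heap)

# distinct_smallest_k([5, 1, 3, 1, 2, 5, 4], 3) -> [1, 2, 3]
```

The unordered SELECT DISTINCT ... LIMIT k case is even simpler: fill a k-entry set and stop scanning as soon as it is full (the early-out).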
Based on your numbers, that's pretty strong evidence that they're doing a full sort in those cases with a transient index (which is what the docs seem to indicate).
Definitely opportunity there.
I'd call it strong evidence that they aren't doing a full sort.
To do a full sort of 320 MB of data in a process that is only using 10 MB of memory, you would need more than 320 MB of reads plus 240 MB of writes.
There weren't enough writes to just copy the input (unless individual writes were bigger than reads, which I suppose is possible).
It seems like there were way too many writes to maintain a small index, even if they did a write (or a few writes) when the top-10 contents change.
I'd expect that every time you double the number of records you've already processed, you normally get between 5 and 10 new entries in the top-10.
Log2(20e6) is about 24, so that would imply you can write every change to the top-10 index with just a few hundred writes.
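That "few hundred" estimate can be checked by simulation: count how often a bounded top-10 structure actually changes while scanning random floats. (For random input the expected count is roughly 10 · ln(n/10) + 10, which is in the low hundreds even for millions of rows.) A small sketch, scaled down to one million rows:

```python
import heapq
import random

def count_topk_updates(values, k=10):
    """Count how many times the running smallest-k set changes.

    Max-heap of the k smallest values, stored negated in a min-heap so
    heap[0] is the negation of the worst current candidate.
    """
    heap, updates = [], 0
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
            updates += 1
        elif v < -heap[0]:
            heapq.heapreplace(heap, -v)
            updates += 1
    return updates

random.seed(0)
n = 1_000_000
updates = count_topk_updates((random.random() for _ in range(n)), k=10)
# Expect roughly 10 * ln(n / 10) + 10 ~ 125 changes -- tiny compared to n,
# so writing every top-10 change really would take only a few hundred writes.
print(updates)
```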
If instead of keeping the top-10, they kept the top page of the index (a few hundred entries), that might explain the number of writes I saw.