Should SQLite be able to optimize this query?

(1) By Simon Willison (simonw) on 2021-07-15 16:07:11 [link] [source]

I spotted something interesting about this query. The following executes in about 47ms (against a table containing 33,000 rows):

select country_long, count(*)
from (
  select * from [global-power-plants] order by rowid
)
where country_long is not null
group by country_long
order by count(*) desc

Note that there's an order by rowid in that inner query which is unnecessary - the outer group by query doesn't need to care about the order of the rows in that inner query.

That same query again, without the unnecessary order by rowid:

select country_long, count(*)
from (
  select * from [global-power-plants]
)
where country_long is not null
group by country_long
order by count(*) desc

This executes in 7ms against the same table, returning the same results.

Would it be possible for SQLite to notice that the inner order by is unnecessary and ignore it when executing this kind of query?

(The reason I'm using a nested query here is that my software is designed to be able to calculate "facet counts" against the results of any arbitrary SQL query, which this pattern lets me do.)

(2) By mzm2021 on 2021-07-15 16:16:50 in reply to 1 [link] [source]

Side note here: Not really an answer to your question, but wouldn't using order by 2 even faster than order by count(*)?

select country_long, count(*)
from (
  select * from [global-power-plants] order by rowid
)
where country_long is not null
group by country_long
order by 2 desc

(3) By Richard Hipp (drh) on 2021-07-15 16:25:47 in reply to 1 [link] [source]

Can you post the database schema, please?

(4) By Simon Willison (simonw) on 2021-07-15 16:31:09 in reply to 3 [link] [source]

CREATE TABLE "global-power-plants" (
"country" TEXT,
  "country_long" TEXT,
  "name" TEXT,
  "gppd_idnr" TEXT,
  "capacity_mw" REAL,
  "latitude" REAL,
  "longitude" REAL,
  "primary_fuel" TEXT,
  "other_fuel1" TEXT,
  "other_fuel2" TEXT,
  "other_fuel3" TEXT,
  "commissioning_year" REAL,
  "owner" TEXT,
  "source" TEXT,
  "url" TEXT,
  "geolocation_source" TEXT,
  "wepp_id" TEXT,
  "year_of_capacity_data" INTEGER,
  "generation_gwh_2013" REAL,
  "generation_gwh_2014" REAL,
  "generation_gwh_2015" REAL,
  "generation_gwh_2016" REAL,
  "generation_gwh_2017" REAL,
  "generation_data_source" TEXT,
  "estimated_generation_gwh" REAL
);
CREATE INDEX ["global-power-plants_country_long"] ON [global-power-plants]("country_long");
CREATE INDEX ["global-power-plants_owner"] ON [global-power-plants]("owner");

You can download the full database from https://global-power-plants.datasettes.com/global-power-plants.db (11.1 MB)

(5) By Simon Willison (simonw) on 2021-07-15 16:32:14 in reply to 2 [link] [source]

It doesn't look like that speeds things up - I would expect SQLite to spot that optimization already.

(6) By Simon Willison (simonw) on 2021-07-15 16:40:28 in reply to 4 [link] [source]

explain query plan select country_long, count(*)
from (
  select * from [global-power-plants] order by rowid
)
where country_long is not null
group by country_long
order by count(*) desc

Output:

id,parent,notused,detail
2,0,0,CO-ROUTINE 1
5,2,0,SCAN TABLE global-power-plants
52,0,0,SCAN SUBQUERY 1
57,0,0,USE TEMP B-TREE FOR GROUP BY
92,0,0,USE TEMP B-TREE FOR ORDER BY

And for the query without the order by rowid:

explain query plan select country_long, count(*)
from (
  select * from [global-power-plants]
)
where country_long is not null
group by country_long
order by count(*) desc

Output:

id,parent,notused,detail
8,0,0,"SCAN TABLE global-power-plants USING INDEX ""global-power-plants_country_long"""
40,0,0,USE TEMP B-TREE FOR ORDER BY

(7) By Richard Hipp (drh) on 2021-07-15 17:18:08 in reply to 1 [link] [source]

Would it be possible for SQLite to notice that the inner order by is unnecessary and ignore it when executing this kind of query?

That seems simple, doesn't it. And yet, when I omit the ORDER BY clause from a subquery in the FROM clause that lacks a LIMIT and does not use window functions, I still get 1524 errors from "make test". It will take some time to analyze this and figure out what is going on.

(8) By Simon Willison (simonw) on 2021-07-15 17:52:58 in reply to 7 [link] [source]

I'm glad to have posed something that turns out to be an interesting problem!

(9) By Richard Hipp (drh) on 2021-07-15 18:06:07 in reply to 7 [link] [source]

I'm finding test cases like this:

SELECT group_concat(word,',')
  FROM (SELECT word FROM dictionary ORDER BY 1);

The idea here is that you want to feed the values into group_concat() in a specific order. The ORDER BY clause in the subquery is used to impose that order.

Now, I don't think the SQL language ever guarantees an order in this particular circumstance. So, technically, we should be free to ignore the ORDER BY clause in the subquery above. But I'm wondering how often this kind of thing occurs in the wild, and how many applications will break if I add an optimization to ignore (technically) useless ORDER BY clauses in subqueries?

If you had an application that has been working great for 10 years and then starts giving incorrect answers when you upgrade to SQLite 3.37.0, would you be pleased?

In order to move forward with this optimization, I need to be convinced that it will be a net positive. I need evidence that the benefit of increased performance will outweigh the cost of fixing applications that break because they were depending on undocumented behavior. I'm not yet convinced that the benefits exceed the inconvenience in this case, but I'm open to arguments to the contrary.

(10) By David Raymond (dvdraymond) on 2021-07-15 18:23:13 in reply to 9 [link] [source]

My hunch is that you'll find that exact case with group_concat() a lot in the wild because we can't do ordered aggregates.

Adding in the ability to do

group_concat(word, ',' order by word)

would allow people to fix it, but then they'd be fixing it because it broke in the meantime.

(11) By Richard Hipp (drh) on 2021-07-15 19:34:24 in reply to 9 [link] [source]

I tried to work around this objection by only omitting the ORDER BY clause if either:

The outer query is not an aggregate
The outer query is an aggregate but it only uses the built-in count(), min(), or max() functions.

And yet there are still issues. See the ef97c3e7c3ea2cf1 check-in for the code so far.

(12) By Simon Willison (simonw) on 2021-07-15 19:45:22 in reply to 9 [link] [source]

Breaking that group_concat() example certainly looks like an unacceptable change - I wouldn't be at all surprised to see real-world code that relies on that.

In my particular case queries that I was running extremely frequently (often several times for a single HTTP page load) got nearly a 10x speed-up - but I've now changed my code to remove the inner order by (see https://github.com/simonw/datasette/issues/1394) so my project would no longer benefit from this optimization.

(13) By Larry Brasfield (larrybr) on 2021-07-15 19:51:08 in reply to 10 [link] [source]

I have that same hunch, for the same reason. And, even before you suggested that group_concat() enhancement, I was thinking, "Oh nuts. I may have to fix that usage sin committed years ago by writing a smarter concatenator() aggregate function."

It is true that there is no good, ready-to-go substitute for the aggregated items being aggregated in a user-determined order. I have relied on the present behavior myself^a, even knowing that it was not strictly predictable, taking the calculated risk that the present SQLite treatment would continue.

Based on the problem I was solving, and the extra work necessitated by more "pure" alternatives, I expect my reliance^b is not rare among users.

a. The application was to take data from a schematic database, (at which point it is organized as Dr. Codd would approve), and from that producing a report in a format which is very commonly used by organizations that build electronic assemblies. The data in the report is badly non-normalized. In particular, each part type has an associated designator list, naming the individual usages of that part type in the assembly. People like to see these in a very predictable order, mainly because people reading the report want to find "C273" or whatever.

Using a set of views and queries, I could produce that report in one fell swoop from the properly normalized database. Without the ability to order the input to group_concat() to put that designator glom in the expected order, a separate processing step would be needed, losing real convenience. Now, the BoM (Bill of Materials) view output can be simply plopped in a spreadsheet, which is the preferred form of the wants-to-be-data.

b. If the designators are misordered, the "data" is still there and subject to later fixup. It's an annoyannce problem, not a missile launch failure issue.

(14) By anonymous on 2021-07-15 19:51:53 in reply to 10 [link] [source]

I suggested the same thing before, and it has not been implemented yet. I hope that such a thing will be implemented later.

(15.1) By Richard Hipp (drh) on 2021-07-15 23:29:05 edited from 15.0 in reply to 7 [source]

Consider this SQL:

CREATE TABLE t1(a);
INSERT INTO t1 VALUES(4),(8),(3),(5),(1),(7),(2),(6);
CREATE VIEW v2 AS SELECT a FROM t1 ORDER BY a;
SELECT * FROM v2;

Would you expect the values to come out of v2 in sorted order because of the ORDER BY clause on the view? They do on all historical versions of SQLite. But if I omit the ORDER BY clauses from subqueries, that ORDER BY clause goes away. The original "SELECT * FROM v2" query is transformed into:

SELECT * FROM (SELECT a FROM t1 ORDER BY a);

The ORDER BY clause is removed from the subquery. Then the query flattener kicks in and simplifies the query to be just:

SELECT * FROM t1;

And the rows come out in some arbitrary order. I'm not sure what the SQL standards say about ORDER BY clauses on VIEWs, but the result here seems counter-intuitive to me.

My current fix for this is to only remove the ORDER BY clause on the subquery if either:

The outer query has an ORDER BY clause of its own, or
The outer query is a join - in other words there are FROM clause terms other than the subquery that contains the ORDER BY.

(16) By Ryan Smith (cuz) on 2021-07-15 23:57:33 in reply to 15.1 [link] [source]

They do on all historical versions of SQLite. But if I omit the ORDER BY clauses ... I'm not sure what the SQL standards say about ORDER BY clauses on VIEWs, but the result here seems counter-intuitive to me.

It isn't intuitive, but strictly an order clause that is not in the outer query is irrelevant to the outer query results, though not irrelevant to an inner query result, like inside GROUP_CONCATS or Windows functions.

Consider:

SELECT a,b FROM (SELECT a FROM ta ORDER BY a), tb ORDER BY b;

The "ORDER BY a" is as useful as the vestigial leg bones in a Whale and I'm all for disregarding it, even where a later "ORDER BY b" is not present.

Regarding historic use and breaking backward - what does pragma reverse_unordered_selects do with your example counter-intuitive query?

If the query output under that pragma is reversed, then all is well and good and I vote for changing it.
If however that query actually remains correctly ordered under said pragma then I for one might have some systems breaking. Like Larry and others, I would have used such queries - though I always sanitize production systems using said pragma in testing - but if that is no longer a sufficient check, I might run into trouble.

Personally I will get over it though, happily even if the change brings other optimizations, but others may disagree.

(17) By Richard Hipp (drh) on 2021-07-16 01:22:33 in reply to 1 [link] [source]

This optimization is now available on the omit-subquery-order-by branch. Try it out if you dare. All the core SQLite tests pass now, but there is still no guarantee that this will ever land on trunk.

(18) By Simon Willison (simonw) on 2021-07-16 01:50:07 in reply to 15.1 [link] [source]

I would definitely expect the order by on the view there to be respected, whether or not that's in the SQL standard.

(19) By anonymous on 2021-07-16 03:45:02 in reply to 16 [link] [source]

What might make sense (but is probably too messy) is to distinguish ordered tables and unordered tables in the FROM clause. Ordinary tables (and probably also virtual tables) are always unordered. VALUES is always ordered. Subqueries, views, and CTEs might be ordered or unordered. However, perhaps this distinguishing is too messy to work in all cases, and I don't know if it might break some optimizations, too. So, this is not my suggestion.

Some things are too difficult to do without ordered subqueries, although such uses will always be messy with ordered subqueries anyways; rather, instead of implementing ordered and unordered subqueries like above, the need for ordered subqueries could be avoided by making improvements of SQLite, such as:

Allow ORDER BY in non-window aggregate functions like PostgreSQL does (e.g. "group_concat(x, ';' order by y)"). (You can also allow first_value, last_value, and nth_value to be used like any other aggregate function.)
Some way to insert row numbers into the result set after the rest of the query is handled, using the output order of the query (this can be used for both inner and outer queries, including compounds). Although the row_number window function could do this with the existing implementation (except for recursive CTEs; see below), the query to do this is going to be complicated; the ORDER BY must be duplicated, and you might have to nest compound queries, etc.

The row numbers might be needed sometimes especially with recursive CTEs; I fail to see how else to return row numbers from a recursive CTE (presumably in the order that the rows are extracted from the queue).

(Even, many of the examples in the documentation about WITH depends on the result order of subqueries, and uses the group_concat aggregate in the way that depends on the order, and I think that it would be better to avoid depending on that by implementing these ideas.)

(Also, I had sometimes used infinite CTEs due to not knowing any other way for an extension to detect if sqlite3_interrupt() has been called (by something outside of the extension, such as the application code). Is there a better way?)

(20) By Gunter Hick (gunter_hick) on 2021-07-16 05:58:03 in reply to 19 [link] [source]

"Ordinary tables (and probably also virtual tables) are always unordered".

Nope. Virtual tables can implement indexed access. That is what the xBestIndex method is for, which is required to set the "orderByConsumed" Flag if the order of rows is guaranteed to be the requested order.

(21) By MBL (RoboManni) on 2021-07-16 06:44:20 in reply to 20 [link] [source]

"Natural ordered" (aka orderByConsumed) is also an order and NOT random.

I the past - before the "as MATERIALIZED" for the WITH clause was introduced - there was an issue where I had to add "limit 8e9" to solve my issue, which I cannot remember exactly now what it was. Might also not well be related to this thread here but just to mention as a hint about other effects out from the CTE into the framing query.

My query:

with
tab as (
  select * from ViewAllAlarms limit 8e9
),
alarms as (
  select rowid,TimeStamp,Seconds,Severity,Reason,Source,RestMsg,Message2,Message3
  from tab
  where (State isNull) and (TestEnd isNull) and (Control isNull)
),
aggregated as (
select   Severity,Reason,Source,RestMsg,Message2,Message3,count(*)NumCount
from alarms
where Severity notNull
group by Severity,Reason,Source,RestMsg,Message2,Message3
having NumCount>1
order by Severity,NumCount desc,Reason,Source,RestMsg,Message2,Message3
)
select * from aggregated;

(22) By Richard Hipp (drh) on 2021-07-16 15:05:05 in reply to 7 [link] [source]

Suppose an application-defined SQL function "sideeffectfunc(x)" has some external side effects. Or suppose that function maintains some kind of internal state such that result of each call depends on previous calls. Then the following two queries might give surprising and undesirable results if the ORDER BY clause is omitted:

SELECT sideeffectfunc(x) FROM (SELECT x FROM table ORDER BY x);
SELECT y FROM (SELECT sideeffectfunc(x) AS y FROM table ORDER BY x);

HT to Tom Lane at Postgres for spotting this one. I have not yet adjusted the implementation to take this case into account.

(23) By Larry Brasfield (larrybr) on 2021-07-16 16:30:19 in reply to 18 [link] [source]

One tenable position^a would be: The query compiler, while under no obligation to produce any particular ordering where an inner ordered SELECT is used for joins, should not needlessly drop the ordering where the row set is going into any aggregating function. After all, anybody who wants to avoid a truly extraneous inner ordering has a relatively simple option: Don't write that!

To the point on respecting order imposed with a view: As I understand SQLite's operation, once a view is incorporated into a larger query, its specifications are subject to all kinds of query flattening and rewriting. I think it would handicap the optimizer if a view's ordering was treated as sacrosanct.

a. At least I would cling to that view, for awhile.

(24) By Simon Willison (simonw) on 2021-07-16 18:26:21 in reply to 17 [link] [source]

As a really safe alternative, how about if the ignore-order-by only took place for a specific list of aggregations (or even just count)?

(25.1) By Richard Hipp (drh) on 2021-07-16 19:40:50 edited from 25.0 in reply to 24 [link] [source]

~~I don't see that that really helps. Because I still need to deal with (omit the optimization from) this case:~~

~~SELECT count(*) FROM (SELECT sideeffectfunc(x) FROM table ORDER BY x);~~

Edit: I now believe my reasoning was incorrect. See 4a76777e6d7aa126 above. Ignore this post.

(26.1) By Richard Hipp (drh) on 2021-07-16 19:46:32 edited from 26.0 in reply to 22 [link] [source]

I now think that neither of these cases is a problem. (Edit: And Tom Lane concurs.)

SELECT sideeffectfunc(x) FROM (SELECT x FROM table ORDER BY x);

This case would not omit the ORDER BY clause because of the restriction described by post 062d576715d277c8 above.

SELECT y FROM (SELECT sideeffectfunc(x) AS y FROM table1 ORDER BY x), table2;

In this case, the order of invocations of sideeffectfunc() is arbitrary anyhow. SQLite has never made any guarantees about the order of evaluation in this case. The sideeffectfunc() calls might be in x order or not, depending on available indexes, table statistics, which version of SQLite is running, and so forth.

Notice that I added another table into the outer FROM clause to make it a join. Otherwise, the restriction described by post 062d576715d277c8 above would apply and the ORDER BY would be retained.

(27) By Richard Hipp (drh) on 2021-07-16 22:51:30 in reply to 1 [link] [source]

This optimization is now on trunk. The following description is intended to serve as archival documentation.

If a subquery in the FROM clause has an ORDER BY, that ORDER BY is omitted if all of the following conditions are also true:

There is no LIMIT clause in the subquery
The subquery was not one of the virtual subqueries added internally by SQLite for window-function processing
The subquery is not part of the FROM clause in an UPDATE-FROM statement
The outer query does not use any aggregate functions other than the built-in count(), min(), and/or max() functions.
Either the outer query has its own ORDER BY clause or else the subquery is one term of a join.
The omit-ORDER-BY optimization is not disabled by the SQLITE_TESTCTRL_OPTIMIZATION test-control.

The omit-ORDER-BY optimization can be disabled (item 6 above) using the following command in the CLI:

 .testctrl optimizations 0x40000

(28) By Simon Willison (simonw) on 2021-07-17 17:44:51 in reply to 27 [link] [source]

This is fantastic. Really like that set of rules for when this applies. Thank you for this!