Can this simple query be optimized?

(1.1) By andyMcoy on 2020-05-12 22:31:11 edited from 1.0 [link] [source]

On the following table

CREATE TABLE foo (x INTEGER PRIMARY KEY, y INTEGER);
INSERT INTO foo VALUES (0,41), (1, 23), (2,45), (3,32), ...

I need the x and y which has min(y) over groups of 10 x, and the same for max(y):

SELECT x, min(y) FROM foo GROUP BY (x/10)
UNION
SELECT x, max(y) FROM foo GROUP BY (x/10);

The EXPLAIN QUERY PLAN output shows that two scans of the table are performed

`--COMPOUND QUERY
   |--LEFT-MOST SUBQUERY
   |  |--SCAN TABLE foo
   |  `--USE TEMP B-TREE FOR GROUP BY
   `--UNION ALL
      |--SCAN TABLE foo
      `--USE TEMP B-TREE FOR GROUP BY

Is there any way to reword the query so that only one scan is performed?

What I've done in the mean time is to select all rows (SELECT x, y FROM foo;) and manually aggregate min/max as rows are returned to the host language (C++):

int lastGroup = 0;
while (sqlite3_step(query) == SQLITE_ROW) {
  int x = sqlite3_column_int(query, 0);
  int y = sqlite3_column_int(query, 1);
  int group = x / 10;
  if (group != lastGroup) {
    // save minX, minY, maxX, maxY in a list somewhere
    // reset minX, minY, maxX, maxY
    // ...
    lastGroup = group;
  }  
  else {
    if (y < minY) {
      minX = x;
      minY = y;
    }
    else if (y > maxY) {
      maxX = x;
      maxY = y;
    }
  }
}

This achieves a single scan and the whole process is more than twice as fast... but I'd rather express this logic declaritively in SQL if possible. I'm kind of surprised that sqlite doesn't recognize the opportunity to do this in one scan.

(2) By Keith Medcalf (kmedcalf) on 2020-05-12 22:19:53 in reply to 1.0 [link] [source]

That is probably the best solution. Well, after fixing all the errors at any rate. I do not believe there is currently a way to meet your constraints by any other simple means.

(3.1) By Ryan Smith (cuz) on 2020-05-13 01:59:50 edited from 3.0 in reply to 2 [link] [source]

Easy:

SELECT (x/10), min(y), max(y) FROM foo GROUP BY (x/10)

[Edit: corrected (x/10) column]

(5) By Keith Medcalf (kmedcalf) on 2020-05-13 01:27:00 in reply to 3.0 [link] [source]

This query would return only a single value of "x" from the group where that value of "x" corresponds to "some row" on which "one of the minima or maxima" were found (the last such row "visited" actually).

The original request was:

For each grouping of 10 x's (x // 10) find:

The maximum Y and (one of) the X on which that maximum occurs; and,
The minimum Y and (one of) the X on which that minimum occurs.

The only case in which the two X values are the same are the case in which the Y value of all rows comprising the group are the same.

A query such as:

select x/10, min(y), max(y) from foo group by x/10;

would indeed return the "group number" and the min/max for that group using a single table scan, however, that is not the data that is being sought.

(8) By Ryan Smith (cuz) on 2020-05-13 01:49:28 in reply to 5 [link] [source]

Hi Keith, sorry the reply was not meant on your post, but on the original post, and I agree the "x" is a mistake and should read "(x/10)" as was my intent.

More importantly, I know that is not the data sought in the original, my post was more intended tongue-in-cheek to point out the silliness of NOT getting both minima and maxima in one line, especially since the OP was willing to program an entire piece of extra code to speed things up slightly - you'd think reading one more field would be very painless next to that.

(7.1) By andyMcoy on 2020-05-13 13:58:31 edited from 7.0 in reply to 3.0 [link] [source]

Deleted

(4) By jake on 2020-05-13 00:17:41 in reply to 2 [link] [source]

Here are 2 options equivalent to your original query which will only perform a single scan of table foo:

Using a temp table

CREATE TEMPORARY TABLE x AS
SELECT x, min(y) min_y, max(y) max_y FROM foo GROUP BY (x/10);

SELECT x, min_y
  FROM x
 UNION
SELECT x, max_y
  FROM x;

In a single query

SELECT DISTINCT x,
       CASE a
         WHEN 1 THEN min_y
         WHEN 2 THEN max_y
       END y
  FROM (SELECT x, min(y) min_y, max(y) max_y FROM foo GROUP BY (x/10))
  JOIN (SELECT 1 a UNION ALL SELECT 2);

(6) By Keith Medcalf (kmedcalf) on 2020-05-13 01:32:29 in reply to 4 [link] [source]

Again, this only produces "an X" where the minima or maxima was found. Not the X where the max(y) was found and the (different) X where the min(y) is to be found. That is, the X value is wrong 50% of the time (assuming (x/10,y) is unique).

(9) By Ryan Smith (cuz) on 2020-05-13 02:22:36 in reply to 6 [link] [source]

True, but I doubt that's what the OP meant to get.

If it is, his original query relies on a special quirk of SQLite that is not guaranteed to work the same in future, and will not work in any other SQL engine.

If it IS the case, he is better off with:

WITH M(x, y, miny, maxy) AS (
    SELECT x, y, 
           min(y) OVER (PARTITION BY (X/10)), 
           max(y) OVER (PARTITION BY (X/10))
      FROM foo
)
SELECT x, y FROM M WHERE y = miny
UNION
SELECT x, y FROM M WHERE y = maxy
;

And there are no shortcuts to make that happen in a single loop I think. Perhaps persisting the subquery/CTE would be slightly faster?

(11) By Mark Lawrence (mark) on 2020-05-13 06:22:47 in reply to 9 [link] [source]

Your window function version has the following query plan:

QUERY PLAN
`--COMPOUND QUERY
   |--LEFT-MOST SUBQUERY
   |  |--CO-ROUTINE 1
   |  |  |--CO-ROUTINE 4
   |  |  |  |--SCAN TABLE foo
   |  |  |  `--USE TEMP B-TREE FOR ORDER BY
   |  |  `--SCAN SUBQUERY 4
   |  `--SCAN SUBQUERY 1
   `--UNION USING TEMP B-TREE
      |--CO-ROUTINE 1
      |  |--CO-ROUTINE 5
      |  |  |--SCAN TABLE foo
      |  |  `--USE TEMP B-TREE FOR ORDER BY
      |  `--SCAN SUBQUERY 5
      `--SCAN SUBQUERY 1

I don't know how much of a performance difference it makes in practice, but the following query plan has fewer co-routines and subqueries:

WITH M(x, y, rmin, rmax) AS (
    SELECT x, y,
           rank() OVER (PARTITION BY (X/10) ORDER BY y ASC),
           rank() OVER (PARTITION BY (X/10) ORDER BY y DESC)
      FROM foo
)
SELECT x,y
FROM M
WHERE rmin=1 OR rmax=1;

--  QUERY PLAN
--  |--CO-ROUTINE 1
--  |  |--CO-ROUTINE 3
--  |  |  |--CO-ROUTINE 4
--  |  |  |  |--SCAN TABLE foo
--  |  |  |  `--USE TEMP B-TREE FOR ORDER BY
--  |  |  |--SCAN SUBQUERY 4
--  |  |  `--USE TEMP B-TREE FOR ORDER BY
--  |  `--SCAN SUBQUERY 3
--  `--SCAN SUBQUERY 1

(12) By Mark Lawrence (mark) on 2020-05-13 06:30:21 in reply to 11 [link] [source]

I just realized my suggestion may or may not be sufficient for the original poster. Their original union query and the union query above always produce two rows for each group. My version only produces one row when y min and y max are the same.

(13) By andyMcoy on 2020-05-13 14:19:06 in reply to 9 [link] [source]

If it is, his original query relies on a special quirk of SQLite that is not guaranteed to work the same in future, and will not work in any other SQL engine.

It is indeed what I meant to get. I vaguely recall reading something about the non-standard behavior of sqlite's min/max function in the official documentation, but I can't find it now. Is my memory bad or is this documented somewhere? The "standard"-compliant query that you suggest is taking 3x as long as the original "non-standard" query (the non-standard query itself being 5x slower than my manual aggregation), so if that behavior isn't likely to change I might continue to use the "non-standard" behavior in the future.

(15.1) By Ryan Smith (cuz) on 2020-05-14 01:39:02 edited from 15.0 in reply to 13 [link] [source]

I vaguely recall reading something about the non-standard behavior of sqlite's min/max function in the official documentation, but I can't find it now

Well, it's specifically NOT documented because it isn't a feature, it's merely a happenstance that goes something like this:

[Edit]: This statement is wrong - there is clear documentation here as noted by Keith in reply. The rest of this message may still be useful to visitors as an explanation, kept in with some edits:

SQLite very kindly does not require non-aggregate values to be in aggregate functions in an aggregate query, yet will typically pick the values adjacent to those picked in the aggregate function to show in the non-aggregate results of the same row.

It is not in SQL standard/rules and no other DB engine does this[1].

I hope that statement is clear, but just in case, here is an example - suppose this schema:

CREATE TABLE t(x,y);
INSERT INTO t(x,y) VALUES
 (1, 15)
,(2, 93)
,(3, 12)
,(4, 81)
;

Now an aggregate query like this:

SELECT x, MAX(y) FROM t GROUP BY 'A';

It will produce in SQLite (and ONLY in SQLite):
  x  |  y
 --- | ---
  2  |  93

In MSSQL it won't run, complaining of not allowing non-aggregates in Aggregate queries (as it should) and in MySQL (sans "Strict" mode anyway) it would accept it, but could easily return:

  x  |  y
 --- | ---
  1  |  93
or maybe
  x  |  y
 --- | ---
  4  |  93

i.e. the x would be completely unrelated to the aggregated max(y).

... is taking 3x as long as the original "non-standard" query (the non-standard query itself being 5x slower than my manual aggregation)

Well yes, you are asking a general-case DB engine to do computations in a general construct under strict adherence to relational algebraic rules, to compete in a rigged race where the other contestant (your code) made very well by a good programmer with immaculate insight into the content of the data and no care about SQL standards adherence...

It is little wonder the other contestant won. If you can do it in code, and you really care about the speed, you should.

...so if that behavior isn't likely to change I might continue to use the "non-standard" behavior in the future.

Yes - it is documented and therefor will be supported so until 2050!

[1] I can't of course be 100% sure of this, I don't use every DB engine, nor know all the most recent updates to the ones I do use, so it may be someone somewhere also has this quirk that I'm unaware of - however, that is beside the point.

(16) By Keith Medcalf (kmedcalf) on 2020-05-13 23:06:29 in reply to 13 [link] [source]

This is specifically a specifically documented and supported feature of SQLite3.

It is documented here https://sqlite.org/lang_select.html#bareagg and reproduced for you below:

Side note: Bare columns in an aggregate queries. The usual case is that all column names in an aggregate query are either arguments to aggregate functions or else appear in the GROUP BY clause. A result column which contains a column name that is not within an aggregate function and that does not appear in the GROUP BY clause (if one exists) is called a "bare" column. Example:

    SELECT a, b, sum(c) FROM tab1 GROUP BY a;

In the query above, the "a" column is part of the GROUP BY clause and so each row of the output contains one of the distinct values for "a". The "c" column is contained within the sum() aggregate function and so that output column is the sum of all "c" values in rows that have the same value for "a". But what is the result of the bare column "b"? The answer is that the "b" result will be the value for "b" in one of the input rows that form the aggregate. The problem is that you usually do not know which input row is used to compute "b", and so in many cases the value for "b" is undefined.

Special processing occurs when the aggregate function is either min() or max(). Example:

    SELECT a, b, max(c) FROM tab1 GROUP BY a;

When the min() or max() aggregate functions are used in an aggregate query, all bare columns in the result set take values from the input row which also contains the minimum or maximum. So in the query above, the value of the "b" column in the output will be the value of the "b" column in the input row that has the largest "c" value. There is still an ambiguity if two or more of the input rows have the same minimum or maximum value or if the query contains more than one min() and/or max() aggregate function. Only the built-in min() and max() functions work this way.

(17) By Ryan Smith (cuz) on 2020-05-14 01:22:34 in reply to 16 [link] [source]

I stand corrected, unlike the other aggregates, min() and max() handling is fully deterministic in aggregate and is not just a temporary quirk as I thought, and if documented it should be supported as-is till 2050 - Thanks Keith for posting this.

There you have it Andy, you can disregard my previous warning completely - it is indeed safe to use the query as in the original post!

(19) By doug (doug9forester) on 2020-05-14 14:32:54 in reply to 17 [link] [source]

Ryan, just to be clear, SQLite's handling of min() and max() is not fully deterministic. In the case that Keith pointed out, where there is more than one row with a max or a min, you don't know which row will be picked to satisfy the x.

(21) By Keith Medcalf (kmedcalf) on 2020-05-14 18:23:24 in reply to 19 [link] [source]

It will be the last one visited that satisfied the aggregate. That makes it 100% deterministic (meaning that if you do not change ANYTHING you will always get the same answer).

It may, however, be somewhat difficult to explain how exactly at any particular point in time for any particular set of circumstances this may be determined, so the answer to "which one" if the choices are multiple is "undefined" in order to avoid having to explain the complicated (but entirely deterministic) result, and to permit that behaviour to change in the future (that is, the particular row returned from amongst multiple choices is not specified in order that as long as one of them is returned, then which of them is not guaranteed).

In other words, it is fully deterministic, but not defined.

(18) By andyMcoy on 2020-05-14 01:26:25 in reply to 16 [link] [source]

Thanks so much, you've all been helpful.

(10) By jake on 2020-05-13 03:33:39 in reply to 6 [link] [source]

Very correct Keith.

I was able to achieve a "single scan" of foo with a similar approach to Ryan using window functions and a temp table like this:

SELECT first_value(x) OVER win x_min_y,
       last_value(x) OVER win x_max_y,
       first_value(y) OVER win min_y,
       last_value(y) OVER win max_y
  FROM foo
WINDOW win AS (PARTITION BY x/10 ORDER BY y ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

However the overhead of such a query is worse than the OPs original query, expecially if an appropriate index exists i.e.

CREATE INDEX idx ON foo(x/10, y);

SELECT x, min(y) FROM foo GROUP BY (x/10)
UNION
SELECT x, max(y) FROM foo GROUP BY (x/10);

(14) By doug (doug9forester) on 2020-05-13 15:40:48 in reply to 10 [source]

One problem with the query as

SELECT x, min(y) FROM foo GROUP BY (x/10) UNION SELECT x, max(y) FROM foo GROUP BY (x/10);

is that you can't tell whether the x you get is related to the max or the min of the group (without doing another query on the results). You really need 5 values all together in one row of output. This CTE query gives you all the information in a simple and easy to understand query:

sqlite> WITH bar1(x_miny, miny, xd10) AS ...> ( SELECT x, min(y), (x/10) FROM foo GROUP BY (x/10) ) ...> ,bar2(x_maxy, maxy, xd10) AS ...> ( SELECT x, max(y), (x/10) FROM foo GROUP BY (X/10) ) ...> SELECT bar1.x_miny, bar1.miny, bar2.x_maxy, bar2.maxy, bar1.xd10 ...> FROM bar1 JOIN bar2 ON bar1.xd10=bar2.xd10; x_miny|miny|x_maxy|maxy|xd10 1|23|2|45|0

I don't believe you can get all the required information without scanning twice because the max and mins will (almost always) be in different rows of foo. The use of a temporary table adds the equivalent of a scan to load the temp table, so there is no performance gain using it.

As an aside, if other SQL engines don't return correlated fields for aggregate queries (the x in the SELECT), how can you get an x?

(20) By ddevienne on 2020-05-14 18:10:21 in reply to 14 [link] [source]

you can't tell whether the x you get is related to the max or the min of the group

Of course you can. Just

SELECT 'min', x, min(y)
  FROM foo
 GROUP BY (x/10)
UNION ALL
SELECT 'max', x, max(y)
  FROM foo
 GROUP BY (x/10);

I used 'min' and 'max' for illustrative purpose. Could use 0 and 1 too.

Note the UNION ALL, to avoid an unnecessary sort+dedup, which I suspect
if not what you wanted, since then you could have fewer than 2x group rows
with just UNION.