SQLite Forum

Option to make SQLite stricter with non-aggregate columns
Login

Option to make SQLite stricter with non-aggregate columns

(1.1) By Yann Salmon (yannsalmon) on 2023-06-08 11:21:41 edited from 1.0 [link] [source]

As per the documentation, SQLite is peculiar in that it allows aggregation queries with non-aggregate columns that are not functional dependencies of the grouped columns.

This gives a nice shorthand for getting the argmin or argmax of a table (but not both at the same time !), but understandably gives nonsensical results with other aggregation functions, and is therefore something one want to avoid with beginners.

Which is kind of sad because the serverless nature of SQLite makes it so usable in a teaching environment where you want to keep things simple for students and do not always have the technical possibility of setting up a server.

It is of course not possible to break backward compatibility, but would it be doable to have an option to reject non-standard queries of that kind, in the same fashion as it has recently made possible for double-quoted literals ?

(2) By anonymous on 2023-06-08 14:31:58 in reply to 1.1 [link] [source]

Does it give nonsensical results? Or just non-deterministic, incomplete ones?

Anyway, you seem to be suggesting a mode of use for sqlite such that it can be used to validate strict SQL input for teaching purposes.

Why not just use an SQL validator for this - there are many free on line ones available.

The SQL is arguably not non-standard, it is entirely acceptable SQL for sqlite's dialect.

(3) By Keith Medcalf (kmedcalf) on 2023-06-08 15:30:59 in reply to 1.1 [link] [source]

but understandably gives nonsensical results with other aggregation functions

This is manifestly untrue. The returned "bare column" comes from one of the rows comprising the group. If you ask for multiple "bare columns" then they are all taken from the same row which is a member of the group. min/max adds the additional constraint that the row chosen is one of the ones that contain the min or max.

Like almost everything else in life, you pretty much get what you ask for. If you asked for nonsense you will liklely get exactly what you asked for.

If you do not find utility in asking for a "bare column" with an aggregate other than min or max, then do not ask for the return of a "bare column" in such circumstances. Just because you do not find utility does not mean that everyone finds no utility.

Once upon a time the ability to use "bare columns" everywhere was standard. Then a bunch of cry-babies incapable of controlling themselves interfered and a perfectly good feature was destroyed. Probably the same bunch of whiney little babies that did not like the "outer conditionals" and invented that horrid wordy JOIN ... ON syntax garbage.

Reminds me of the patient that goes to the Doctor and says "Doctor, Doctor, It hurts when I stick my thumb in my eye." to which the Doctor responded "Well, do not do that then".

(4) By L Carl (lcarlp) on 2023-06-08 16:40:17 in reply to 3 [link] [source]

I feel this is an overly harsh position. I love SQLite, but I do not always type perfect SQL. I was surprised and dismayed when I first started using it and one of my aggregate queries returned incorrect results. It took me a while to figure out that my query was incorrect because it was a query that would have evoked an error in most (or all?) other SQL engines.

I see the value of this feature and I do not advocate changing it, but it does mean that some incorrect queries return surprising results. (By “incorrect”, I mean I erroneously left something out of a GROUP BY, for example.) In general, I consider it desirable for systems to detect incorrect syntax. Obviously, there are many cases where my incorrect syntax is correct syntax with a different meaning than what I was aiming for, but I consider it unfortunate when that happens for a lot of common errors.

And, because this particular feature is very non-standard, I can see value in having a way to flag it. I’d probably use that option myself if it existed since I seem to make this kind of mistake fairly often.

(6) By Keith Medcalf (kmedcalf) on 2023-06-08 17:03:21 in reply to 4 [link] [source]

I was surprised and dismayed when I first started using it and one of my aggregate queries returned incorrect results.

This is an incorrect position. It returned absolutely correct results that were exactly and precisely what you requested. That what you requested was nonsensical and did not address the reason for the issuance of the query is your problem.

A poor workman blaming its tools is par for the course.

And, because this particular feature is very non-standard, I can see value in having a way to flag it

Why? Seems a very strange way to solve laziness.

I’d probably use that option myself if it existed since I seem to make this kind of mistake fairly often.

So if it hurts when you do that, do not do that then.

(8) By ralf (ralfbertling) on 2023-06-09 08:01:27 in reply to 6 [link] [source]

Hi,

this in part is a quite philosophical discussion.

IMHO it boils down to the implicit question of Postel's Law (see https://www.sqlite.org/quirks.html#dubious_sql_is_allowed_without_any_error_or_warning ) vs. the Fail-Fast-Principle (see. https://en.wikipedia.org/wiki/Fail-fast ).

As pointed out in several bullets on Sqlite's quirks page (see above), many descisions would in retrospect perhaps been done differently.

W.r.t. a stricter type-system, it is now possible to declare tables as STRICT. The way I understand the original post, a similar solution is proposed to allow a fail early approach when using aggregate functions.

SQLite is incredible powerful and useful to lay out complex data models without the hassle to set up database servers. In case the final system is going to be multiuser and/or client/server SQLite is still very useful for prototyping, but it would be extremely helpful to have more of those strict options.

Btw. the STRICT tables would be much more useful if there was any option to handle DATE/DATETIME values and perhaps boolean values.

Regards, ralf

(11) By L Carl (lcarlp) on 2023-06-09 17:02:16 in reply to 6 [link] [source]

I disagree with your characterization of my words: My query returned “incorrect” results because it was an incorrect query. Of course, an incorrect query might return the results I expected by accident, but usually, incorrect queries will return incorrect results. I’ve been using SQL from various vendors since 1987, so I’m fairly comfortable and familiar with it. When I first saw the incorrect result from my incorrect query, I looked at my query more carefully and thought, WTF? This query is incorrect SQL! I then sought the documentation and discovered this feature.

Now, I’m going to move to thinner ice: Every SQL implementation I know of has features unique to that implementation. But, I think it is uncommon to have syntax that is an easy mistake in most implementations to be meaningful in another. Specifically, it is an easy and common error, during development, to leave out a column from a GROUP BY clause. It’s a time-saver during development when the system you are using catches your errors and reports them. My time has value, especially to me since I’m 72 and I’m running low.

Perhaps you have never made this particular mistake, but I’ve done it many times among the thousands of queries I’ve written over the years. Now, that I know SQLite better, I know to take extra care with GROUP BY and I know what to look at when I have incorrect results. But, I still see the value of the original suggestion.

Otoh, I’ve also used this feature on purpose. It saves a lot of tedious code that would be needed in other implementations. With 20-20 hindsight, I might have suggested a slightly different syntax, but I do not advocate changing it now.

(12) By Keith Medcalf (kmedcalf) on 2023-06-09 17:17:13 in reply to 11 [link] [source]

My query returned “incorrect” results because it was an incorrect query

The query returned correct results for the query that was given. If the query returned incorrect results, that would be a bug.

What you mean to say is that you did not pose a query that returned the result you desired. Nevertheless, the result was correct for the query that was posed.

For example, if you intend the variable a to hold the cosine of b, then the command a = tan(b) does not achieve that aim. However, the value of a is absolutely correct for the operation requested. That the request was in error is a user error,

(23) By L Carl (lcarlp) on 2023-06-10 17:36:35 in reply to 12 [link] [source]

I’m sure I’m wasting my time, but if I was teaching someone SQL and one of their queries returned results other than what was specified, I would probably tell them the result was “incorrect” and then proceed to help them determine which of their queries was incorrect.

The vast majority of the time, when my code gives incorrect results it is because my code is incorrect. But once in a while, it is because someone else’s code is incorrect. It’s just a question of whether it’s my bug or someone else’s bug.

We’re debating English, here, which is off-topic, so I’m not going to continue. Perhaps someone should ask ChatGPT for an opinion.

(9) By punkish on 2023-06-09 09:17:01 in reply to 3 [link] [source]

Probably the same bunch of whiney little babies that did not like the "outer conditionals" and invented that horrid wordy JOIN ... ON syntax garbage.

Is there a particular technical reason behind that assertion or is it just that it offends your aesthetic sensibilities? Curious minds want to know

(5) By Deon Brewis (deonb) on 2023-06-08 16:54:49 in reply to 1.1 [link] [source]

The same mechanism also covers DISTINCT ON though, which is an incredibly useful and powerful function.

And DISTINCT ON should probably be taught, whether it be with the DISTINCT ON keywords in other implementations or GROUP BY + non-aggregate in SQLite.

So you're going to throw the baby out with the bathwater if you cut off the min/max bare aggregate (which by itself is also insanely useful albeit non-standard).

(13) By L Carl (lcarlp) on 2023-06-09 17:19:34 in reply to 5 [link] [source]

So, I am gathering that DISTINCT ON is a Postgres feature, not implemented in SQLite. Right? But it does something similar to SQLite where bare columns in a GROUP BY query that are not included in the GROUP BY clause? Have I got that right?

(14.1) By Spindrift (spindrift) on 2023-06-09 17:29:30 edited from 14.0 in reply to 13 [link] [source]

Yes, you are correct (for a given but reasonable value of "similar"). This comparable functionality is clearly (and explicitly) documented in sqlite docs.

(15) By David Raymond (dvdraymond) on 2023-06-09 18:20:19 in reply to 13 [link] [source]

Postgres DINSTINCT ON

SELECT DISTINCT basically says use all the columns in the following query to determine distinctiveness and give me 1 row from the query per distinct set of values. (Since you're using all fields for distinctiveness then they're all gonna look the same whichever one you pick)

SELECT DISTINCT ON (things here) says use (things here) to determine distinctiveness and give me 1 row from the query per distinct set of (things here).

DISTINCT ON doesn't do any grouping of its own, if you want aggregates for example you'd have to add in GROUP BY as well.

Take for example...

SELECT DISTINCT ON (manufacturer) * FROM table_of_cars;

It goes through the table and says...

"You're a Honda? I don't have one of those yet, this row gets returned"

"You're a Ford? I don't have one of those yet, this row gets returned"

"You're another Honda? I've already got one of those, you get ignored"

(16) By Spindrift (spindrift) on 2023-06-09 18:57:32 in reply to 15 [link] [source]

Well.... Group by xxx can substitute in sqlite for distinct on xxx in postgres, if you so desire.

Distinct on cannot substitute for any but a tiny and very specific subset of group by uses though. But this isn't what was asked.

(7) By JayKreibich (jkreibich) on 2023-06-08 17:53:21 in reply to 1.1 [link] [source]

I love this feature of SQLite and feel all RDBMSs should work this way. It is hugely handy, and makes many, many queries simpler and more direct. Yes, if used incorrectly it can result in unexpected or non-deterministic results, but do can dozens of other things in SQL. In that sense, I'd argue the behavior can help teach people when they're making assumptions about their queries.

(10) By J-L Hainaut (JLHainaut) on 2023-06-09 09:53:02 in reply to 1.1 [link] [source]

As a teacher myself, I'm particularly sensitive to the OP's arguments. Two comments come to mind:

It's reasonable to assume that the OP has a fairly good knowledge of SQL and its dialects. His target audience is therefore students, which makes his request valid.

This discussion reminds me of comments in the manuals of early versions of MySQL (MySQL Reference Manual, Version 3.23.2-alpha August 8, 1999): "There are so many problems with foreign keys that we don't know where to start..." and "The only interesting thing about foreign keys is that they allow ODBC and other client programs to see how a table is connected and use it to show connection diagrams and help build applications". Fortunately, things have changed since then!

Allowing bare columns in the selection list of queries is a nice feature for seasoned programmers, but an optional stricter mode would also be desirable for novices. It's just a matter of target audience!

(27) By Keith Medcalf (kmedcalf) on 2023-06-12 15:33:07 in reply to 10 [link] [source]

Having a "Stupid User" vs "Intelligent User" switch has been suggested for a lot of software (and Operating Systems) for a very long time. In every case it would seem that it has been rejected.

Generally, the "Stupid User"s are unwilling to set the "Stupid User" mode because they believe they fall into the "Intelligent User" category, then they whine.

The worst offender is Microsoft. All Microsoft software is preconfigured for the most stupid user imaginable, and as a result presents an impossibly large and uncontrollable attach surface. It takes weeks of effort to configure any Microsoft product for "non-stupid" mode.

(17) By Richard Hipp (drh) on 2023-06-09 19:45:58 in reply to 1.1 [link] [source]

There is a prototype implementation of this on https://sqlite.org/fiddle. You have to enter the ".dbconfig strict_aggregate on" command in order to enable PostgreSQL-style strict enforcement the no-bare-aggregate-columns limitation/rule.

Behavior is the same as in PG-14, as far as I can tell, except that the text of the error messages is different (of course).

This prototype has not landed on trunk, and may or may not land, depending on community feedback.

(18) By Gerry Snyder (GSnyder) on 2023-06-09 20:36:47 in reply to 17 [link] [source]

The only downside to doing so that I can see is the testing and maintaining that you (drh) will have to do, which I cannot judge.

But I think it could be useful to some.

Gerry Snyder

(19) By mgr (mgrmgr) on 2023-06-10 08:13:08 in reply to 17 [link] [source]

I do like such an option. In an interactive/analytical scenario, helps me quicker realizing missing columns in my 'group by' ...

Tested some queries on the fiddle, looks fine so far.

(20) By Keith Medcalf (kmedcalf) on 2023-06-10 16:50:57 in reply to 17 [link] [source]

I take it this is a new configuration parameter which presumably applies to the SQLite3 engine and not the Debugging CLI -- the Debugging CLI can merely "access" the feature of the sqlite3 engine?

Or is this a continuation of the split of SQLite3 into two distinct non-compatible products: The CLI and the database library, as separate entities, with nothing in common save a fragment of the name?

(21) By Richard Hipp (drh) on 2023-06-10 16:56:58 in reply to 20 [link] [source]

It is a new sqlite3_db_config() option. So it is baked into the core.

(22) By Richard Hipp (drh) on 2023-06-10 17:29:41 in reply to 21 [link] [source]

Documentation now available on Draft

(24) By ddevienne on 2023-06-12 07:39:14 in reply to 17 [link] [source]

Reading this thread, it sounds like advanced users like the ability to select non-aggregate columns, while others (less advanced users, teachers, etc...) consider it a pitfall many fall into. It's also a compatibility issues (with not just PosgreSQL). I tend to be in the latter camp, but I also can't deny the higher expertise of those in the first camp.

My question then is: Is there a way to achieve the current (default) behavior, while that new config is active, by explicitly opting-in to that behavior, via some special pseudo (?) aggregate function?

No sure that makes sense... Because I'm not sure to understand the use cases Keith and Jay like the current behavior for. (from lack of study of this thread probably).

(25) By David Raymond (dvdraymond) on 2023-06-12 12:06:56 in reply to 17 [link] [source]

Behavior is the same as in PG-14, as far as I can tell

Postgres does allow bare columns in one case. When the full primary key of a table is included as a subset of the GROUP BY columns, then you can list any other column from that table as a bare column, because it knows for any output row it's processing will only ever be dealing with one single input row from that table.

As far as I can tell it doesn't work with unique constraints, just the primary key.

From the 9.1 release notes:

Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause (Peter Eisentraut)

The SQL standard allows this behavior, and because of the primary key, the result is unambiguous.

Oversimplified example:

psql (15.3)
Type "help" for help.

testing=> create temp table foo (i integer primary key, t text);
CREATE TABLE
Time: 19.635 ms
testing=> insert into foo values (1, 'one'), (2, 'two');
INSERT 0 2
Time: 11.100 ms
testing=> select i, t from foo group by i;
 i |  t
---+-----
 2 | two
 1 | one
(2 rows)

Time: 0.888 ms
testing=> select t, i from foo group by t;
ERROR:  column "foo.i" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select t, i from foo group by t;
                  ^
Time: 8.164 ms
testing=> create unique index t_idx on foo (t);
CREATE INDEX
Time: 6.154 ms
testing=> select t, i from foo group by t;
ERROR:  column "foo.i" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select t, i from foo group by t;
                  ^
Time: 8.733 ms
testing=>

(26) By Richard Hipp (drh) on 2023-06-12 15:16:39 in reply to 25 [source]

To my mind, this is a compelling argument that I should not land the experimental SQLITE_DBCONFIG_STRICT_AGGREGATE change.

(28.2) By Keith Medcalf (kmedcalf) on 2023-06-12 15:54:28 edited from 28.1 in reply to 26 [link] [source]

That would be a bad conclusion. If you wish it to work like Postgress then you would need to change the implementation such that if the flag is set, then the result must be unambiguous (that is, the group by + the aggreggate should select only one candidate row -- if there is more than one candidate, then an "ambiguous" error should be returned.

That is, the flag should be called SQLITE_DBCONFIG_NO_AMBIGUOUS_AGGREGATE (or better, SQLITE_DBCONFIG_UNAMBIGUOUS_BARE_COLUMNS) so that it works in a fashion plainly compatible with the label on the tin.

or even SQLITE_DBCONFIG_NO_ARBITRARY_BARE_COLUMNS performing the same way. Then any "bare column" must be a singleton. If it is not a singleton (ie, there is more than 1 value to chose from) then return an error rather than an arbitrary value).

(29) By Richard Hipp (drh) on 2023-06-12 16:30:08 in reply to 28.0 [link] [source]

But that is not how PG works. PG only allows bare column values in an aggregate if the PRIMARY KEY is used into the GROUP BY clause. I tried (using PG-14) and having a NOT NULL UNIQUE constraint in the GROUP BY is not sufficient to allow bare columns, even though it would be sufficient to guarantee a unique, well-defined answer. I understand why this limitation exists - deciding whether or not to allow bare columns means that the system has to generate a proof that the result is unique and well-defined, but that proof can sometimes be tricky to generate. So we have:

  • PostgreSQL: Allow bare columns in an aggregate if and only if the appropriate theorem prover inside the query planner is able to establish that there is only one row being aggregated for each row of output.

  • SQLite: Always allow bare columns in an aggregate and leave it up to the programmer to decide whether or not the values that come out are meaningful.

So given some query X, in PG you might have some doubt about whether or not the query is accepted by the system as whether or not the query is accepted depends on how sophisticated their theorem prover is. Perhaps some enhancement in the next release will make the theorem prover smarter and a query that is rejected today will be accepted in the new release. And just looking a a query, it become difficult to know if the query will be accepted without deep knowledge of how the PG theorem prover works.

In SQLite, on the other hand, the query is always accepted. Whether or not the results are what the programmer desired might be in question. But the results are, at least, well-defined. And a programmer and always look at the query and answer "yes" or "no" about whether or not the system will reject the query.

To my mind, the SQLite approach seems simpler to document and to understand. After learning that it is often unclear whether or not PG will allow bare columns in the query, trying to emulate PG's behavior seems like an unnecessary complication.

(30) By Richard Hipp (drh) on 2023-06-12 16:35:59 in reply to 28.2 [link] [source]

Let me try again to explain my idea:

  • PostgreSQL: Allow bare columns in an aggregate query only if either

    1. The column is part of an expression that is also a term in the GROUP BY clause, or
    2. The query is against a single table (not a join) and the complete PRIMARY KEY of that table (not just a NOT NULL UNIQUE constraint but the PRIMARY KEY) is part of the GROUP BY clause.

  • SQLite: Always allow bare columns in an aggregate query.

Which rule is easier to document? Which rule is easier to explain to students? I think we should stay with the rule that is easier to document and explain.

(31) By Richard Hipp (drh) on 2023-06-12 16:48:46 in reply to 30 [link] [source]

Further to this point, it isn't just the programmer who has to decide if a particular bare column is legal or not - the query planner has to make that decision too. The PG-style rule is more complicated, involves more code and more CPU cycles. The current prototype implementation only considers the (a) case, and it is (slightly) larger and slower than trunk. This difference will increase if I add the (b) case.

The SQLite-style decision function is much simpler: "return 1;" Smaller binary, fewer CPU cycles, less to go wrong.

I'm rapidly talking myself into closing the prototype branch without landing it on trunk...

(32) By Keith Medcalf (kmedcalf) on 2023-06-12 16:58:19 in reply to 30 [link] [source]

Clearly the current SQLite approach best meets the KISS principle.

My suggestion would (by necessity) have to generate the error at runtime and thus would merely emit "different" VDBE code if the option were activated. Detection at compilation (prepare) time would require a prover fraught with peril and would best be avoided.

Instead of emiting bytecode that merely "selects" the value, it would do the same thing, but with the additional check that if there were "more than one row" available from which the selection could be made, to chuck a runtime error. (That is, instead of jumping out of the loop when the solution is hit, you would need to maintain a counter and do a next in the loop. If the loop counter is greater than 1 then an error is chucked rather than returning the candidate result).

This would mean that "by happenstance" any arbitrary query containing bare columns with an aggregate may "work" -- it would only chuck a runtime error if it was discovered at runtime that the query was ambiguous.

This would not prevent the composition of an erroneous query (this is still the bailiwick of the programmer) but would effectively prevent "arbitrary" results.

(34) By Richard Hipp (drh) on 2023-06-12 17:30:34 in reply to 32 [link] [source]

check that if there were "more than one row" available from which the selection could be made,

That could be done. One minor problem is that by the time that the error is detected, the parse tree has already been discarded, which means it more challenging to generate a helpful error message.

You can already do this (sort of):

The PRAGMA reverse_unordered_selects=ON; statement, or its equivalent SQLITE_DBCONFIG_REVERSE_SCANORDER option to sqlite3_db_config() can often be used to determine if a query gives an ambiguous answer. Just run the query twice, once with REVERSE_SCANORDER enabled and one with it disabled, and see if you get the same answer. If the query is ambiguous, you probably will not.

Perhaps there should be a new option "RANDOM_SCANORDER" which is enabled by default and which causes the query planner to select a scan order at random (at prepare time). Then if an ambiguous query is provided, the results will likely be different from one run to the next. This would cause tests to fail, leading to an investigation, and perhaps a fix. Or, it might cause developers to just curse the database, I dunno.

Note:

PostgreSQL is not immune to ambiguous queries. You can add "LIMIT 1" to the end of a PG query and you will get back a row, but there is no guarantee about which row you will get, at least not as far as I am aware.

(33) By David Raymond (dvdraymond) on 2023-06-12 17:24:01 in reply to 30 [link] [source]

The query is against a single table (not a join)

Just for reference, Postgres doesn't limit it to a single table... which I guess would add more checking.

create temp table countries (country_code text not null primary key, name text, population int);
insert into countries values ('USA', 'United States', 331900000), ('CAN', 'Canada', 38250000);
create temp table states (country_code text not null, state_code text not null, name text, population int, primary key (country_code, state_code));
insert into states values ('USA', 'VT', 'Vermont', 645570), ('USA', 'NH', 'New Hampshire', 1389000), ('CAN', 'ON', 'Ontario', 14570000), ('CAN', 'AB', 'Alberta', 4371000);
create temp table cities (country_code text not null, state_code text not null, name text);
insert into cities values ('USA', 'VT', 'Montpelier'), ('USA', 'VT', 'Burlington'), ('USA', 'NH', 'Concord'), ('USA', 'NH', 'Manchester'), ('CAN', 'ON', 'Toronto'), ('CAN', 'ON', 'Ottawa'), ('CAN', 'AB', 'Calgary'), ('CAN', 'AB', 'Edmonton');

select
--primary key of countries is in group by, but these columns aren't
countries.name as country_name,
countries.population as country_population,
--primary key of states is in group by, but these columns aren't
states.name as state_name,
states.population as state_population,
count(*) as city_count
from
cities
inner join states using (country_code, state_code)
inner join countries using (country_code)
group by
countries.country_code, --covers primary key of countries
states.country_code, states.state_code --redundant for country_code, but needed to cover primary key of states
;


 country_name  | country_population |  state_name   | state_population | city_count
---------------+--------------------+---------------+------------------+------------
 Canada        |         38,250,000 | Alberta       |        4,371,000 |          2
 Canada        |         38,250,000 | Ontario       |       14,570,000 |          2
 United States |        331,900,000 | New Hampshire |        1,389,000 |          2
 United States |        331,900,000 | Vermont       |          645,570 |          2
(4 rows)

(35) By Richard Hipp (drh) on 2023-06-12 17:35:59 in reply to 33 [link] [source]

Thanks for the test case.

So now we (kinda) know the rules for an INNER JOIN. But what happens if we have an aggregate over an OUTER JOIN? I'm sure there must be some specific rules about that too, but I don't know what those rules are.

In a way, you are helping to prove my point. Do we really want the answer to the question "Is this bare column permitted" to be so complex? Isn't it better that the answer always be "Yes"?

(37) By Colin Caine (cmcaine) on 2023-09-10 15:31:48 in reply to 30 [link] [source]

I agree the PostgreSQL rule is too complicated, but I don't think it makes sense to choose not to have a stricter mode for ungrouped columns on this basis because we can choose our own, simpler rule.

I think a stricter rule should:

  • be easy to explain and implement
  • help users avoid accidental indeterminism
  • help novices build a useful mental model of how aggregation works
  • and, less important to me, code following the stricter rule should be more portable to other SQL dialects

The simple part of the PostgreSQL rule (which may be the rule you had implemented in the previous branch) meets my desired outcomes:

  • When GROUP BY is present, or any aggregate functions are present, it is an error for any SELECT list expression to refer to ungrouped columns except within aggregate functions.

This rule is stricter than the SQL standard or PostgreSQL. I think this is fine and that no-one will miss the mostly useless (mis)features about "functionally dependent" columns. Users can explicitly group by dependent columns if they want to select them.

The rule also disables the sqlite min/max aggregate extension. I don't know if this is okay. I found it quite complicated to rewrite queries using the extension into a form that followed the strict rule. On the other hand, users could just turn the strict feature off while they write queries like this? I rarely write or teach queries like this, so it wouldn't affect me.

The other alternative is to adjust the rule so that ungrouped columns may be selected if the select-list calls min or max on a column. We'd lose some portability and the rule would get a bit more complicated to explain. Personally, I would not want the extension in strict mode.

My (perhaps bad) attempt at rewriting a query that uses the sqlite min/max feature:

-- using min/max aggregate feature
SELECT a, b, max(c), sum(d), e FROM tab1 GROUP BY a;

-- without
WITH
agg1 as (
    SELECT a, max(c) as maxc, sum(d) as sumd
    FROM tab1
    GROUP BY a
),
-- You could skip the agg2 CTE if you were sure that
-- there's only one row for each {a, max(c)} pair.
--
-- If you contrive to have no `unique not null` column
-- then you'd need to do something complicated with
-- row_number(), but don't do that.
agg2 as (
    SELECT max(t.rowid), g.a, g.maxc, g.sumd
    FROM tab1 t
    JOIN agg1 g
    ON g.a = t.a AND g.maxc = t.c
    GROUP BY g.a, g.maxc
)
SELECT g.a, t.b, g.maxc, g.sumd, t.e
FROM agg2 g
JOIN tab1 t
  USING(rowid)

Values for the ungrouped columns b and e are always selected from the same row for each {a, max(c)} tuple in the rewritten version, which is not guaranteed when using the min/max aggregates feature.

Definitely much more complicated, though.

(36) By Richard Hipp (drh) on 2023-06-13 10:50:19 in reply to 17 [link] [source]

The branch that contains the prototype implementation has been closed. Fiddle has been recompiled to trunk and no longer supports the ".dbconfig strict_aggregate" setting.