Unexplained difference in behavior between 3.19.3 and 3.31.1

(1) By anonymous on 2020-04-11 04:58:49 [link] [source]

I had a bug in a python program I wrote that came down to a difference in sqlite versions used in Python2 and Python3. Unfortunately, I don't know what the SQL standard specifies as the right answer (assumption: it does specify the required behavior). I spent some time looking through the timeline to try and understand if this behavior was changed explicitly but I don't have enough context to answer my own question.

Was the change in query result below intentional? If so, can someone point me at the commit?  TIA.

NOTE: I was hoping to find the set of "last ident" rows in the table with a single query and no post-processing. I now have a single query that grabs everything and filters out the last row programmatically.

I've attached a minimal example and its output below:

Script used
===========
import sqlite3,sys
conn = sqlite3.connect(sys.argv[1])
print(sqlite3.sqlite_version)
setup = '''
CREATE TABLE Weird (
   id INTEGER PRIMARY KEY,
   ident TEXT NOT NULL,
   name TEXT NOT NULL, 
   insert_time DATETIME DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO Weird(ident,name) VALUES("World","Huey");
INSERT INTO Weird(ident,name) VALUES("World","Dewey");
INSERT INTO Weird(ident,name) VALUES("World","Louie");
INSERT INTO Weird(ident,name) VALUES("Hello","Mickey");
INSERT INTO Weird(ident,name) VALUES("Hello","Goofy");
INSERT INTO Weird(ident,name) VALUES("Hello","Minnie");
'''

query = 'SELECT ident,name,insert_time FROM Weird GROUP BY ident;'

conn.executescript(setup)
conn.commit()

print(conn.execute(query).fetchall())
========================================
Running with Python3 (first row in table)
========================================
3.31.1
[('Hello', 'Mickey', '2020-04-11 04:53:56'), ('World', 'Huey', '2020-04-11 04:53:56')]
========================================
Running with Python2 (last row in table)
========================================
3.19.3
[(u'Hello', u'Minnie', u'2020-04-11 04:53:57'), (u'World', u'Louie', u'2020-04-11 04:53:57')]

(2.2) By Keith Medcalf (kmedcalf) on 2020-04-11 08:18:45 edited from 2.1 in reply to 1 [link] [source]

The query you are using is, in both instances, returning valid results. In an aggregate query (GROUP BY) a named column or expression that is NOT in the GROUP BY list and is not an aggregate is the value computed from "one of the rows" in the group (though all such values will be computed from the same row -- this is a guaranteed implementation detail). Which particular row is undefined.

SQLite3 lets you do this an an implemetation detail, most other RDBMS will give you an error that you have selected a value that is NEITHER in the GROUP BY list nor an AGGREGATE and refuse to process your query.

If you wish to have a specific result then you must specify it. For example, the query

select ident, name, max(insert_time) from Weird group by ident;

uses another implementation detail of SQLite3 where name will be the value of that column on one of the rows of the group that contains the max(insert_time) for the group.

Such a query also violates the general prohibition that the results of a select with a group by must be either aggregates or in the group by list, but in this case SQLite3 guarantees (as an implementation detail) that the value returned for non-grouped by and non-aggregates will come from one of the rows containing the aggregate result (max or min) (and that all such values will come from the same row). If more than one row meets the constraint then whether it is the first or the last row (or some other row) is undefined.

So if you were to define that (ident, insert_time) was unique, then the above query would always return the last (greatest insert_time) values for each group, guaranteed.

(3) By anonymous on 2020-04-11 08:11:23 in reply to 1 [source]

Unfortunately, I don't know what the SQL standard specifies as the right answer (assumption: it does specify the required behavior).

It does not, and many SQL implementations will not let you select a column that is not grouped by or passed through an aggregate function.

SQLite has a quirk, though, that lets you ask for non-aggregate and non-grouped columns in the result set. Usually you get an unspecified ("as if randomly chosen") row from the group, but there are special cases for min() and max() aggregate functions returning the same row that corresponded to the value chosen by min() or max(). This means that replacing insert_time with max(insert_time) in the query should let you get that "last name for each ident" you are looking for, but such query will not be portable outside SQLite.

One way to achieve this result in a portable manner would be to use subqueries:

SELECT ident,name,insert_time
FROM Weird a
WHERE insert_time = (
    SELECT MAX(insert_time)
    FROM Weird b
    WHERE a.ident = b.ident
)

(though the query is untested and I am not sure about its performance)

(4) By anonymous on 2020-04-11 14:38:39 in reply to 3 [link] [source]

Thank you both for your notes. i appreciate the time you took to provide guidance.