UPDATE ... RETURNING bug: does not present consistent snapshot of DB

(1) By Daniel Chambers (daniel-chambers) on 2023-02-21 04:06:17 [link] [source]

I think I've discovered a bug in UPDATE ... RETURNING queries, where the RETURNING does not appear to return data based on the state of the DB after all rows have been updated; instead it returns data that changes as the UPDATE statement executes.

I have reproduced this bug on SQLite 3.40.1.

To demonstrate the bug, consider the following SQL:

CREATE TABLE Parent (
  ParentId INT NOT NULL PRIMARY KEY
);

CREATE TABLE Child (
  ChildId INT NOT NULL PRIMARY KEY,
  ParentId INT NOT NULL
);

INSERT INTO Parent (ParentId) VALUES (1);
INSERT INTO Parent (ParentId) VALUES (2);

-- Add two children to parent 1
INSERT INTO Child (ChildId, ParentId) VALUES (1, 1);
INSERT INTO Child (ChildId, ParentId) VALUES (2, 1);
-- Add three children to parent 2
INSERT INTO Child (ChildId, ParentId) VALUES (3, 2);
INSERT INTO Child (ChildId, ParentId) VALUES (4, 2);
INSERT INTO Child (ChildId, ParentId) VALUES (5, 2);

-- Move all children from parent 2 to parent 1
-- and return all moved children, joined to their new parent
-- and with all the new parent's children
UPDATE Child
SET ParentId = 1
WHERE ParentId = 2
RETURNING
  JSON_OBJECT(
    'ChildId', [Child].[ChildId],
    'Parent', (
      SELECT
        JSON_OBJECT(
          'ParentId', [Parent].[ParentId],
          'Children', (
            SELECT JSON_GROUP_ARRAY(ChildObj)
            FROM (
              SELECT
                JSON_OBJECT(
                  'ChildId', child2.ChildId
                ) AS ChildObj
              FROM Child child2
              WHERE child2.ParentId = parent.ParentId
            )
          )
        )
      FROM Parent parent
      WHERE Parent.ParentId = [Child].[ParentId]
    )
  ) AS data;

If you run this query against an empty SQLite database, you will get the following output:

{"ChildId":3,"Parent":{"ParentId":1,"Children":["{\"ChildId\":1}","{\"ChildId\":2}","{\"ChildId\":3}"]}}
{"ChildId":4,"Parent":{"ParentId":1,"Children":["{\"ChildId\":1}","{\"ChildId\":2}","{\"ChildId\":3}","{\"ChildId\":4}"]}}
{"ChildId":5,"Parent":{"ParentId":1,"Children":["{\"ChildId\":1}","{\"ChildId\":2}","{\"ChildId\":3}","{\"ChildId\":4}","{\"ChildId\":5}"]}}

I think the correct output should be this:

{"ChildId":3,"Parent":{"ParentId":1,"Children":["{\"ChildId\":1}","{\"ChildId\":2}","{\"ChildId\":3}","{\"ChildId\":4}","{\"ChildId\":5}"]}}
{"ChildId":4,"Parent":{"ParentId":1,"Children":["{\"ChildId\":1}","{\"ChildId\":2}","{\"ChildId\":3}","{\"ChildId\":4}","{\"ChildId\":5}"]}}
{"ChildId":5,"Parent":{"ParentId":1,"Children":["{\"ChildId\":1}","{\"ChildId\":2}","{\"ChildId\":3}","{\"ChildId\":4}","{\"ChildId\":5}"]}}

What appears to be happening is that SQLite is evaluating the RETURNING for each row, after each row has been updated and then concatenating the results together. This is presenting a shifting view of the data in the database, as can be seen by the output where you can see the children moving onto the new parent incrementally.

I think this behaviour is wrong. RETURNING should be evaluated for all updated rows against a consistent snapshot of the DB, after all updates have been made to the rows. We should not be able to observe the UPDATE being processed incrementally and should only see results computed after the UPDATE has been applied to all necessary rows.

(2.2) By Keith Medcalf (kmedcalf) on 2023-02-21 06:23:39 edited from 2.1 in reply to 1 [link] [source]

What appears to be happening is that SQLite is evaluating the RETURNING for each row, after each row has been updated and then concatenating the results together. This is presenting a shifting view of the data in the database, as can be seen by the output where you can see the children moving onto the new parent incrementally.

This is indeed what is happening and is what the documentation states. https://sqlite.org/lang_returning.html

The returning data is accumulated incrementally as each operation is performed during the first step (as it would without returning) and then the accumulated rows are returned one after each.

I think this behaviour is wrong. RETURNING should be evaluated for all updated rows against a consistent snapshot of the DB, after all updates have been made to the rows. We should not be able to observe the UPDATE being processed incrementally and should only see results computed after the UPDATE has been applied to all necessary rows.

The purpose of the RETURNING clause is to return data regarding each incremental database change during the process of "making it so". If you wish to obtain data regarding the state of the database after all the changes have been made, then you should issue a query for that information after the changes are complete.

The only reason for the existance of the RETURNING clause is because the children do not comprehend commands to a database to "make it so". They want incremental reports on "the making of it so", rather than grasping that if one commands "make it so" then after the command is completed (successfully) the "so" that has been commanded has been rendered as commanded. In other words, they are not cognisant that a database is a stateful data store (that is a separate entity from the program issuing the commands to it) and do not understand the concept of "an authoritative source".

(3) By Daniel Chambers (daniel-chambers) on 2023-02-21 05:27:15 in reply to 2.0 [source]

Am I misunderstanding something from the documentation you reference, because it seems to read differently to how you've described it?

When a DELETE, INSERT, or UPDATE statement with a RETURNING clause is run, all of the database changes occur during the first invocation of sqlite3_step(). The RETURNING clause output is accumulated in memory. The first sqlite3_step() call returns one row of RETURNING output and subsequent rows of RETURNING output are returned by subsequent calls to sqlite3_step(). To put this another way, all RETURNING clause output is embargoed until after all database modification actions are completed.

That sounds to me like the entire update should be performed before RETURNING is evaluated.

There is a reference in the documentation to a behaviour that sounds more like what you described:

The first prototype of the RETURNING clause returned values as they were generated. [...] For these reasons, the current implementation was modified so that all database changes happen before any RETURNING output is emitted.

That makes it sounds like the incremental approach you describe is no longer how it works?

(4) By anonymous on 2023-02-21 06:01:19 in reply to 3 [link] [source]

You missed the important part:

The RETURNING clause output is accumulated in memory.

This would make no sense if the output were generated in a separate pass.

(5) By Keith Medcalf (kmedcalf) on 2023-02-21 06:01:54 in reply to 3 [link] [source]

The way that I read it is that all the changes are made during the first step.

Any returning data, computed at each operation, is accumulated (in a transient table). This accumulated (transient table) is then returned one row after each.

The first prototype of the RETURNING clause returned values as they were generated. [...] For these reasons, the current implementation was modified so that all database changes happen before any RETURNING output is emitted.

That makes it sounds like the incremental approach you describe is no longer how it works?

No. The reason for the change was that if you decided not to retrieve all the returning data rows, then the changes were only partially made. The change was to do all the changes first, and then feed back the result rows so that if you reset or finalized the statement or otherwise did not retrieve all the rows, the entire change was still "made so" (or none of it if you rolled the transaction back) -- in other words to ensure that the operation was "atomic" notwithstanding the futzing with the returning data.

(6) By Daniel Chambers (daniel-chambers) on 2023-02-21 06:30:51 in reply to 5 [link] [source]

Thank you for your replies, they are appreciated! :)

The current behaviour certainly agrees with your assessment. However, I'd argue that the results it produces are... not great.

Perhaps the solution indeed is to issue another query after UPDATE is complete, based on the rows from RETURNING. But that seems to defeat what seems like it ought to be possible with just RETURNING itself. It's a hassle to have to receive the rows back into client code and then package them up again inside a subsequent SELECT query (perhaps with a WHERE primary_key IN (...) sort of thing) in order to achieve query consistency.

I can understand how the current implementation produces these results, but I wonder if it is fulfilling the usual expectations of how SQL works in general, which is to provide a consistent snapshot of data throughout the execution of a single query. Given that the RETURNING implementation now waits until the whole update is done before it starts returning rows to the client code, could it be reasonable to have it calculate the returning rows after all the updates are done, rather than incrementally as they are done? Either way, the client is waiting for all updates to be completed before it gets a row, and waiting until all updates are done would also produce a consistent view of the data, instead of a shifting-as-the-table-changes view.

(7) By Ryan Smith (cuz) on 2023-02-21 07:03:38 in reply to 6 [link] [source]

Perhaps the solution indeed is to issue another query after UPDATE is complete...

-- and --

It's a hassle to have to receive the rows back into client code and then package them up again...

Think of it this way - Let's assume the point of view from someone on the other side of the fence for a moment, one of the people that wanted (like some often do) to understand the changes as they happened. Currently, that's how it works, and if one needed to inspect the full set (as you currently need) one could instantiate the returned values and re-query.

Now imagine SQLite changes how it works to the way you suggest. Your annoyance would be gone for sure, but what about this other someone - what recourse/mechanism do they have to understand the blow-by-blow? The answer is of course nothing, it would be impossible.

So to conclude: Right now, although one way is somewhat annoying, at least both ways are possible.

(8) By Keith Medcalf (kmedcalf) on 2023-02-21 07:20:36 in reply to 6 [link] [source]

The implementation of the RETURNING clause is different in every SQL implementation that supports it. I have no idea whether or not there is a standard, however, I kind of doubt it (or if there is, no one seems to be following it).

As I see it, the RETURNING clause in SQLite is intended to return information about the changes made by a statement to a table. It does not include anything other than the changes made by the DML command itself (no trigger effects on the very self row being modified, nor other rows modified by triggers).

The fact that you can put "arbitrary expressions" in the returning clause is what is causing you difficulties. You are expecting that these expressions are evaluated after all the changes are complete.

This is incorrect and an abuse of the RETURNING clause which is designed to return columns of the table that are the "direct" object of the operation of the DML command.

The returning clause could restrict its contents to just "bare columns" from the table that is the DML target -- this would prevent abuse. However, being able to use arbitrary scalar expressions is advantageous, as long as they are not abused.

(9) By Daniel Chambers (daniel-chambers) on 2023-02-22 00:04:40 in reply to 8 [link] [source]

If this is the behaviour that SQLite has decided to have for its RETURNING statements, it may be worthwhile updating the documentation to explicitly call out the fact that there are certain "arbitrary expressions" that can expose the implementation details of how updates are executed and that one should expect them to behave in this particular way. For me, at least, this was not clear from reading the documentation as it is right now. The docs say that the syntax of RETURNING is modelled after Postgres, but after your explanation and commitment to the current behaviour, it is clear that SQLite has decided to implement a very different behaviour for its RETURNING and I think it's worth calling that out.

(10.1) By Keith Medcalf (kmedcalf) on 2023-02-22 00:25:52 edited from 10.0 in reply to 9 [link] [source]

The documentation seems pretty explicit to me. References to "columns" of the "table" being modified and the same as would appear in a trigger fired immediately after the modification is made "for each row" -- the "new" column value in the case of insert/update, and the "old" column value in the case of delete. Any "expressions" in the returning clause are evaluated "for each row" at this time.

That is to say that a returning clause is a trigger which fires first (before all other triggers) "after" the modification is made "for each row", and the output of the trigger is socked away in a temp table for retrieval after all the updates that were commanded to be "made so" have been made so.

In other words, the RETURNING clause operates (and is implemented, interestingly enough) as if the following:

create table temp.results(one column for each column in the RETURNING clause);
create trigger returning after update/insert/delete on "table" begin
 -- references to column in "table" are to new.column for insert/update
 -- and old.column for delete
 insert into temp.results values (whatever follows the word RETURNING in the returning clause);
end;
(the insert/update/delete command here)
select * from temp.results;
drop table temp.results;

You will also note that the documentation says that the syntax was copied, not the semantics or implementation.

(11) By Daniel Chambers (daniel-chambers) on 2023-02-22 04:10:03 in reply to 10.1 [link] [source]

Perhaps to you it is explicit. However, you seem very familiar with the internals of SQLite. I humbly suggest that the documentation should not require that level of familiarity to impart the semantics of the SQL it is documenting. I certainly did not manage to gain the knowledge you've imparted to me here by reading the documentation, hence this thread! Clarifying the documentation may prevent someone asking these similar questions in the future.

Thanks for your help. I think I have what I need now to rework the way we're using SQLite on our end.