SQLite User Forum

Repeated rows in select loop

(1) By HashBackup (hashbackup) on 2022-04-04 23:40:48

The https://sqlite.org/isolation.html page says:

"If an application issues a SELECT statement on a single table like "SELECT rowid, * FROM table WHERE ..." and starts stepping through the output of that statement using sqlite3_step() and examining each row, then it is safe for the application to delete the current row or any prior row using "DELETE FROM table WHERE rowid=?". It is also safe (in the sense that it will not harm the database) for the application to delete a row that expected to appear later in the query but has not appeared yet. If a future row is deleted, however, it might happen that the row turns up after a subsequent sqlite3_step(), even after it has allegedly been deleted. Or it might not. That behavior is undefined. The application can also INSERT new rows into the table while the SELECT statement is running, but whether or not the new rows appear in subsequent sqlite3_step()s of the query is undefined. And the application can UPDATE the current row or any prior row, though doing so might cause that row to reappear in a subsequent sqlite3_step(). As long as the application is prepared to deal with these ambiguities, the operations themselves are safe and will not harm the database file."

Is it still true that modifying a table during a select loop on that table can cause things like repeated rows? I have all my loops coded with tests like "if this rowid is less than the previous rowid, ignore it", and am wondering if this is still necessary.

Thanks! Jim

(2) By Larry Brasfield (larrybr) on 2022-04-04 23:58:10 in reply to 1

Your question/worry leads me to think you should read enough of the docs to: (1) learn that each statement sequence executes within a transaction; (2) learn that, within a transaction, statements can only execute sequentially; and (3) understand that separate transactions do not interfere with the view of the database that each is working with. (This is the Isolation part of ACID.)

Short answer: Your worry is based on a conceptual model that is inapplicable.

(3.2) By Keith Medcalf (kmedcalf) on 2022-04-05 00:39:07 edited from 3.1 in reply to 1

If you execute a SELECT on a CONNECTION, and on that very same SINGLE CONNECTION execute DML that transmogrifies the underlying data being "traversed" by the SELECT running on that connection, then the changes that you have made will be presented to you.

That is to say that, if you were, for example, reading "data rows" from a "file" and in the midst of reading from the file you make changes to the file, then those changes will be visible to operations that occur after the change is made.

The "proper" way to do this (if you really cannot construct a proper update) is to ensure that you use a SEPARATE CONNECTION, because transaction isolation occurs at the CONNECTION level, not the statement level.

Alternatively, you can make sure that the data your select is retrieving is not, at the time of presentment to you, dependent on the content of the underlying database.

That is to say, more clearly: the method of datastore update invented to deal with this exact situation when it first arose in the 1940s remains unchanged. First, start an IMMEDIATE transaction; SELECT all your data; only after all the data has been read, issue the updates IN THE EXISTING TRANSACTION; then COMMIT the transaction.
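That read-everything-first pattern might look like the following sketch (table and column names are invented; `isolation_level=None` puts Python's sqlite3 in autocommit mode so the explicit BEGIN/COMMIT are honored as written):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts(id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 100), (2, 250), (3, 40)])

# 1. Start an IMMEDIATE transaction so no other writer can slip in.
conn.execute("BEGIN IMMEDIATE")

# 2. Read ALL the rows first; the SELECT is finished before any update runs.
rows = conn.execute("SELECT id, balance FROM accounts").fetchall()

# 3. Issue the updates in the same transaction, then commit.
for rid, balance in rows:
    conn.execute("UPDATE accounts SET balance = ? WHERE id = ?",
                 (balance + 10, rid))
conn.execute("COMMIT")

balances = [b for (b,) in
            conn.execute("SELECT balance FROM accounts ORDER BY id")]
```

Because the SELECT is fully consumed before the first UPDATE executes, there is no open read loop for the writes to disturb.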

Note that some big-iron databases allow a FOR UPDATE OF ... clause to be appended to the SELECT so that the planner can ensure that the UPDATE OF the listed fields will not affect the projection.

(4) By Gunter Hick (gunter_hick) on 2022-04-05 06:27:07 in reply to 1

Your check will only work if the visitation order of the table is the rowid, which means you need to explicitly "ORDER BY rowid ASC".

In the general case, the constraints of the WHERE clause may cause an index to be used, which may cause the visitation order to differ from ascending rowid, rendering your check unusable.

The terms "current row", "prior row" and "future row" in the documentation pertain to the visitation order, which may be checked using the EXPLAIN QUERY PLAN feature.

BTW: If ORDER BY is implemented via a BTree, then this would indicate that all of the source rows have already been selected before the first row is returned, thus tending to immunize the result set from any changes to the table.
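As an illustration of checking the visitation order with EXPLAIN QUERY PLAN (the exact detail strings vary between SQLite versions, but the TEMP B-TREE marker has been stable for some time):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1(id INTEGER PRIMARY KEY, x TEXT UNIQUE)")

# With a plain ORDER BY x, the UNIQUE index on x can supply the order,
# so rows are visited (and returned) in index order as they are found.
plain = [row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t1 ORDER BY x")]

# The unary "+" disqualifies the index for sorting, so SQLite reads every
# source row into a temp B-tree before returning the first result.
plus = [row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t1 ORDER BY +x")]
```

Seeing "USE TEMP B-TREE FOR ORDER BY" in the plan is the hint that the result set has been materialized before the first row reaches the application.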

(5) By Richard Hipp (drh) on 2022-04-05 11:47:45 in reply to 1

"Is it still true that modifying a table during a select loop on that table can cause things like repeated rows?"

Yes. If you have been doing this and have not had problems yet, then you are lucky. A future enhancement to the query planner might cause your code to break.

In all existing versions of SQLite, you can work around this by including an ORDER BY clause on the query that cannot be satisfied by an index. If the first term of the ORDER BY clause begins with "+", then the use of an index for that clause is prohibited. Probably it will be safe to do this moving forward. I can't think of a reason why it shouldn't continue to work forever, though I do reserve the right to change my mind on that point.
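A small demonstration of that workaround, under the assumption described above (the "+" forces a temp B-tree sort, so the whole result set is computed before the first row comes back):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "v") for i in range(1, 11)])
conn.commit()

# "ORDER BY +id" cannot be satisfied by the rowid order, so SQLite sorts
# into a temp B-tree: every source row is read before the first row is
# returned, and the deletes inside the loop cannot disturb the output.
seen = []
for (rowid,) in conn.execute("SELECT id FROM t ORDER BY +id"):
    seen.append(rowid)
    # Delete a "future" row on the same connection, mid-loop.
    conn.execute("DELETE FROM t WHERE id = ?", (rowid + 1,))
conn.commit()

remaining = [r for (r,) in conn.execute("SELECT id FROM t")]
```

Every original row still appears exactly once in the loop, even though all but one of them were deleted while the SELECT was running.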

Summary

Very likely, if you include an ORDER BY clause on your SELECT statement where the first term of the ORDER BY clause begins with "+" then you can do DELETE, INSERT, and UPDATE statements against any tables in the database without affecting subsequent output rows from the SELECT statement.

Aside

The fact that people want to make changes to a table based on the results of a SELECT from that same table makes me wonder if I ought to get busy and add support for a MERGE statement to SQLite? Or if not that, perhaps some other mechanism that causes the entire result set of a SELECT to be computed first before any rows are returned?

(6) By ddevienne on 2022-04-05 11:54:39 in reply to 5

"mechanism that causes the entire result set of a SELECT to be computed first"

Isn't that what the CTE MATERIALIZED hint already does?

(7) By Richard Hipp (drh) on 2022-04-05 11:59:13 in reply to 6

Good point. That is what the MATERIALIZED hint is supposed to do. The MATERIALIZED hint is always honored, IIRC. But the NOT MATERIALIZED hint is not necessarily honored, if the query planner cannot figure out how to do so. So NOT MATERIALIZED is saying "please do it this way if you can". But MATERIALIZED should always be honored, I think.

I should double-check on that...

(8) By Richard Hipp (drh) on 2022-04-05 12:06:40 in reply to 7

I'm wrong. MATERIALIZED does not work like that. Here is a counter-example:

CREATE TABLE t1(id INTEGER PRIMARY KEY, x TEXT UNIQUE);
EXPLAIN
  WITH t2(id, x) AS MATERIALIZED (SELECT id, x FROM t1)
  SELECT * FROM t2 ORDER BY x;

If the MATERIALIZED actually caused the CTE to be materialized here, then the output would be the same as:

CREATE TABLE t1(id INTEGER PRIMARY KEY, x TEXT UNIQUE);
EXPLAIN SELECT * FROM t1 ORDER BY +x;

I think what is happening here is that SQLite sees that the CTE is already materialized (since it is a real table on disk) and so it feels like it can simply ignore the MATERIALIZED hint.

(9) By midijohnny on 2022-04-05 12:26:14 in reply to 5

Would using the existing UPDATE-FROM mechanism fit for this use case?
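For reference, a sketch of what UPDATE-FROM looks like for this kind of job (requires SQLite 3.33.0 or later; the table and column names here are invented). Because the join and the update happen inside one statement, there is no application-side select loop for the writes to interfere with:

```python
import sqlite3

# UPDATE ... FROM needs the library to be SQLite 3.33.0 or later.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inv(name TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("CREATE TABLE delta(name TEXT PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO inv VALUES (?, ?)", [("bolt", 10), ("nut", 4)])
conn.executemany("INSERT INTO delta VALUES (?, ?)", [("bolt", -3), ("nut", 2)])
conn.commit()

# One statement joins the two tables and applies the change in bulk.
conn.execute("""
    UPDATE inv SET qty = inv.qty + delta.amount
    FROM delta WHERE delta.name = inv.name
""")
conn.commit()

qtys = dict(conn.execute("SELECT name, qty FROM inv"))
```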

(10) By HashBackup (hashbackup) on 2022-04-06 01:40:27 in reply to 1

Thanks for all the different insights, much appreciated!

Here's a recent example where this came up. HashBackup has a file table and block table, with multiple blocks for most files - standard stuff for a backup program. A retention schedule deletes old copies of files and their blocks, but because of deduplication, some blocks are kept. New files and blocks get new primary keys (integers), so over time, there are lots of holes in the primary keys. On my 12-y/o dev server backup for example, the highest block number was around 170M but there are only 12M blocks in the backup. There are customers who have overflowed a 32-bit block id, which isn't a problem for SQLite, but is a difficulty for some other parts of HashBackup like the dedup table.

So I wrote a "tune" command to compact the primary keys in several tables. It does a select, loops through the primary keys, and when it finds a gap, changes the primary key to eliminate the gap, updates the record, then continues the select. I do have that test in there to ignore any keys that have already been seen, but if I stick a print on it, it doesn't seem to ever trigger.

Is this something that's going to break? Is it necessary instead to do one select pass, record all the gaps, then go back and do all the updates?

Thanks again for the tips! Jim

(11) By Simon Slavin (slavin) on 2022-04-06 03:15:53 in reply to 10

While you're iterating through rows in an order that depends on values, don't do anything that changes those values. You have no way to predict how the two operations would interact.

For reasons explained in previous answers to your original question, your idea of doing a SELECT pass first is a good answer. Perform the SELECT pass first, and keep a list (in memory, or in a SQLite table) of the existing primary keys.

Once that is done, iterate through your list of primary keys, considering those as 'old key', and keeping a counter which starts from 1 as 'new key'. Thus you have pairs as follows:

old key   new key
      1         1
      2         2
      3         3
      5         4
      6         5
      9         6
For each pair where 'new key' ≠ 'old key', issue an UPDATE to make the change. Because the rows are processed in ascending key order, the new key is never larger than the old key, so at no point do two rows share the same key.

You can do all this inside a transaction (BEGIN EXCLUSIVE before, END after) and that will make the changes happen in the minimum time (I'm omitting complexities here for the sake of brevity).
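Putting those steps together, the compaction could look something like this sketch (the `block`/`blockid` names are stand-ins for HashBackup's actual schema; `isolation_level=None` lets the explicit BEGIN EXCLUSIVE/COMMIT pass straight through to SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE block(blockid INTEGER PRIMARY KEY, data TEXT)")
conn.executemany("INSERT INTO block VALUES (?, ?)",
                 [(k, "data%d" % k) for k in (1, 2, 3, 5, 6, 9)])

conn.execute("BEGIN EXCLUSIVE")

# Pass 1: read every existing key first, in ascending order.
# The SELECT is fully consumed before any UPDATE runs.
old_keys = [k for (k,) in
            conn.execute("SELECT blockid FROM block ORDER BY blockid")]

# Pass 2: renumber. Processing in ascending order guarantees the new key
# is never larger than the old key, so no UPDATE collides with a live row.
for new_key, old_key in enumerate(old_keys, start=1):
    if new_key != old_key:
        conn.execute("UPDATE block SET blockid = ? WHERE blockid = ?",
                     (new_key, old_key))
conn.execute("COMMIT")

compacted = [k for (k,) in
             conn.execute("SELECT blockid FROM block ORDER BY blockid")]
```

In a real backup database the foreign keys referring to `blockid` would of course have to be renumbered in the same transaction.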

As you noted, this is necessary not due to any shortcoming of SQLite, but because of something else that has to work with the same values.