Shortcut to change a non changing update or upsert in a noop

(1.1) By Marco Bubke (marcob) on 2021-07-06 09:30:08 edited from 1.0 [link]

Hello

I try to avoid updates or upsert which are not changing values. So I write
very often (quite long) WHERE clauses. Is there maybe a more elegant way like a pragma to always check writes to avoid this code?


here a simple example:

INSERT INTO types(importId, name,  accessSemantics, sourceId)
VALUES(?1, ?2, ?3, nullif(?4, -1)) 
ON CONFLICT DO UPDATE SET prototypeId=excluded.prototypeId, accessSemantics=excluded.accessSemantics, sourceId=excluded.sourceId 
WHERE prototypeId IS NOT excluded.prototypeId OR accessSemantics IS NOT excluded.accessSemantics OR sourceId IS NOT excluded.sourceId 
RETURNING typeId

Thank you, Marco

(2) By Ryan Smith (cuz) on 2021-07-06 12:46:45 in reply to 1.1 [link]

This should not be needed, the SQL engines usually all do this internally, for the exact reason you are trying to do it.

I'm not 100% sure right this moment that SQLite does it, or how effective it is, but I have to think that with all the years of chasing faster an faster total execution times and squeezing every last drop of efficiency out of the engine, the SQLite devs would not have neglected this rather obvious first point of optimization.

Have you any observations/measurements that leads you to think the opposite?

(3.1) By Ryan Smith (cuz) on 2021-07-06 13:47:05 edited from 3.0 in reply to 2 [link]

Well, I did some testing by updating some stuff that did not need updating and querying the change counts, with astonishing results - SQLite actually updates them.

```
  -- SQLite version 3.35.4  [ Release: 2021-04-02 ]  on  SQLitespeed version 2.1.3.11.
================================================================================================

CREATE TABLE t(a,b);

INSERT INTO t(a,b) VALUES
 (1, 'John')
,(2, 'Joan')
,(3, 'Jane')
,(4, 'Joanne')
,(5, 'Jenna')
,(6, 'Jack')
;

SELECT * FROM t;

  --       a     |b       
  -- ------------|--------
  --       1     |John    
  --       2     |Joan    
  --       3     |Jane    
  --       4     |Joanne  
  --       5     |Jenna   
  --       6     |Jack    


UPDATE t SET b='Jane' WHERE a=3;
SELECT Changes();

  --   Changes() 
  -- ------------
  --       1     


UPDATE t SET b='Jane' WHERE a=3;
SELECT Changes();

  --   Changes() 
  -- ------------
  --       1     


UPDATE t SET b='Jenna' WHERE a > 4;
SELECT Changes();

  --   Changes() 
  -- ------------
  --       2     

```
This same test in, for instance MySQL, yields 0 values where changes were not needed.

Possibilities I can think of:

  - 1 "Changes()" here reports the mathematical execution (honoured requests) rather than physical data writes, 
  - 2 or perhaps the SQLite devs figured that the incidence of UPDATEs that do NOT actually alter row contents are so low during typical operation that the overhead of first reading the data to "check-if-needed" is more expensive than just always writing,
  - 3 or maybe the trigger mechanisms need the actual writing to do their thing,
  - 4 or maybe it is because I use WAL journaling which always writes,
  - 5 or perhaps it is a genuine overlooked optimization opportunity (though I have some difficulty believing this hypothesis).

Whatever the reason, you may have good reason to actually avoid it, though the habit may cause slower execution (as per point 2 above).

I'm now quite curious.

(4) By Larry Brasfield (larrybr) on 2021-07-06 15:04:04 in reply to 3.1 [link]

Ryan, I am sure you would agree that changes() should report actual inserts and rewrite of rows on update, regardless of whether the latter changed any bits.

The optimization opportunity is an interesting question, with no clear answer applicable across use cases. It takes time to read and compare the existing values, which cost would be incurred across the use cases, whereas avoiding a write to storage is obviously only possible for some fraction, likely small, of the update use cases. Given these facts, I see no harm in relying on users to do the check for whether a physical update is needed.

(5) By Ryan Smith (cuz) on 2021-07-06 16:10:57 in reply to 4 [link]

> I am sure you would agree that changes() should report actual inserts and rewrite of rows on update, regardless of whether the latter changed any bits.

I don't agree, or disagree. "Changes()", like any other function, should report what its documentation claims for it to report. Whether or not this involves actual "bits on disk" changing or not I cannot seem to find reference to in its documentation, so have no basis to argue one or way or the other.

What now transpires, regardless of what "changes()" report, is the question of "Do actual bits change on disk?", or more succinctly, does a physical file write happen if no bits have changed, or not?

If the answer is NO, then the act of trying to prevent writes by adding a lot of extra time-consuming conditions is a fool's errand.  If the answer is YES, that process may well be worth the effort.

You will find that MSSQL and MySQL for instance internally avoid writes that are not needed but then also report so in their respective "row_count()" or similar functions, though options exist to change this behaviour to show all rows matching the search rather than those actually changed, if needed.

I do not really care how SQLite reports it, as long as it is documented so, and understanding how it functions to know whether or not it is worth adding code to avoid unnecessary writes or not.


I would love a definitive answer on that.


PS:
Upon re-reading my post I think some confusion comes from this sentence:

> Whatever the reason, you may have good reason to actually avoid it...

Which can be interpreted as me warning off the habit of checking to avoid writes, but in fact, with that unfortunate phrase, by "avoid it" I really meant "avoiding the updates yourself" - as in I precisely encouraged what the OP did and is in full agreement with your statement.

I realise now also that a lot of deliberation water flowed under the bridge, yet the original question remains unanswered. I'm not sure there is a better way to do it (i.e. avoid the updates) than what the OP described, should SQLite not avoid it internally.

(7) By Larry Brasfield (larrybr) on 2021-07-06 17:10:00 in reply to 5 [link]

> > I am sure you would agree that changes() should report actual inserts and rewrite of rows on update, regardless of whether the latter changed any bits.

> I don't agree, or disagree. "Changes()", like any other function, should report what its documentation claims for it to report.

(Hmmm, not so sure now ...) Well, I'm not going to argue with **that**. But I do think (and meant to say) that it's more sensible as "number of rows made to reflect intent of the DML".

> ... Whether or not this involves actual "bits on disk" changing or not I cannot seem to find reference to in its documentation

Me neither. This looks like a doc clarification (and API disambiguation) opportunity.

> ..., so have no basis to argue one or way or the other.

How about both?

> ... I'm not sure there is a better way to do it (i.e. avoid the updates) than what the OP described, should SQLite not avoid it internally.

I would think that this might better be an option offered by a VFS. There is little point to doing work to avoid some byte writes if a page is dirtied elsewhere anyway and has to be written out.

(8) By RandomCoder on 2021-07-06 17:34:57 in reply to 7 [link]

> I would think that this might better be an option offered by a VFS. There is little point to doing work to avoid some byte writes if a page is dirtied elsewhere anyway and has to be written out.

I might be misunderstanding your point, but:

The VFS might not have enough information to optimize this.  If SQLite can determine that the a write is unnecessary from an index lookup, say, then skipping the write saves the VFS from having to determine if a non-index page is dirty in the first place.

(16) By Larry Brasfield (larrybr) on 2021-07-07 15:26:44 in reply to 8 [link]

True, the present VFS interface does not help with this. The interface would need to allow the page caching system to notify the VFS when a page is flushed, and no longer associated with a given file section. Then it would be practical for the VFS to keep a per-page CRC or checksum.

I grant that this would not be as effective as avoiding writes with more granularity where such avoidance can be an improvement at all.

I am not convinced that the required extra reading to check for "actual" update is generally worth the I/O operations potentially saved. I am suspicious of applications that would be doing a lot of useless updates. Where are they getting the update data? Why is there so much unchanged data being passed around? I would think acting on actual change is better done at a higher level than the low-level persistence layers.

(6.1) By Keith Medcalf (kmedcalf) on 2021-07-06 16:39:30 edited from 6.0 in reply to 1.1 [link]

As far as I know and can see from the code and experimentation, UPDATE/INSERT which update a value to the same value as it already was is still an update to the row.

This is because it is what was commanded by the commander, you, who is sending the commands.  You can point at a parking lot full of cars and say "update the colour on those cars so they are all green".  If the parking lot contains 25 cars, then the correct answer for "how many cars are green" is 25.

The fact that 10 of them were *already* green and did not need to be re-painted is an "optimization" and has nothing to do with the fact that you commanded to make 25 cars green, and that is exactly what happened.

If you wanted to only paint the cars which were not already green with green paint and return the count of the cars so painted, that is what you should have commanded.

In similar fashion, SQLite3 will carry out your commands as directed.  However, the avoidance of painting a car that is already green with green paint is an optimization that occurs at a lower level (ie, in the parking lot analogy, it occurs when the painter goes to paint the car and sees that it is not in need of painting so does not bother to do so).  SQLite3 will similarly avoid extra I/O of unchanged pages.

(9) By Simon Slavin (slavin) on 2021-07-07 12:20:50 in reply to 6.1 [link]

Can confirm, from reading discussions of this years ago.

Calls like <code>changes()</code> report the number of rows which satisfy the WHERE clause (and similar things which affect how many rows should be updated).  They are not influenced by existing values in fields to be overwritten.

In order to make things fast, various parts of SQLite act to minimise the actual work done to file storage.  They might be at the data level or the database page level, but you should not assume that file access is actually done (reading or writing).  As an example of this, SQLite might not even bother to read existing values from a table if they are stored in a convenient index.

If you want to know how many rows of the database would actually get different values, add that restriction to your WHERE clause.

(10) By Ryan Smith (cuz) on 2021-07-07 12:42:53 in reply to 6.1 [link]

Agreed Keith (and Simon in a later post), thanks for posting and this is all well and good, but the original question which I am still not 100% sure of, although what you said "seems to suggest", is this:

Does SQLite specifically have an optimization that do not actually write rows that did not change during an update statement, YES or NO?
Does it depend on the VFS or not? 

i.e. does it repaint Green cars to Green, or doesn't it, or does it not know? (regardless of what it reports back)

Put another way, and what we really want to know: Is it helpful for the OP to add code to his queries to avoid updating rows that already contain the correct values, or will SQLite already avoid rewriting those internally? (in which case he is wasting effort and cpu cycles for no effect).   

What it returns in the "changes()" value is a separate thought, perhaps needing better documentation as noted before, but of no consequence in answering this question.

(11) By TripeHound on 2021-07-07 13:42:19 in reply to 10

Another complication in all this is the interaction between database rows and database pages. Although it's natural to think of "updating rows", the reality (as I think I understand it) is that only whole _pages_ are updated.

Obviously things will be highly dependent on row-size vs. page-size, but as an example, suppose you have a schema where roughly 10 rows fit in one page. Even if the addition of a complicated `WHERE` clause would prevent 90% of unnecessary "row updates", then – assuming the remaining 10% of "real" updates are reasonably evenly distributed – on average every page would contain one updated row and would therefore need to be written: you wouldn't save anything over a "dumb" update statement.

I have a _feeling_ that I've seen mention that SQLite does/can optimise at the page level: if a page that has been "touched" by an `UPDATE` hasn't actually changed, then that page won't be written to disk. If that IS the case, it still wouldn't make any difference if the 10% of changed rows are evenly distributed as in the above example (because every one would have changed), but _would_ make a difference if the changed rows were more clustered: whether or not the complicated `WHERE` clause was present, only pages that have actually changed would be written.

(13) By Marco Bubke (marcob) on 2021-07-07 14:10:31 in reply to 11 [link]

I can imagine that reading and comparing can be produce quite some overhead for many cases. So I am not so sure about it's done by default. My use case is the synchronization of source text files where only minimal changes are done. So very often many tables are not changed. Otherwise I would have not bothered. Maybe a language extension like UPDATE CHANGED or INSERT CHANGED would be a nice idea? I think I will go now the safe route like proposed in the UPSERT documentation and use the WHERE clause.

(12) By David Raymond (dvdraymond) on 2021-07-07 14:07:11 in reply to 10 [link]

Granted it's not the most thorough of tests, but in Windows, when I run an update query that doesn't end up changing any values, then the Modified and Accessed times of the file are not updated, despite changes() saying something was changed. When I do an update that does change values, then the Modified and Accessed times on the file do get updated. So I'd say yes, there is some optimization there. Whether it depends on the VFS... I haven't a clue.

(14) By Ryan Smith (cuz) on 2021-07-07 14:17:35 in reply to 12 [link]

Yes, I've noted this too, and as Mr. Hound reported, the whether any single row is updated or not is irrelevant, it's more to do with if any row on a specific page is changed that decides if it gets written to disk - that is, assuming we are correct in thinking that SQLite checks it so. The non changes may be brought about by a completely other mechanism.

Either way, I'm thinking more and more that SQLite writes only when needed and any query parts to try and "help" it is a waste of CPU cycles - but I am still not 100% sure.

(15) By David Raymond (dvdraymond) on 2021-07-07 14:36:41 in reply to 14 [link]

I think the main reason changes() reports is for triggers. That way "on update" triggers will still fire even if all the old and new fields are the same for the record.

(17) By Keith Medcalf (kmedcalf) on 2021-07-07 17:11:34 in reply to 14 [link]

> Either way, I'm thinking more and more that SQLite writes only when needed and any query parts to try and "help" it is a waste of CPU cycles - but I am still not 100% sure.

I would suggest that this is untrue in some cases.  That is, using a set of conditions which limits the *candidates* may in fact reduce the amount of work done whether or not the actual amount of pages I/O changes.

That is to say that including WHERE conditions that remove candidate rows which do not need to change, while it may not affect the number of pages written, may in fact alter the query plan and cause a significant reduction in "work" performed.  It may also modify the query plan so as to increase the amount of work to be done.

In general I would suggest that the more narrowly the set of candidates to be updated is defined, the more latitude the query planner has to generate a more efficient query plan.

(18) By Simon Slavin (slavin) on 2021-07-08 13:24:00 in reply to 10 [link]

This too was discussed previously, but I can't find the post in question, or even remember whether. it was on the old mailing list.  What I remember is …

If SQLite already has the value to be overwritten in memory, because it was covered by the index used to find the row, it will check the new value against the one in memory to make sure that a change needs to be made.

If SQLite has already needed to read the old value from the table, in order to fetch enough of the row to figure out which rows satisfy the WHERE clause, it will check the new value against the one in memory to make sure that a change needs to be made.

It seems that a good way to optimise your code seems to be to make sure that the column you're updating is included in the index used to satisfy the WHERE clause.  In other words, don't write extra code, just craft your indexes carefully.  And, of course, test out the change to make sure there's a worthwhile difference.  Because if you save just a millisecond an hour, it's not worth the extra storage cost.