autovacuum in sqlite VS autovacuum in PG

(1) By Lewis (baiwfg2) on 2022-01-06 02:54:20 [link] [source]

Hi, guys.

In PG, there's a autovacuum daemon that's perioidically launched to remove dead tuples and marks the space available for future reuse. It will not return the space to OS.

On the contrary, sqlite autovacuum will try to swap valid pages residing at the end of file with free pages, and release the tail-end pages of file back to OS.

So sqlite autovacuum can shrink the file while PG autovacuum cannot.

Is my understanding correct ? Any other similarities and differences ?

(2) By Simon Slavin (slavin) on 2022-01-06 09:32:27 in reply to 1 [link] [source]

Let me be the first to ask what PG is. I'm going to assume PostgreSQL. I don't know enough about PostgreSQL to discuss it.

SQLite autovacuum can shrink the file. Since you ask, here's the short simplified version.

SQLite database files are made up of pages: blocks of a fixed length. Each page is devoted to a specific table or index. It is possible that an entire page would be released (e.g. DELETE FROM deleting lots of rows of a table, which would also delete lots of entries of an index). It is also possible that part of a page would be released (e.g. DELETE FROM deleting one just row, which would also delete an entry in an index). to release the maximum amount of filespace, you need to worry about both kinds of unused space. To find out how autovacuum does this, read this:

https://www.sqlite.org/pragma.html#pragma_auto_vacuum

However, auto_vacuum can introduce unpredictable delays in operations, and this may be inconvenient to your program. Instead you might prefer to do a VACUUM at a time when a delay doesn't matter. VACUUM reconstructs the entire database (ignoring freed space) from the data in tables. This means that not only can it free up filespace but it can also get rid of corrupted index data.

Come back to us if you have specific questions.

(4) By Lewis (baiwfg2) on 2022-01-07 02:39:32 in reply to 2 [link] [source]

Thanks for the link. It tells more details than https://www.sqlite.org/lang_vacuum.html

(3) By David Raymond (dvdraymond) on 2022-01-06 14:27:33 in reply to 1 [link] [source]

I believe your understanding is correct.

My limited understanding of Postgres is that it keeps multiple copies of each record. Each copy has metadata that says I am valid from transaction A through transaction B. It does that because older transactions still need to see the table as it was when they started even if a newer transaction comes along and makes changes. So Postgres autovacuum comes along and says "ok, you were valid from A through D, and the oldest running transaction is F, so no one needs you any more, you can be removed from the file." Or "You were valid from D through G, but the oldest running transaction is F, so F might still need you and we have to leave you in the file."

The way SQLite handles that is that in rollback journal mode you can't make changes until all other transactions complete. In WAL mode you can't checkpoint a page in the WAL back to the main file until all transactions older than that change are complete. So in WAL mode you'll have your multiple copies of a page, but they'll be in the WAL and not the main file.

SQLite's autovacuum is pretty much what you said. It's there to reclaim unused pages and keep the file size smaller by shuffling pages from the end to free pages earlier in the file.

(5) By Lewis (baiwfg2) on 2022-01-07 02:51:39 in reply to 3 [link] [source]

So for sqlite,

does update/delete operation cause sqlite generate multiple copies of each record or just one copy ? (update-in-place or not ?)
delete is easier to create freelist pages than update, so delete facilitates shrink of database file much more than update ?

(6) By Simon Slavin (slavin) on 2022-01-07 03:11:04 in reply to 5 [source]

(I'm ignoring BLOBs and other weird things here to give a simple answer.)

Updating any part of a row can require that the row be rewritten to file. Depending on whether the row takes up more or less space, and on what lengths of free spare are available, the row may or may not be re-written in place. Because SQLite keeps pending updates in a journal file, not in the main database file, the amount of time that both the old and new versions of a row are in the main database file is very short. Immediately after that, the space taken up by the old version of the data is marked as free, and may be reused by future updates or freed up by VACUUMing.

Updates can release space just as well as deletes. Rows can take up less table space, freeing up filespace. But DELETE FROM frees up space in indexes. An UPDATE doesn't do that.

Given the questions you ask I will mention that there's a trade-off here. You can write your database engine to keep the file size as small as possible, reclaiming the maximum of space at all times and releasing filespace as soon as possible. But all this processing and input/output can take up time, making operations slow, and can wear out SSD storage by making numerous unnecessary changes. For instance, in a journaling system updating a single row of data by changing values which appear in two indexes could require 12 writes ! That'a a lot of work.

In some situations it's better to just mark filespace as 'unused, available for reuse' and worry about releasing filespace only when a programmer explicitly does a VACUUM.

(9) By Lewis (baiwfg2) on 2022-01-11 06:48:47 in reply to 6 [link] [source]

Keith said:

Update is not done in place but rather the old tuple is removed and a new one inserted.

So 'update in-replace' can happen or may happen ?

(7) By Keith Medcalf (kmedcalf) on 2022-01-07 03:15:46 in reply to 5 [link] [source]

No (No and No).

That is, update/delete does not generate multiple copies of each record (tuple). Each tuple has only one data entry. Ever. Update is not done in place but rather the old tuple is removed and a new one inserted.

Not relevant.

When a page becomes free (via whatever mechanism causes the page to become free) it is added to the freelist (just as it does when there is no auto vacuum in effect). Enabling auto_vacuum causes extra information to be recorded in the database so that pages may be diddled-about (moved).

WHen the automatic vacuum is triggered (whether automagically for every commit or user activation via explicit command, as was specified by the user) the "extra information" stored in the database when auto-vacuum is enabled is used to move freelist pages to the end of the database file, whereupon the database in truncated to release the free pages to the Operating System.

There is no way for a layman to determine whether DELETE or UPDATE will cause more pages to be added to the freelist.

For example, if you have a table create table x(x integer not null check (x in (0,1)) and a table create table y(data blob not null) where the blob is 2 terrabytes, then updating one row of y will "free" 2 terrabytes of disk space, while in the former case you have to delete about 1.8 trillion rows from x to achieve the same "freeing" of disk space.

(8) By Bill Wade (billwade) on 2022-01-07 16:41:02 in reply to 7 [link] [source]

nitpick about Keith's final paragraph.

Paraphrasing

"create table x(x integer not null check (x in (0,1))"

" ... about 1.8 trillion rows [is at about 2 terabytes]"

Implying sqlite can hold each row of x in slightly more than one byte of file space.

I think that ignores too much (roughly a factor of 10) of the overhead associated each row.

For each row of table x, SQLite knows x.rowid and x.x.

From https://sqlite.org/fileformat2.html

I think I see that the leaf pages need, per row
A 2-byte cell pointer, pointing at the location of the cell holding the row on the page.

The b-tree "cell" for the table x row has
- A varint giving the cell size. This will be 1 byte.
- A varint for the rowid. For a million rows, most of these will be three bytes. For 1.8e12 rows, I believe the majority of these would be six bytes.

The payload uses the record format. It seems to need a varint (1 byte) for the header size, and a varint for each column (apparently other than the rowid) which hold the type (and for some values, including 0,1, the value) of the column. For table x, I believe the payload is 2 bytes.

For a million rows, I expect 2 byte cell pointer, 1 byte cell size, 3-byte rowid, and 2 byte payload, or 8 bytes total. I see 8MB file size when I insert a million records into table x, so that seems consistent.

For 1.8e12 rows, I think the rowid grows to ~six bytes, so the cost per row is around 11 bytes.