feature request: vacuum into without indexes

(1) By neurosis (neurosis69) on 2022-01-20 12:27:01 [link] [source]

Background for request:
Currently I am supporting a sqlite data migration.

The databases size is about 45 GB and may be located at both SSD and HDD devices. Reasons for the migration are merging tables, changes of data types and adding column compression which is handled by the python application layer.
The datafile is constantly growing (from 0G to 45G in 9 months) and it was never vacuumed until now.

As the migration performance, especially for the databases located at HDD, is not very good, I was thinking of proposing `vacuum into` upfront. Than rename tables and migrate as it was planned initially. This approach is already tested and it works. But the problem is, that `vacuum into` has no feature to only copy tables.
At least, I didn't finnd it documented. 
And we really don't want to maintain all the indexes during the data migration.

Question:
Is there a way to use `vacuum into` to only create tables and ignore indexes?
I'd really like to see such a feature, like `vacuum data into ...`
(Of course, I don't speak about primary keys)

(2) By Richard Hipp (drh) on 2022-01-20 12:44:28 in reply to 1 [link] [source]

Is there a way to use vacuum into to only create tables and ignore indexes?

Use ATTACH to add a (blank) database that you will be vacuuming into. All this new database "vac".
For peak efficiency, set "PRAGMA vac.journal_mode=OFF;". Doing this means that the "vac" database will be corrupt if you lose power or crash, though the original database will be unharmed. That seems like a reasonable approach in this case.
Do "CREATE TABLE vac.whatever(...)" for each table you want to vacuum.
Run "INSERT INTO vac.whatever SELECT * FROM main.whatever;" to transfer the table content.
Detach vac.

The "VACUUM INTO" command does essentially the above, though it also processes indexes.

(3) By neurosis (neurosis69) on 2022-01-20 14:27:31 in reply to 2 [link] [source]

So I guess `INSERT INTO ... SELECT * FROM ...;` is optimized in a certain way?
How is the data sorted? Using a PK if available?
Is the optimization still valid when adding conditions (`INSERT INTO ... SELECT * FROM ... WHERE ...`)?

I'll test it anyways, thanks a lot for your quick feedback.

(4) By Simon Slavin (slavin) on 2022-01-20 15:01:24 in reply to 3 [link] [source]

You are correct that INSERT INTO ... SELECT * FROM ...; is optimized in certain ways.

The order in which rows are inserted is not documented. That means that even if I told you what the current version of SQLite did, it might change in future versions. However, if I understand the documentation properly, if you have an ORDER BY clause in your SELECT, this will be respected. Note that this might slow the process down, though.

The SELECT part of the statement should always have a WHERE clause, even if it's just WHERE true.

(5) By Larry Brasfield (larrybr) on 2022-01-20 15:30:32 in reply to 4 [link] [source]

The SELECT part of the statement should always have a WHERE clause, even if it's just WHERE true.

Why do you assert that? The WHERE clause restricts (or filters) output, and "WHERE true" is a no-op, yielding exactly the same VDBE code as if not present. Do you think it is clearer to say "there is no restriction" than to not say there is restriction?

(7) By Simon Slavin (slavin) on 2022-01-20 17:10:50 in reply to 5 [link] [source]

I got it from the section on INSERT INTO … SELECT on this page:

https://sqlite.org/lang_insert.html

To avoid a parsing ambiguity, the SELECT statement should always contain a WHERE clause, even if that clause is simply "WHERE true", if the upsert-clause is present.

Unfortunately, I see to have missed the bit about UPSERT. I thought it was true for all INSERT INTO … SELECT commands. Thanks for the heads-up.

(6) By David Raymond (dvdraymond) on 2022-01-20 15:40:43 in reply to 3 [source]

> So I guess `INSERT INTO ... SELECT * FROM ...;` is optimized in a certain way?

There is indeed an optimization for that. Here's part of the comment on it from the source code since I'm not sure if it's documented on the web page anywhere.

/*
** Attempt the transfer optimization on INSERTs of the form
**
**     INSERT INTO tab1 SELECT * FROM tab2;
**
** The xfer optimization transfers raw records from tab2 over to tab1.
** Columns are not decoded and reassembled, which greatly improves
** performance.  Raw index records are transferred in the same way.
**
** The xfer optimization is only attempted if tab1 and tab2 are compatible.
** There are lots of rules for determining compatibility - see comments
** embedded in the code for details.

...

** This optimization is particularly useful at making VACUUM run faster.
*/

(9.1) By neurosis (neurosis69) on 2022-01-22 12:08:23 edited from 9.0 in reply to 6 [link] [source]

Thanks for pointing that documentation out (and to the others for their feedback as well).

I'm currently testing the approach brought up by drh and it looks promising.

One follow-up question. How should I split such a data copy into smaller chunks to prevent swapping?

What I do is

BEGIN TRANSACTION
INSERT INTO db2.xxx SELECT * FROM main.yyy
COMMIT
PRAGMA SHRINK_MEMORY
BEGIN TRANSACTION
... proceed with other tables

for instance, when I want to use a batch size of 10000 and I change the statement to INSERT INTO db2.xxx SELECT * FROM main.yyy where rowid between 0 and 10000 will this than still be optimized? Or are there other ways to optimize that like using 'PRAGMA temp_store=file`?

(8) By Donald Griggs (dfgriggs) on 2022-01-20 19:04:51 in reply to 2 [link] [source]

Neurosis,

Just for kicks I wrote the script below as a mostly-generalized way to perform a vacuum without indexes. You'd still want to customize it with your source database name and any needed persistent pragma settings.

If you named it "MyCustomVac.sql" you'd invoke it with a new target db name, e.g.

sqlite3 MyCustomVac.sql NewDBname.db

Caution: Not well tested. I'm no expert. Use at own risk.

/* Script to perform quasi-vacuum from one db to another, but omit indexes */

/* Run this script from new, empty target database */

/* !customize with source db name below*/
attach 'myOriginal.db' as cur;

.bail on
.echo off

.output 'MyVacTEMP.sql'
.print PRAGMA foreign_keys=0;
.print PRAGMA ignore_check_constraints=0;
.print PRAGMA synchronous=0;

/* !customize by adding any needed persistent pragma settings below  E.g.  ".print PRAGMA page_size=8192;" */

.print /* Create schema */

.print BEGIN IMMEDIATE;
select sql || ';' from cur.sqlite_schema where name not like 'sqlite__%' AND type <> 'index';
.print COMMIT;

.print /* Insert data into our new tables (can simplify if don't want the loading of each table to be separately atomic)*/
select printf(
         'BEGIN IMMEDIATE;' || x'0A'
      || 'INSERT INTO "%s" SELECT * FROM cur."%s";' || x'0A'
	  || 'COMMIT;' , name, name ) 
   from cur.pragma_table_list 
   where name not like 'sqlite__%'
   and type <> 'view';

.output stdout

.echo on

/* run the vacuum-ish script we just created */
.read 'MyVacTEMP.sql'

.print Done -- Did you remember persistent pragma settings?

.quit