Best order to vaccum, analyze, create indexes, etc.. ?

(1) By ddevienne on 2020-04-03 15:28:49 [link]

Hi. I'm converting large scientific proprietary files into SQLite databases.  
Many blobs in two large tables, with structured meta-data in these and other  
tables.

That's already working, but I wonder what's the optimum order for doing it.  
Here's how I do it currently:

```
START TRANSACTION
foreach table
  CREATE TABLE ... -- ROWID tables
for millions of rows, GB of data
  INSERT INTO ...
CREATE INDEX
COMMIT
VACUUM
ANALYZE
```

I've already turned OFF journaling, since the DB is not "online",  
not activated FKs (which are expressed in the table definitions),   
used a largish page-size (16K) and page-cache (32MB) to maximize  
import performance, etc...

I have a few small tables, and two very large tables, and all  
inserts are *interleaved*, that's why I vaccum (an expensive  
operation) post-commit, to *defragment* the DB, and maximize  
read speed when the DBs are put "online" (data served over  
WebSocket using a custom protocol).

I have several UNIQUE constraints in my table definitions,  
and I wonder whether I should remove those, and do explicit  
CREATE UNIQUE INDEX DDLs post-insert? (but pre-commit?).

I also wonder whether the CREATE *(non-UNIQUE)* INDEX should be  
pre-VACCUM, or post-VACCUM? I suspect indexes wouldn't be fragmented  
if created post-VACCUM, thus saving that much IO from the VACCUM?

Finally, I wonder if I can somehow *parallelize* inserts into my  
main two BIG tables? Perhaps inserting into two separate DBs, and  
do a CTAS from one DB to the next once both are complete, since  
similar to the CREATE INDEX case, this would *implicitly defragment*  
the table pages, no? Is CTAS optimized to do direct page copies?  
Or must it decode all records from the source, recode them into the  
destination, rebuilding the B-TREE from scratch?

I'm basically asking for advice/guidance on the most performant way  
to do the above, which hopefully is clear enough. TIA, --DD

(2) By Richard Hipp (drh) on 2020-04-03 15:40:11 in reply to 1 [link]

In terms of performance, you could take out all UNIQUE and PRIMARY KEY
constraints from the table definitions and transform those into
CREATE UNIQUE INDEXs that you run at the same time as your other
CREATE INDEXes (as you suggest).  And you could run the CREATE INDEXes
after VACUUM, but do so before ANALYZE (again, as you already suggest).
Other than those tweaks, I think what you are doing is optimal.

(3) By ddevienne on 2020-04-03 16:33:31 in reply to 2 [link]

Thanks Richard. I'll do as you suggest for all indexes, UNIQUE or NOT.

My PRIMARY KEYs are all either INTEGER AUTOINCREMENT as rowid aliases,  
or implicit rowid, so those can't be *externalized*.

But what about the idea of inserting into two separate DB files?  
Those INSERTs can then be done in parallel (in different threads),   
non-interleaved, so probably don't require a VACCUM step, no?

Is CTAS from an attached DB to another DB optimized to do page-level copies?  
Resulting in a non-fragmented continuous span of pages for the copied table?

(4) By David Raymond (dvdraymond) on 2020-04-03 17:24:36 in reply to 3 [link]

Apologies if I'm brain farting on something basic, but what is "CTAS" again?

Create Table As Select?

(5) By ddevienne on 2020-04-03 17:45:59 in reply to 4 [link]

Yes.

(6) By Keith Medcalf (kmedcalf) on 2020-04-03 19:25:56 in reply to 1 [link]

Are you doing in-order inserts on this watch / phone?

(32 MB is a big cache on an Apple Watch but pretty itty bitty on a computer)

(7) By ddevienne on 2020-04-06 09:32:15 in reply to 6

> Are you doing in-order inserts [...]?

I guess I am, since these are rowid tables, with AUTOINCREMENT PKs.  
Or are talking about the Natural Key (NKs), using UNIQUE indexes?  
(those will be added later, post inserts, as discussed in this thread)

> (32 MB is [...] pretty itty bitty on a computer)

OK, sure. (took me a while to understand your comment, cutting through the sarcasm :))  
I could afford more for sure, being on desktop as you guessed, but 64MB  
didn't show any improvements over 32MB, so I didn't both going any higher.  
Do you have evident Keith that even larger cache sizes benefit multi-GB inserts?

(8) By Keith Medcalf (kmedcalf) on 2020-04-06 14:16:38 in reply to 7 [link]

If the inserts are all in-order it will not make a lot of difference since not a lot of pages will be required to keep the entire changeset in the application LRU cache.

I found the sweet spot where the management overhead of cache pages outweighs the benefit derived from that cache at somewhere around a quarter of a million pages.  YMMV.