Best order to vaccum, analyze, create indexes, etc.. ?

(1) By ddevienne on 2020-04-03 15:28:49 [link] [source]

Hi. I'm converting large scientific proprietary files into SQLite databases.
Many blobs in two large tables, with structured meta-data in these and other
tables.

That's already working, but I wonder what's the optimum order for doing it.
Here's how I do it currently:

START TRANSACTION
foreach table
  CREATE TABLE ... -- ROWID tables
for millions of rows, GB of data
  INSERT INTO ...
CREATE INDEX
COMMIT
VACUUM
ANALYZE

I've already turned OFF journaling, since the DB is not "online",
not activated FKs (which are expressed in the table definitions),
used a largish page-size (16K) and page-cache (32MB) to maximize
import performance, etc...

I have a few small tables, and two very large tables, and all
inserts are interleaved, that's why I vaccum (an expensive
operation) post-commit, to defragment the DB, and maximize
read speed when the DBs are put "online" (data served over
WebSocket using a custom protocol).

I have several UNIQUE constraints in my table definitions,
and I wonder whether I should remove those, and do explicit
CREATE UNIQUE INDEX DDLs post-insert? (but pre-commit?).

I also wonder whether the CREATE (non-UNIQUE) INDEX should be
pre-VACCUM, or post-VACCUM? I suspect indexes wouldn't be fragmented
if created post-VACCUM, thus saving that much IO from the VACCUM?

Finally, I wonder if I can somehow parallelize inserts into my
main two BIG tables? Perhaps inserting into two separate DBs, and
do a CTAS from one DB to the next once both are complete, since
similar to the CREATE INDEX case, this would implicitly defragment
the table pages, no? Is CTAS optimized to do direct page copies?
Or must it decode all records from the source, recode them into the
destination, rebuilding the B-TREE from scratch?

I'm basically asking for advice/guidance on the most performant way
to do the above, which hopefully is clear enough. TIA, --DD

(2) By Richard Hipp (drh) on 2020-04-03 15:40:11 in reply to 1 [link] [source]

In terms of performance, you could take out all UNIQUE and PRIMARY KEY constraints from the table definitions and transform those into CREATE UNIQUE INDEXs that you run at the same time as your other CREATE INDEXes (as you suggest). And you could run the CREATE INDEXes after VACUUM, but do so before ANALYZE (again, as you already suggest). Other than those tweaks, I think what you are doing is optimal.

(3) By ddevienne on 2020-04-03 16:33:31 in reply to 2 [link] [source]

Thanks Richard. I'll do as you suggest for all indexes, UNIQUE or NOT.

My PRIMARY KEYs are all either INTEGER AUTOINCREMENT as rowid aliases,
or implicit rowid, so those can't be externalized.

But what about the idea of inserting into two separate DB files?
Those INSERTs can then be done in parallel (in different threads),
non-interleaved, so probably don't require a VACCUM step, no?

Is CTAS from an attached DB to another DB optimized to do page-level copies?
Resulting in a non-fragmented continuous span of pages for the copied table?

(4) By David Raymond (dvdraymond) on 2020-04-03 17:24:36 in reply to 3 [source]

Apologies if I'm brain farting on something basic, but what is "CTAS" again?

Create Table As Select?

(5) By ddevienne on 2020-04-03 17:45:59 in reply to 4 [link] [source]

Yes.

(6) By Keith Medcalf (kmedcalf) on 2020-04-03 19:25:56 in reply to 1 [link] [source]

Are you doing in-order inserts on this watch / phone?

(32 MB is a big cache on an Apple Watch but pretty itty bitty on a computer)

(7) By ddevienne on 2020-04-06 09:32:15 in reply to 6 [link] [source]

Are you doing in-order inserts [...]?

I guess I am, since these are rowid tables, with AUTOINCREMENT PKs.
Or are talking about the Natural Key (NKs), using UNIQUE indexes?
(those will be added later, post inserts, as discussed in this thread)

(32 MB is [...] pretty itty bitty on a computer)

OK, sure. (took me a while to understand your comment, cutting through the sarcasm :))
I could afford more for sure, being on desktop as you guessed, but 64MB
didn't show any improvements over 32MB, so I didn't both going any higher.
Do you have evident Keith that even larger cache sizes benefit multi-GB inserts?

(8) By Keith Medcalf (kmedcalf) on 2020-04-06 14:16:38 in reply to 7 [link] [source]

If the inserts are all in-order it will not make a lot of difference since not a lot of pages will be required to keep the entire changeset in the application LRU cache.

I found the sweet spot where the management overhead of cache pages outweighs the benefit derived from that cache at somewhere around a quarter of a million pages. YMMV.