Best order to vaccum, analyze, create indexes, etc.. ?
(1) By ddevienne on 2020-04-03 15:28:49 [link]
Hi. I'm converting large scientific proprietary files into SQLite databases. Many blobs in two large tables, with structured meta-data in these and other tables. That's already working, but I wonder what's the optimum order for doing it. Here's how I do it currently: ``` START TRANSACTION foreach table CREATE TABLE ... -- ROWID tables for millions of rows, GB of data INSERT INTO ... CREATE INDEX COMMIT VACUUM ANALYZE ``` I've already turned OFF journaling, since the DB is not "online", not activated FKs (which are expressed in the table definitions), used a largish page-size (16K) and page-cache (32MB) to maximize import performance, etc... I have a few small tables, and two very large tables, and all inserts are *interleaved*, that's why I vaccum (an expensive operation) post-commit, to *defragment* the DB, and maximize read speed when the DBs are put "online" (data served over WebSocket using a custom protocol). I have several UNIQUE constraints in my table definitions, and I wonder whether I should remove those, and do explicit CREATE UNIQUE INDEX DDLs post-insert? (but pre-commit?). I also wonder whether the CREATE *(non-UNIQUE)* INDEX should be pre-VACCUM, or post-VACCUM? I suspect indexes wouldn't be fragmented if created post-VACCUM, thus saving that much IO from the VACCUM? Finally, I wonder if I can somehow *parallelize* inserts into my main two BIG tables? Perhaps inserting into two separate DBs, and do a CTAS from one DB to the next once both are complete, since similar to the CREATE INDEX case, this would *implicitly defragment* the table pages, no? Is CTAS optimized to do direct page copies? Or must it decode all records from the source, recode them into the destination, rebuilding the B-TREE from scratch? I'm basically asking for advice/guidance on the most performant way to do the above, which hopefully is clear enough. TIA, --DD
(2) By Richard Hipp (drh) on 2020-04-03 15:40:11 in reply to 1 [link]
In terms of performance, you could take out all UNIQUE and PRIMARY KEY constraints from the table definitions and transform those into CREATE UNIQUE INDEXs that you run at the same time as your other CREATE INDEXes (as you suggest). And you could run the CREATE INDEXes after VACUUM, but do so before ANALYZE (again, as you already suggest). Other than those tweaks, I think what you are doing is optimal.
(3) By ddevienne on 2020-04-03 16:33:31 in reply to 2 [link]
Thanks Richard. I'll do as you suggest for all indexes, UNIQUE or NOT. My PRIMARY KEYs are all either INTEGER AUTOINCREMENT as rowid aliases, or implicit rowid, so those can't be *externalized*. But what about the idea of inserting into two separate DB files? Those INSERTs can then be done in parallel (in different threads), non-interleaved, so probably don't require a VACCUM step, no? Is CTAS from an attached DB to another DB optimized to do page-level copies? Resulting in a non-fragmented continuous span of pages for the copied table?
(4) By David Raymond (dvdraymond) on 2020-04-03 17:24:36 in reply to 3 [link]
Apologies if I'm brain farting on something basic, but what is "CTAS" again? Create Table As Select?
(5) By ddevienne on 2020-04-03 17:45:59 in reply to 4 [link]
Yes.
(6) By Keith Medcalf (kmedcalf) on 2020-04-03 19:25:56 in reply to 1 [link]
Are you doing in-order inserts on this watch / phone? (32 MB is a big cache on an Apple Watch but pretty itty bitty on a computer)
(7) By ddevienne on 2020-04-06 09:32:15 in reply to 6
> Are you doing in-order inserts [...]? I guess I am, since these are rowid tables, with AUTOINCREMENT PKs. Or are talking about the Natural Key (NKs), using UNIQUE indexes? (those will be added later, post inserts, as discussed in this thread) > (32 MB is [...] pretty itty bitty on a computer) OK, sure. (took me a while to understand your comment, cutting through the sarcasm :)) I could afford more for sure, being on desktop as you guessed, but 64MB didn't show any improvements over 32MB, so I didn't both going any higher. Do you have evident Keith that even larger cache sizes benefit multi-GB inserts?
(8) By Keith Medcalf (kmedcalf) on 2020-04-06 14:16:38 in reply to 7 [link]
If the inserts are all in-order it will not make a lot of difference since not a lot of pages will be required to keep the entire changeset in the application LRU cache. I found the sweet spot where the management overhead of cache pages outweighs the benefit derived from that cache at somewhere around a quarter of a million pages. YMMV.