SQLite User Forum

dropping indexes before inserting… good, bad, maybe?

(1) By punkish on 2023-06-14 17:57:36 [link] [source]

In a quest for efficiency, I periodically DROP the INDEXes from my db before INSERTing new rows, and then CREATE the INDEXes again. If the number of new rows is not large, the time taken to rebuild the indexes is longer than the time to insert the rows. So now I am considering not dropping the indexes if the number of new rows is less than some (arbitrarily chosen) number, say 50000.

I am assuming that SQLite reindexes as new rows are inserted. Also, if I am inserting, say, 5000 rows per transaction, this reindexing will happen after every transaction is committed. I might be better off having SQLite reindex after every transaction than doing the drop-all-indexes, insert, reindex dance.
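
For concreteness, the dance looks roughly like this (a minimal sketch in Python; the table, index, and column names are just placeholders):

    import sqlite3

    new_rows = [("a", 1), ("b", 2)]                # stand-in for a periodic batch

    con = sqlite3.connect("my.db")                 # placeholder path
    con.execute("CREATE TABLE IF NOT EXISTS t (col TEXT, other INTEGER)")

    con.execute("DROP INDEX IF EXISTS idx_t_col")  # drop before the bulk insert
    with con:                                      # one transaction for the inserts
        con.executemany("INSERT INTO t (col, other) VALUES (?, ?)", new_rows)
    con.execute("CREATE INDEX IF NOT EXISTS idx_t_col ON t (col)")  # rebuild afterwards
    con.close()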

Two questions:

  1. Are my assumptions in para 2 above correct?
  2. Should I run ANALYZE after all the reindexing is done after each periodic insert?

(2) By Keith Medcalf (kmedcalf) on 2023-06-14 18:27:08 in reply to 1 [link] [source]

Are my assumptions in para 2 above correct?

No, your assumptions are incorrect (or perhaps they are correct and you are just sloppy and inexact in your use of the English language).

If there is an index on a table, then the index is updated as part of the insertion of each new record. That means that if you insert 1000 records, the index B-Tree is updated 1000 times. The updated pages (from the addition of the new rows and index entries) are written to disk when you commit the transaction (which occurs at the conclusion of each insert statement if you are not using explicit transactions -- or when the COMMIT statement completes, assuming that you are using explicit transactions).

If your cache size is too small to hold the entire set of changed pages, then pages will get flushed to the database file when the cache has no more space to contain the changed pages.

A "properly sized" application cache will obtain an average hit rate of 95% or better across all operations ever performed against the database.

If you cannot maintain the entire changeset (plus, of course, all the "parent pages") because your cache is too small then you will be performing massive I/O to read/write pages that should just be in the cache. And as everyone knows, the fastest way to do I/O is to not do it.
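
For example (a sketch only; the value below is an illustration, and you would pick one big enough to hold your change set), the page cache can be enlarged per connection with PRAGMA cache_size, where a negative value is a size in KiB:

    import sqlite3

    con = sqlite3.connect("my.db")                       # placeholder path
    con.execute("PRAGMA cache_size = -262144")           # negative means KiB: about 256 MiB
    print(con.execute("PRAGMA cache_size").fetchone())   # -> (-262144,)
    con.close()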

Should I run ANALYZE after all the reindexing is done after each periodic insert?

One would run ANALYZE in order to update the statistics.

It is entirely possible to make a database contain thousands of times the amount of data, yet still have the same statistics (in which case running ANALYZE will do nothing).

It is also possible that inserting one row in one table will make a meaningful change in the statistics.

So, ANALYZE is run to update statistics. Only you know if that is meaningful.

(6) By punkish on 2023-06-14 20:43:53 in reply to 2 [link] [source]

This is super helpful, Keith. I now understand the relationship between the insert and the index operations. I realize I do have to increase my db cache_size.

re. ANALYZE, since I have no way of knowing whether or not new data will affect the statistics, I would probably be well-advised to run ANALYZE after each update (once every 24 hours). At worst, it would do no harm other than probably grind my SSD a bit more.

(9) By Simon Slavin (slavin) on 2023-06-14 21:09:21 in reply to 6 [link] [source]

Don't bother about ANALYZE. It's not sensitive to the exact data in your table. It cares about the kind of data in your table. Unless you've changed the kind of data in the table, you don't need to run it again.

My normal advice would be to include ANALYZE in a maintenance routine to be run at most once every three months, but probably just once a year.

(3) By Spindrift (spindrift) on 2023-06-14 18:27:33 in reply to 1 [link] [source]

Efficiency is highly dependent on context.

For what purpose are you trying to optimise your database?

For rapid insertions? For rapid retrieval? If both, why, and which is the priority?

I do not wish to patronise, but what you are currently doing sounds like "fiddling".

(5) By punkish on 2023-06-14 20:38:32 in reply to 3 [link] [source]

Many thanks for your response. You may be entirely correct describing my work as fiddling, but I do take my fiddling responsibilities seriously :)

I am trying to optimize the db for rapid insertions from my pov, that is, when I update the db. My users couldn't give a rat's behind about my pov… they care only about rapid retrieval. I believe these opposing tensions are quite reasonable.

Fortunately, Keith's explanation is very helpful, and gives me some solid steps to try out going forward.

(4.1) By David Raymond (dvdraymond) on 2023-06-14 18:41:40 edited from 4.0 in reply to 1 [link] [source]

Possible terminology confusion here. There's an actual command called REINDEX which will drop and recreate indexes from the main table. That should only be done in rare cases.

Indexes are dynamic, and as you make changes to a table the indexes are kept up to date at the same time, with every single change. They are not static things that have to be refreshed now and then. If the library updated them only on transaction commit, that would be horrible: an index that didn't reflect changes made earlier in a transaction would be unusable for the rest of that transaction.

Keeping indexes up to date individual change by individual change does come with a time cost, which is one of the tradeoffs and why you might not want to make an index for every possible query, because of the increased time costs for updates, inserts, etc.

CREATE INDEX can do things more efficiently, which is why you'll see the advice for bulk loads is to create just the base table, load the data, and only then create the indexes. The size of how big a bulk load has to be to benefit from DROP INDEX-INSERT-CREATE INDEX vs just INSERT will be different for every single table and machine it's running on.
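
One way to find that break-even point empirically is to time both approaches on your own data (a rough sketch, assuming a hypothetical table t with a single index; real timings depend entirely on your schema, data, and machine):

    import random
    import sqlite3
    import time

    def load(rows, drop_first):
        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE t (col INTEGER, other TEXT)")
        con.execute("CREATE INDEX idx_t_col ON t (col)")
        start = time.perf_counter()
        if drop_first:
            con.execute("DROP INDEX idx_t_col")
        with con:                                   # single transaction for the load
            con.executemany("INSERT INTO t VALUES (?, ?)", rows)
        if drop_first:
            con.execute("CREATE INDEX idx_t_col ON t (col)")
        elapsed = time.perf_counter() - start
        con.close()
        return elapsed

    rows = [(random.randrange(10**9), "x") for _ in range(50_000)]
    print("keep index   :", load(rows, drop_first=False))
    print("drop/recreate:", load(rows, drop_first=True))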

But "having SQLite reindex after every transaction" is horrible. At the end of every transaction every index is up to date already.

ANALYZE only really needs to be re-run if the "shape" of the data has changed. Example: if you have an index on (country, city), then it doesn't really matter how many records you have, so much as if the ratio of distinct cities per country is now very different... or something like that.

(7) By punkish on 2023-06-14 20:46:40 in reply to 4.1 [link] [source]

Thanks David. Yes, implying REINDEX when I meant reindex (CREATE INDEX again) was my mistake. I didn't know of this command. I would certainly not recreate the indexes after every transaction. But, in any case, all this is moot as I will just experiment with inserting new data with a bigger cache_size and see if I can get acceptable performance.

(8) By cj (sqlitening) on 2023-06-14 21:02:37 in reply to 7 [link] [source]

Are you able to use binding with a single insert statement with an array of data?  It is very fast.

(10) By punkish on 2023-06-14 21:20:38 in reply to 8 [link] [source]

yes, of course, that is what I do. All the insert statements are prepared in advance. Each logical insert (a logical record) comprises 5-10 different inserts in 10 different tables. This logical insert (between 5-10 different inserts) is wrapped in a transaction, and 5000 of these logical inserts (so anywhere up to 50000 rows) are wrapped in an outer transaction. I am able to process a few hundred of these inserts per second, so a transaction of 5000 logical inserts takes 10-20 seconds which includes parsing the xml files (source data). The initial bulk load gets done in about an hour with all the indexing post-load.
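
(Schematically, since BEGIN does not nest in SQLite, the inner level can be a SAVEPOINT inside the outer transaction; a minimal sketch with made-up table and data:)

    import sqlite3

    batches = [[("x", 1), ("y", 2)], [("z", 3)]]            # stand-in for the logical inserts

    con = sqlite3.connect("my.db", isolation_level=None)    # manual transaction control
    con.execute("CREATE TABLE IF NOT EXISTS t (a TEXT, b INTEGER)")

    con.execute("BEGIN")                             # outer transaction
    for logical_record in batches:
        con.execute("SAVEPOINT rec")                 # inner "transaction" per logical record
        con.executemany("INSERT INTO t VALUES (?, ?)", logical_record)
        con.execute("RELEASE rec")                   # or ROLLBACK TO rec on failure
    con.execute("COMMIT")
    con.close()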

the subsequent loads (updates) are much smaller. Since I have been dropping the indexes and then re-creating them, recreating the indexes takes longer than the actual inserts. Hence this thread… but now I (think I) know how to proceed.

(11) By curmudgeon on 2023-06-15 06:23:05 in reply to 8 [link] [source]

Could you explain what you mean by this cj? Why would this make the updating of the indexes faster?

(12) By Spindrift (spindrift) on 2023-06-15 06:56:30 in reply to 11 [link] [source]

Mainly it makes bulk updates themselves much faster. If you are not doing this, and switch to doing it, the gains will outweigh whatever you are doing with the index.

(13) By curmudgeon on 2023-06-15 10:01:50 in reply to 12 [link] [source]

Can you give me an example? I'm familiar with carray, but if you have a list of rows that aren't currently anywhere in the db, what would be the advantage of storing them in a c array as compared to inserting them into a temp table and then using

insert into tbl select * from temp_table;

(15.1) By Keith Medcalf (kmedcalf) on 2023-06-15 20:35:08 edited from 15.0 in reply to 13 [link] [source]

Why do twice that which only need be done once? In other words, inserting into a temp table, and then inserting the temp table contents into the actual table, is TWO inserts -- the insertion can be done into the actual table in the first instance, thereby avoiding doing twice that which only need be done once.

The method by which the one required operation is performed is immaterial. Doing what needs to be done only once, once, rather than twice, will cut the time taken in half. This is irrespective of the manner in which you perform the insert.

(16) By curmudgeon on 2023-06-16 10:50:43 in reply to 15.1 [link] [source]

I agree with you. I'm only trying to work out why cj thought having the records to be inserted in an array would make the insertion faster. Perhaps punkish was correct in that he was only referring to saving time by not having to prepare a statement each time.

(18) By Keith Medcalf (kmedcalf) on 2023-06-16 17:49:32 in reply to 16 [link] [source]

If your records are in an "array" -- that does not mean a carray, which is a very particular type of array. The word array is generic, not specific.

Example:

stmt = prepare('insert into t values (?,?,?,?,?)')
for i in 1 to len(array):
    for j in 1 to 5:
        bind(stmt, j, array(i, j))
    step(stmt)
    reset(stmt)
    clear_bindings(stmt)
finalize(stmt)

You could do that in some language wrappers by simply making the bindings vector into an array and then doing a normal insert, just with an array of bindings.
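
In Python's sqlite3 wrapper, for example, that is roughly what executemany does (a sketch with a made-up table; one prepared statement, stepped once per row of bindings):

    import sqlite3

    rows = [(1, "a", 0.5), (2, "b", 1.5), (3, "c", 2.5)]   # the "array" of bindings

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (x INTEGER, y TEXT, z REAL)")
    with con:
        con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    con.close()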

(19) By curmudgeon on 2023-06-17 11:34:44 in reply to 18 [link] [source]

I get what you're saying Keith. I initially thought cj was suggesting there was something magical about presenting the inserts as a block insert and couldn't understand why that would make a difference as sqlite would still have to find the position in the btree for each individual insert.

(22) By Keith Medcalf (kmedcalf) on 2023-06-17 21:31:38 in reply to 19 [link] [source]

You can pre-sort the data and do an in-order insert provided that there is only one index. If there is more than one index, then you would need to convolutionally determine an insert order that was equally random with respect to each index in order to optimize insert performance.

You can eliminate the repetitive prepare time (you only need to prepare once) after which you would just "blast through" the insert data.
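
As a sketch of the pre-sorting idea (assuming a single index on a made-up column col), the batch is simply sorted by the indexed key before the prepared insert runs:

    import sqlite3

    rows = [(42, "x"), (7, "y"), (99, "z")]          # stand-in batch, keys in random order

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (col INTEGER, other TEXT)")
    con.execute("CREATE INDEX idx_t_col ON t (col)")
    with con:
        # insert in index order so successive index entries land near one another
        con.executemany("INSERT INTO t VALUES (?, ?)",
                        sorted(rows, key=lambda r: r[0]))
    con.close()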

These sorts of things have been used in Dinosaur Pens operating the entire planet for about 60 years. There is no way that a bitty-box could do the same thing, especially since the bitty-boxers do not really know how things work.

The other problem with dropping and re-creating database indexes is that it is only good for entertainment systems.

I would not shut down my Refinery every few hours so some lunkhead can drop and recreate indexes. That is why the IT types are kept out of Operations.

(24) By curmudgeon on 2023-06-18 08:41:47 in reply to 22 [link] [source]

Thanks Keith, but I'm now left wondering why the data being pre-sorted speeds things up. I've very limited knowledge of how btrees work, but I'm imagining sqlite starting at the root and finding the insert position for the top record. I'm then imagining it starting at the root for the second record, and so on. Is that the case, and the speed up comes courtesy of the fact that the route to subsequent insert points is already in the cache, OR does sqlite check back to its previous insert point and see it's beyond that?

I'm not expecting a btree tutorial; a or b will do as an answer, assuming I've guessed correctly.

(25) By cj (sqlitening) on 2023-06-18 11:19:54 in reply to 24 [link] [source]

Here is a post with tips and links.
I can't do any testing now, but will get back.

https://sqlite.org/forum/info/fd254d2db851f542

(26) By curmudgeon on 2023-06-18 11:50:46 in reply to 25 [link] [source]

That seems more about the order of columns in the stmt not mattering as opposed to the records being sorted cj.

(27) By cj (sqlitening) on 2023-06-18 12:24:32 in reply to 26 [link] [source]

Within one of the posts is this link about inserting a billion rows.

https://avi.im/blag/2021/fast-sqlite-inserts/

(28.2) By Keith Medcalf (kmedcalf) on 2023-06-18 17:34:57 edited from 28.1 in reply to 24 [link] [source]

When you insert records in a table and the table has an index, then the location in the index to insert the entry for the new record must be located (usually by tracing all the way from the root) and the insertion performed.

The records can be inserted in-order (fastest), reverse-order (next fastest), or pivot order (the least fast of the sorted orders). As each record is inserted, the position where the index entry must go must be ascertained (it is therefore perspicacious to ensure that the application cache is big enough that the application does not need to generate an I/O operation to access the data that it is going to need over and over and over again). If the records are pre-sorted into one of the above orders, then the requirement to "split an index page" so there is room to insert the record occurs with lesser frequency than if the inserts are in "random" order, where many of the inserts will require page splits.

Like I said, this has been known for about 60 years.

Clearly if you have more than one index, optimal performance is obtained where the dispersion is equal for all indexes being updated simultaneously.

REINDEX or CREATE INDEX also creates a bad index for update purposes because it is "packed". That means that nothing can be inserted into the index without splitting and folding it for each record -- which will be very slow.

Some DBMS systems (most actually) let you specify the "packing rate" for indexes when they are created. This is so that the index can be "pre-holed" for subsequent inserts without having to spindle and fold the pages.

That is why VACUUM/REINDEX and suchlike operations slow down subsequent updates so incredibly. Left to themselves, the indexes will, after a time, contain sufficient holes (be unpacked sufficiently) that subsequent index updates do not require so much spindling, mutilating, and folding.

Mutatis mutandis the entire database and all its contents. Re-packing it is the absolute worst thing you can do, as that will totally destroy update performance.

ADDED: Note that in the old days you had to "clean out the excess holes" from time to time, generally because the database was stored on spinning disk and the database blocks became scattered (slowing I/O performance), or the database hit its extent limit and the number of extents needed to be reduced (for example, several of the Dinosaurs will not let a file be extended beyond 8 extents (blocks -- no matter the size of the block), and when the database reached 8 extents it needed re-organizing into fewer extents), or it could no longer expand.

(30) By curmudgeon on 2023-06-18 17:44:54 in reply to 28.2 [link] [source]

Thanks for that Keith.

(14.1) By punkish on 2023-06-15 10:23:46 edited from 14.0 in reply to 11 [link] [source]

only cj can speak for cj, but from what I understand, cj is merely referring to using prepared statements, and executing them with an array of data, which would be faster than not using prepared statements. As I read it, cj is not implying anything with respect to making indexing faster.

(17) By ddevienne on 2023-06-16 12:01:05 in reply to 8 [link] [source]

Are you talking about multi-row inserts, like I benchmarked in https://sqlite.org/forum/forumpost/baf9c444d9 or as described in https://stackoverflow.com/questions/1609637?

If not, what are you talking about?

PS: The gains of multi-row inserts are small, especially compared to Keith's cache size tip, but 20% is not bad either.
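
A minimal sketch of the multi-row form (made-up table; the number of rows per statement is limited by SQLite's bound-parameter limit):

    import sqlite3

    rows = [(1, "a"), (2, "b"), (3, "c")]

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (x INTEGER, y TEXT)")
    # one statement carrying several rows: INSERT INTO t VALUES (?,?),(?,?),(?,?)
    sql = "INSERT INTO t VALUES " + ", ".join(["(?, ?)"] * len(rows))
    flat = [v for row in rows for v in row]
    with con:
        con.execute(sql, flat)
    con.close()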

(20) By cj (sqlitening) on 2023-06-17 13:54:30 in reply to 17 [source]

Exception: with a SQLite server, a network transmit is performed for each insert.

(21) By Keith Medcalf (kmedcalf) on 2023-06-17 19:08:41 in reply to 20 [link] [source]

And just what, exactly, has that to do with the price of tea?

(23) By cj (sqlitening) on 2023-06-18 02:43:38 in reply to 21 [link] [source]

Executing a statement in a loop may be fine on a single machine, but sending each statement over TCP to the server would dramatically decrease performance in a client/server setup.

(29) By Keith Medcalf (kmedcalf) on 2023-06-18 17:19:37 in reply to 23 [link] [source]

SQLite does not have TCP nor is it intended to operate "remotely" -- the situation you posit does not and cannot exist for SQLite3.

Of course, you could have your server and database in Amsterdam and have someone fly each record (one by each) from LA to Amsterdam and insert each record manually once they reach the server. That would be a very expensive proposition, but would not change the "insert time", merely the "latency", which are two entirely different things.