Automatic indexing (idle question)
(1) By Simon Slavin (slavin) on 2021-09-05 21:53:05
I'm just pondering this. I don't have a specific project or use in mind.
Various parts of SQLite look like they're part of a move towards automatic indexing. I'm wondering how far away we are from this. The average programmer would not bother using CREATE INDEX at all; it might be left to programmers with an unusually good understanding of SQL. I'm thinking of SQLite not only creating its own indexes, but also deleting indexes which go unused or which should be subsumed by a longer index.
Presumably the programmer would somewhere set a number which changes the balance between filespace used and time saved, and another number which says whether actual index creation can happen during everyday API calls, or whether it should be left for runs of a special PRAGMA.
Is this a solved problem? Or perhaps a problem that someone proved cannot be usefully solved?
(2) By Gunter Hick (gunter_hick) on 2021-09-06 12:26:44 in reply to 1
Idle answer: Apart from indices required to fulfil constraints (e.g. UNIQUE and FOREIGN KEY), the utility of an index shows up only in it being used by the queries run against the database. This is, in principle, available only in retrospect. AFAIK SQLite does consider creating an index to run just one statement fast, but only if the cost of creating such an index is expected to be amortised within that single statement.

Even if the application programmer has a defined set of queries, there is no guarantee that SQLite will have seen all of said set; and it may well be that the one index never before used is justified by a need for a blindingly fast response to an "emergency query" required to "save the planet from an asteroid strike" (gross exaggeration), in which case Murphy ensures that SQLite will have deleted exactly this index just 5 minutes before it is needed. And then there is still the issue of how much time one is prepared to sacrifice in each and every query for an as yet unknown benefit that may arise from an automatically added/dropped index.
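As a small illustration of that single-statement case, EXPLAIN QUERY PLAN will tell you when SQLite has decided to build such a transient index (the table and column names below are made up; whether the planner actually does this depends on its cost estimates):

  CREATE TABLE a(x INTEGER, y TEXT);
  CREATE TABLE b(x INTEGER, z TEXT);
  -- neither table has an index on x, so the join below is a candidate
  -- for a one-statement "automatic" index on the inner table
  EXPLAIN QUERY PLAN SELECT * FROM a JOIN b ON a.x = b.x;
  -- typical output includes a line such as:
  --   SEARCH b USING AUTOMATIC COVERING INDEX (x=?)
  -- the index lives only for the duration of the statement and is never
  -- written into the database file

The behaviour can be switched off entirely with PRAGMA automatic_index = OFF.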
(3) By Ryan Smith (cuz) on 2021-09-06 13:18:23 in reply to 1
Nice Idea. Full-self-driving for SQL Engines. I suspect however it will be much more delayed than the Tesla version.
The problem I think is that the Query planner and expert mode suggestions require a posed query to assess.
The only realistic way a DB can be full-self-driving is if every column in every table were also indexed, along with every combination of columns (all permutations), up to and including full covering indexes.
Obviously this would grow the DB size exponentially, but perhaps if we could prune down the needed indexes...
To prune that list you have to consider removing (or not-adding) every index. To remove any index IA on column A (or any Index IABC on columns A, B and C), you have to answer the question:
"Can I guarantee that there will never be a circumstance where querying by this specific column (or combination of columns) will be useful/appropriate?"
Of course that is not an answer that can be known algorithmically at design-time. Perhaps a sort-of learning algorithm can observe the DB in-use for some time and prune the never-used indexes. But for this to be acceptable you have to ensure that any query that will ever be needed of the DB is run within that learning time-frame, and as Gunter already pointed out with his World-ending analogy, that is not possible or even feasible.
Perhaps then a kind of mode where the requirement is: "Run the learning algorithm while using the DB for every possible query that may be needed of it", producing a robust set of indexes that is guaranteed to service at least those queries. I feel like this can already be achieved by a script that takes as input a DB and a set of queries, and outputs the distinct list of indexes obtained from running those queries through the CLI's .expert mode.
Actually, that may be a worthwhile script to design, perhaps with the added option of simply creating those indexes. Not sure I would blindly trust it myself, but it may be a good start.
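For what it's worth, a manual version of that needs nothing beyond the CLI (the table, columns and query here are invented; .expert is marked experimental, so treat its suggestions as a starting point rather than gospel):

  -- inside the sqlite3 shell, against a copy of the real database
  .expert
  SELECT total FROM orders WHERE customer = 'x' ORDER BY placed_at;
  -- instead of running the query, the shell prints any CREATE INDEX
  -- statements it thinks would help, plus the query plan that would
  -- result; repeat ".expert" followed by a query for each statement in
  -- the application's set, then de-duplicate the suggested indexes

A script would just automate that loop and, optionally, execute the de-duplicated CREATE INDEX statements.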
(4) By Richard Damon (RichardDamon) on 2021-09-06 14:48:46 in reply to 3
The problem with indexing every column is that this is actually an anti-pattern in most cases.
Indexes are an optimization tool, and almost always have a trade-off. They make some operation, a lookup, faster at the cost of space (which itself can slow some things down), and the need to update the index for every database update.
Update frequency is a very important factor in deciding on indexes. If you rarely look up by the value of a column, but the table is updated a lot, it may be faster to just do a table scan for that lookup, especially if the table size is somewhat limited because most of the updates change existing values rather than add new rows (or we are doing a lot of deletes as well).
On the other hand, a very static table that is updated rarely, but looked up a lot can benefit a lot from a good set of indexes, and good table statistics for the planner.
So, not only do you need to run all the select queries that you will use, but you also need a good representation of the updates you will be doing so you can get an estimate of the full cost for each index.
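You can get a feel for that update cost directly in the CLI (table, index names and row count below are arbitrary, purely for illustration):

  .timer on
  CREATE TABLE t(a INTEGER, b INTEGER, c INTEGER);
  -- bulk insert with no indexes in place
  WITH RECURSIVE cnt(x) AS (SELECT 1 UNION ALL SELECT x+1 FROM cnt WHERE x < 100000)
    INSERT INTO t SELECT x, x, x FROM cnt;
  CREATE INDEX t_a ON t(a);
  CREATE INDEX t_b ON t(b);
  -- the same bulk insert again: each new row now has to be added to two
  -- B-trees as well as the table, and the timer shows the difference
  WITH RECURSIVE cnt(x) AS (SELECT 1 UNION ALL SELECT x+1 FROM cnt WHERE x < 100000)
    INSERT INTO t SELECT x, x, x FROM cnt;

Whether that slowdown matters depends entirely on the read/write mix, which is exactly the kind of statistic an automatic scheme would have to track.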
(5) By Simon Slavin (slavin) on 2021-09-06 18:58:01 in reply to 4
I was envisioning SQLite keeping a set of statistics, like the ones prepared by ANALYZE. In its simplest state, for each set of search/sort conditions (combinations of WHERE and ORDER BY), keep stats of how often it is used. Every so often analyze this data to figure out which indexes would be most useful. Then decide whether to make the changes right now or not.
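For reference, the statistics ANALYZE keeps already live in an ordinary table, so at least the bookkeeping half of this is plausible; the table and index below are invented for the example:

  CREATE TABLE orders(id INTEGER PRIMARY KEY, customer TEXT, total REAL);
  CREATE INDEX orders_customer ON orders(customer);
  INSERT INTO orders(customer, total) VALUES ('a', 1.0), ('b', 2.0);
  ANALYZE;
  -- sqlite_stat1 holds one row per index: table name, index name, and a
  -- space-separated list of row-count estimates used by the query planner
  SELECT tbl, idx, stat FROM sqlite_stat1;

What it does not record is how often each WHERE/ORDER BY shape was actually asked for, which is the part an automatic scheme would have to add.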
Yes, there are a ton of considerations not included in the above, but that was my top-level understanding of how it would work.
And as stated upthread, making an index on each column is not a solution to any reasonable problem. When I find databases where people have done this they usually betray other misunderstandings of SQL.
(6.1) By J.M. Aranda (JMAranda) on 2021-09-06 19:45:54 edited from 6.0 in reply to 3
My grandfather started programming with the IBM 2300 magnetic drum. Then came tracks and cylinders, files, ISAM, SQL. Each step absorbs the previous one. Indexes will be automatic one day for sure. But for now the design of it is the responsibility of the programmer. Indexes usually take up more space than tables. I've seen systems with terabytes of indexes. Indexes are a drag. Quantum computers will finish them off.
(7) By J.M. Aranda (JMAranda) on 2021-09-06 19:15:30 in reply to 5
N columns require 2^N indices.
(8) By Simon Slavin (slavin) on 2021-09-07 15:24:59 in reply to 7
Not in this context, because the order of the columns matters.
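For a concrete count, take three columns a, b, c (illustration only):

  unordered non-empty subsets:      2^3 - 1 = 7
  ordered column lists, length 1:   3
  ordered column lists, length 2:   3 * 2 = 6
  ordered column lists, length 3:   3 * 2 * 1 = 6
  total ordered candidates:         3 + 6 + 6 = 15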
I don't understand why you mentioned it in the first place.
(9) By Ryan Smith (cuz) on 2021-09-07 21:36:58 in reply to 6.1
I think maybe you are somewhat mistaking the nature of this forum.
Your statements seem to try and prompt the sort of musings that entertain old men with puffs of smoke exiting their pipes while wielding wrinkled brows, deep stares and slow beard-rubs in that far-away mystical land, known only as "reddit".
More to the point - they don't add anything of value to any of the conversations, yet they appear with startling regularity. "Sounding clever" is not an admired trait here... one could say it achieves the opposite.
I won't go so far as another poster to wish you moderated away, you are welcome to stay and be ignored - Or stay, and join us proper by making the next statement one that is both relevant and helpful. Not only may it help the others, but I for one might come to enjoy reading them. And I'm not suggesting appeasing me - I am nobody, but I can't be the only one feeling this way, so I hope that if I am open about my thoughts it might be useful to you.
PS: Apologies to everyone else - The irony is not lost on me of now having posted a response that is neither helpful, nor relevant, to the original question.
(10) By Larry Brasfield (larrybr) on 2021-09-07 22:36:29 in reply to 9
Ryan,
Occasional reminders (or original lessons) on forum topicality appear to be needed from time to time. This seems to be one of them. Your response is helpful insofar as it helps promote honoring the forum's purpose and respecting other participants' use of their time here.
I'm sure you speak for many others, although many of them have likely learned the merits of a "don't feed the troll" approach to promoting topicality.
Mr. Aranda,
I see that you are somewhat clever, and that makes me hope that you are able to appreciate how useful this forum is to busy people who come here because threads tend to be helpful, on-topic and friendly. A side of entertainment is not going to draw much objection, but please avoid making it the main course in your posts. It would be a shame for your transition to the set of unmoderated posters to become a source of regret.
(11) By Ryan Smith (cuz) on 2021-09-08 00:10:41 in reply to 10
Well put Larry, and I would like to add this somewhat in his defense (lest he thinks we are only negative):
For all I said/pleaded with Mr. Aranda, I honestly do believe he is not a troll. That is, I've not seen a post intended to solicit argument, to infuriate/bait others, or to derail the thread topic; merely some that are off-topic or don't add substance.
(12) By Max (Maxulite) on 2021-09-09 10:42:01 in reply to 5
As a probable variation on creating indexes with a "based on ..." principle in mind, I wonder how hard it would be (for the majority of cases) to implement something like a SUGGEST INDEX {sql query} command. Given a query, it would suggest the necessary indices if they don't already exist. Basically the code should parse the query, find all index-related pieces (ORDER BY, GROUP BY, WHERE), identify the fields and field sets involved, and suggest new indexes if necessary.
I suspect that although SQLite itself is the best fit for the task, a third-party tool may also be a good candidate, using SQLite as a helper. For example, since EXPLAIN QUERY PLAN output usually contains SCAN {table} lines, it can serve as the feedback loop in such a tool: the tool creates a memory-based db with the same schema (and no data) and iterates with different CREATE INDEX commands, checking whether the SCAN {table} lines are gone or not.
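Done by hand, that feedback loop looks something like this (schema and query are invented for the illustration):

  -- an empty in-memory copy of the schema is enough; EXPLAIN QUERY PLAN
  -- is prepared without executing the query
  CREATE TABLE orders(id INTEGER PRIMARY KEY, customer TEXT, placed_at TEXT);

  EXPLAIN QUERY PLAN
    SELECT * FROM orders WHERE customer = 'x' ORDER BY placed_at;
  -- plan shows something like: SCAN orders (plus a temp B-tree for the ORDER BY)

  CREATE INDEX orders_cust_date ON orders(customer, placed_at);

  EXPLAIN QUERY PLAN
    SELECT * FROM orders WHERE customer = 'x' ORDER BY placed_at;
  -- the SCAN line is gone, replaced by something like:
  --   SEARCH orders USING INDEX orders_cust_date (customer=?)
  -- so the candidate index is kept, and the tool moves on to the next query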