What indexes help Window Functions
I have a table with many columns of numbers, and I am calculating various statistics on these numbers using window functions. In particular, I am assigning each row a rank (based on both rank() as well as row_number()), a percentile rank (cume_dist()), a percentile bin (ntile(100)), a running sum and a running average. I was wondering what indexes it might be helpful to create and maintain on this table so that these window functions work as quickly and efficiently as possible. Would it be helpful to create separate indexes on all the numerical columns in this table? Are there rules of thumb for when too many indexes cause a slowdown because their maintenance uses up more time than I save while running queries that use those indexes? For context, there are about 20 numerical columns in the table and the table seldom contains more than 5000 rows. Right now, my query for calculating these statistics takes about 15 to 20 seconds. Thank you.
Imagine you were doing the task manually – yourself, using pencil and paper – but had a magic device which made indexes for you. Like the people whose job it is to make the index for a book. Figure out which index you'd use for each task. If your task involves a window, you have to pretend to do that yourself too. Remember that each query can use either zero or one index, no more. There's no way to switch between two indexes during one search.
Also remember that indexes can contain multiple columns. Don't think of each column being a candidate for an index, think of each SELECT or WHERE clause as potentially having a "most helpful" index, perhaps a compound (multicolumn) index. For instance, in indexing a long phone book you wouldn't have two indexes, one on givenname, one on familyname; you'd be better off with a single index on (familyname, givenname).
Creating indexes you don't use is, as you wrote, wasteful. Not only does the work of maintaining the index take time, but the extra pages of space that the index uses in the database file make it bigger, more annoying to handle, and need more resources each time you back it up. One does try to avoid it.
(3.1) By Keith Medcalf (kmedcalf) on 2021-04-05 06:23:10 edited from 3.0 in reply to 1 [link] [source]
All the columns in the partition by clause, in order, followed by all the columns in the order by clause, in order. If you wish the index to be covering then you need to append any columns used in filter expressions, followed by any columns actually used in calculations or in the output projection.
Thank you, Simon and Keith. Your answers make perfect sense. I had done some other research on the internet and the consensus seems to be that it is best to index on the partition by columns and then on the sort by columns.
But one part of Simon's answer makes me think either this whole exercise needs to be rethought from scratch or I need to give up on making my query more efficient using just indexes. And that part is the sentence "a query can use zero or one index".
Right now, in my query, I have 14 different windows because I am calculating these stats for 14 different measures. Obviously each window partitions and sorts by a different measure. I could create an index for each of them, but if the query can use only one of those indexes, it probably won't make a huge difference. Either I have to live with the time it takes my query right now (which is not bad at all), or I have to come up with a different query structure where I have one window in each query, and then join up the results of all those queries to get the final output I want. I have a feeling the latter will be slower than what I currently have even if each of the individual queries are sped up greatly by the creation of better indexes.
Thanks for your insights and help.
I don't think it's quite true that "a query can use zero or one index". That would be true for a simple projection query, selecting from a single table. But in general subqueries and joins will involve multiple tables and possibly as many indexes as table usages.
Looking at query plans will help you decide what indexes may help and do help.
The 'one index per query' is not correct. The number of indexes used will generally be limited to one each per table involved in the query. Note that something like:
select a, b, ( select sum(c) from t where a == o.a ) from t as o order by b ;
uses two tables and each one will use a different index (if available). The most efficient index for the outer query is t(b) (to be able to traverse the rows of t in order and avoid having to sort) and that the most efficient index for the correlated subquery will be a covering index t(a,c) to permit the appropriate traversal of the table in order and generating the sum without accessing the underlying table).
The CLI has a
.expert command that will attempt to tell you what indexes, if available) would produce the least-cost solution.
In fact, the most efficient index for the outer query would be a t(b,a) covering index so that access to the underlying table is not required.
If there were both an index on t(b,a) and t(a,c) then the above query could be solved by scanning/searching the indexes only and no access to the underlying table would be required at all.
(8) By David Raymond (dvdraymond) on 2021-04-05 17:58:15 in reply to 4 [link] [source]
Maybe "one index per table instance in the query" is more accurate?
What we're saying is that if you have a table t with fields a and b, with an index on a and an index on b, then in...
select * from t where a = 5 or b = 7;
Then SQLite with use only one of those two indexes, or do a full table scan. It will not use the A index for "where a = 5", use the B index for "or b = 7" and then de-duplicate the results.
If you have
select * from t as t1 join t as t2 on t1.id1 = t2.id2 where t1.a = 12 or t2.b = 42;
then again for t1 it will use one index or the full table, and for t2 it will use one index or the full table. It can't use an index on t2.id2 to speed up the ON clause and use an index on t2.b to speed up the WHERE and do some cross comparing. It can only pick one.
...or something like that.
Your latter query best demonstrates what it is that the query planner does.
Taking the indexes and tables available it then enumerates all the possible solution methods to the query. It may also consider generating indexes just for the duration of the single query. It then decides which of the possible millions of alternate solution methods is likely to have the "lowest cost" given all the available information (including distribution statistics for the tables and indexes, if available).
In the 'olden days' professional programmers were responsible for designing the access path and writing the code to execute the retrieval efficiently.
The whole purpose of SQL is that it is a declarative language. You describe 'what' you want, not 'how' to get it. And the computer using all the information available to it computes the most efficient 'how' to to get your 'what' each time it is requested. After all, the purpose of computers is to compute.
The creation of indexes is not really an optimization. The Query Planner will obtain your 'what' according to its computation of the most efficient 'how' giving appropriate consideration to the extant circumstances. Adding indexes and statistics gives the Query Planner more 'possibilities' to consider when generating the 'how' to get 'what' you asked for but does not actually change the process or the result, merely perhaps the ability to generate a better 'how' for your 'what'.
Everyone is right. I should have written 'zero or one index per table'. The various responses to the post are excellent and I have nothing to add.
Thank you, everyone. Very educational for me. Since my entire query uses just one table with windows on multiple fields in that table, there doesn't seem to be a way to create an index that will encompass the needs of each of those windows.