Clustering tables on an expression index

(1) By Andrea Aime (andrea.aime) on 2020-04-30 13:37:58 [link]

Hi all,
I'm trying to figure out if it's possible to cluster a table along an expression index. 
In particular, I'd like to cluster a table on a index that would take a geometry, and compute the geohash of it, before creating a R-Tree index, in a similar way to what it's done in PostGIS to create balanced R-Tree indexes:
https://postgis.net/workshops/postgis-intro/clusterindex.html

The objective is to improve the index layout and thus the search times.

Now, the geohash function I could add as a custom function, but I cannot find a simple way to cluster a table along a index

The closest I got is this page, https://www.sqlite.org/withoutrowid.html, but it would require to create the table without rowids and the new key would have to be unique.

Maybe I could insert the data using geohash, find some way to append bits to the geohash making it unique (two geometries close enough will get the same geohash).
And then I guess the table would be clustered... but I don't need the geohash, and it cannot be removed, right?

Then I guess I could create If I created a new table, would copying the data keep the order? If so, then the r-tree could be created on top of the second table, in a way that's hopefully clustered.

Would it work? Any better way to achieve the same?

Best regards
Andrea

(2) By ddevienne on 2020-04-30 15:12:37 in reply to 1

A table's rows are ordered either in rowid-order (or in PK-order for  
without-rowid tables), there's nothing you can do about that. Modulo  
page-fragmentation of course, which you get rid of with vacuum.

So either you do your inserts in geohash-order, when you don't want to  
control the rowid, or you alias the rowid with an INTEGER PRIMARY KEY  
column, which is derived from your geohash **and** is UNIQUE of course...  
That's not too different from your without-rowid analysis, which is correct to me.

If you don't care about controlling the rowid, you can also do a [CTAS][CTAS]  
to *re-order* your table into another one, and replace (delete+rename) the  
old one. Because CTAS's doc says:

> Rows are assigned contiguously ascending rowid values, starting with 1,  
> in the order that they are returned by the SELECT statement

Thus if the CTAS's right-hand-side SELECT is ordered by GeoHash, so will  
the table be as well. That's basically equivalent to a _custom vacuum_  
for that table, kinda...  But in all cases, that geohash ordering  
(i.e. clustering) won't be maintained against further DMLs, which might  
be OK if you're *read-mostly*.

I guess you could also create an non-unique covering index, with the geohash  
as the first column, which would act as a kind of _shadow table_, with enough  
columns in it that most queries don't need to go back to the _main_ table.  
You gain automatic maintenance of that index, in geohash order, but at the cost  
of duplicating your table...

[CTAS]: https://www.sqlite.org/lang_createtable.html#createtabas