Clustering tables on an expression index

(1) By Andrea Aime (andrea.aime) on 2020-04-30 13:37:58 [source]

Hi all, I'm trying to figure out if it's possible to cluster a table along an expression index. In particular, I'd like to cluster a table on a index that would take a geometry, and compute the geohash of it, before creating a R-Tree index, in a similar way to what it's done in PostGIS to create balanced R-Tree indexes: https://postgis.net/workshops/postgis-intro/clusterindex.html

The objective is to improve the index layout and thus the search times.

Now, the geohash function I could add as a custom function, but I cannot find a simple way to cluster a table along a index

The closest I got is this page, https://www.sqlite.org/withoutrowid.html, but it would require to create the table without rowids and the new key would have to be unique.

Maybe I could insert the data using geohash, find some way to append bits to the geohash making it unique (two geometries close enough will get the same geohash). And then I guess the table would be clustered... but I don't need the geohash, and it cannot be removed, right?

Then I guess I could create If I created a new table, would copying the data keep the order? If so, then the r-tree could be created on top of the second table, in a way that's hopefully clustered.

Would it work? Any better way to achieve the same?

Best regards Andrea

(2) By ddevienne on 2020-04-30 15:12:37 in reply to 1 [link] [source]

A table's rows are ordered either in rowid-order (or in PK-order for
without-rowid tables), there's nothing you can do about that. Modulo
page-fragmentation of course, which you get rid of with vacuum.

So either you do your inserts in geohash-order, when you don't want to
control the rowid, or you alias the rowid with an INTEGER PRIMARY KEY
column, which is derived from your geohash and is UNIQUE of course...
That's not too different from your without-rowid analysis, which is correct to me.

If you don't care about controlling the rowid, you can also do a CTAS
to re-order your table into another one, and replace (delete+rename) the
old one. Because CTAS's doc says:

Rows are assigned contiguously ascending rowid values, starting with 1,
in the order that they are returned by the SELECT statement

Thus if the CTAS's right-hand-side SELECT is ordered by GeoHash, so will
the table be as well. That's basically equivalent to a custom vacuum
for that table, kinda... But in all cases, that geohash ordering
(i.e. clustering) won't be maintained against further DMLs, which might
be OK if you're read-mostly.

I guess you could also create an non-unique covering index, with the geohash
as the first column, which would act as a kind of shadow table, with enough
columns in it that most queries don't need to go back to the main table.
You gain automatic maintenance of that index, in geohash order, but at the cost
of duplicating your table...