SQLite Forum

another performance question related to `geopoly`
Login

another performance question related to `geopoly`

(1) By punkish on 2020-09-15 04:20:51 [link] [source]

I am back with a performance question about geopoly_within using the geopoly extension. First, the schema (showing only the columns relevant to the queries)


-- ~350K rows
CREATE TABLE treatments (  
    treatmentId TEXT NOT NULL UNIQUE, 
    treatmentTitle TEXT
);

-- ~397008 rows of which ~135K have latitude and longitude values
CREATE TABLE materialsCitations ( 
    materialsCitationId TEXT NOT NULL, 
    treatmentId TEXT NOT NULL, 
    latitude REAL, 
    longitude REAL,
    UNIQUE (materialsCitationId, treatmentId)
);

CREATE VIRTUAL TABLE vtreatments USING FTS5(treatmentId, fullText);
/* vtreatments(treatmentId,fullText) */;

CREATE VIRTUAL TABLE vlocations USING geopoly(treatmentId, materialsCitationId)
/* vlocations(_shape,treatmentId,materialsCitationId) */;

Since I only have point data, I have filled the vlocations table with really tiny polygons constructed around each point. Now, on to the queries…

First, a fulltext search which is very fast, as expected

sqlite> .timer on
sqlite> EXPLAIN QUERY PLAN SELECT Count(*) AS num FROM vtreatments WHERE vtreatments MATCH 'Meshram';
QUERY PLAN
∟--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
Run Time: real 0.000 user 0.000070 sys 0.000010

sqlite> SELECT Count(*) AS num FROM vtreatments WHERE vtreatments MATCH 'Meshram';
35
Run Time: real 0.014 user 0.001111 sys 0.003128

Then, given the following poly, using a radius of 50 kms around "lat":25.6532, "lng":3.48

poly = '[[3.9291576420597614,25.653199999999963],[3.9205272039128367,25.57418480393311],[3.8949675523700438,25.498156902884112],[3.853460930506155,25.428046553409356],[3.797602414522217,25.366560752342917],[3.7295386158616766,25.316077273341982],[3.6518851881364403,25.278551030320266],[3.567626309025444,25.25543670701288],[3.4800000000000004,25.24763096138966],[3.3923736909745563,25.25543670701288],[3.30811481186356,25.278551030320266],[3.230461384138324,25.316077273341982],[3.1623975854777835,25.366560752342917],[3.106539069493846,25.428046553409356],[3.0650324476299566,25.498156902884112],[3.0394727960871637,25.57418480393311],[3.03084235794024,25.653199999999963],[3.0394727960871637,25.73216289757996],[3.0650324476299566,25.808041865571695],[3.106539069493846,25.877929322171752],[3.162397585477784,25.93915220501075],[3.2304613841383247,25.989372767781138],[3.3081148118635615,26.026676122749052],[3.3923736909745577,26.049641517820053],[3.4800000000000013,26.05739496695326],[3.5676263090254454,26.049641517820053],[3.6518851881364416,26.026676122749052],[3.7295386158616783,25.989372767781138],[3.7976024145222183,25.93915220501075],[3.853460930506156,25.877929322171752],[3.894967552370044,25.808041865571695],[3.920527203912837,25.73216289757995],[3.9291576420597614,25.653199999999963]]'

a geopoly_within search which is not as fast as an FTS search, but still reasonable

sqlite> EXPLAIN QUERY PLAN SELECT Count(*) AS num FROM vlocations WHERE geopoly_within(_shape, poly) != 0;
QUERY PLAN
∟--SCAN TABLE vlocations VIRTUAL TABLE INDEX 4:fullscan
Run Time: real 0.000 user 0.000087 sys 0.000019

sqlite> SELECT Count(*) AS num FROM vlocations WHERE geopoly_within(_shape, poly) != 0;
2
Run Time: real 1.606 user 1.527031 sys 0.059115

Finally, the really problematic query wherein I combine the above two queries

sqlite> EXPLAIN QUERY PLAN SELECT Count(*) AS num FROM treatments JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId JOIN vlocations ON treatments.treatmentId = vlocations.treatmentId WHERE vtreatments MATCH 'Meshram' AND geopoly_within(_shape, poly) != 0;
QUERY PLAN
|--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
|--SEARCH TABLE treatments USING COVERING INDEX sqlite_autoindex_treatments_1 (treatmentId=?)
∟--SCAN TABLE vlocations VIRTUAL TABLE INDEX 4:fullscan
Run Time: real 0.002 user 0.000289 sys 0.001414

sqlite> SELECT Count(*) AS num FROM treatments JOIN vtreatments ON treatments.treatmentId
= vtreatments.treatmentId JOIN vlocations ON treatments.treatmentId = vlocations.treatmentId WHERE vtreatments MATCH 'Meshram' AND geopoly_within(_shape, poly) != 0;
2
Run Time: real 46.382 user 44.862766 sys 1.394483

For comparison, below I am using FTS5 but not the VIRTUAL vlocations table

sqlite> EXPLAIN QUERY PLAN SELECT Count(*) AS num FROM treatments JOIN materialsCitations ON treatments.treatmentId = materialsCitations.treatmentId JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId WHERE vtreatments MATCH 'meshram' AND latitude = 25.6532 AND longitude = 3.48;
QUERY PLAN
|--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
|--SEARCH TABLE treatments USING COVERING INDEX sqlite_autoindex_treatments_1 (treatmentId=?)
∟--SEARCH TABLE materialsCitations USING INDEX ix_materialsCitations_treatmentId (ANY(deleted) AND treatmentId=?)
Run Time: real 0.001 user 0.000278 sys 0.000701

sqlite> SELECT Count(*) AS num FROM treatments JOIN materialsCitations ON treatments.treatmentId = materialsCitations.treatmentId JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId WHERE vtreatments MATCH 'meshram' AND latitude = 25.6532 AND longitude = 3.48;
2
Run Time: real 0.047 user 0.002550 sys 0.012607

And finally, the same query as above except that I am searching inside a box to simulate a geopoly_within query

sqlite> EXPLAIN QUERY PLAN SELECT Count(*) AS num FROM treatments JOIN materialsCitations ON treatments.treatmentId = materialsCitations.treatmentId JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId WHERE vtreatments MATCH 'meshram' AND latitude BETWEEN 25.6530 AND 25.6534 AND longitude BETWEEN 3.47 AND 3.49;
QUERY PLAN
|--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
|--SEARCH TABLE treatments USING COVERING INDEX sqlite_autoindex_treatments_1 (treatmentId=?)
∟--SEARCH TABLE materialsCitations USING INDEX ix_materialsCitations_treatmentId (ANY(deleted) AND treatmentId=?)
Run Time: real 0.000 user 0.000237 sys 0.000185

sqlite> SELECT Count(*) AS num FROM treatments JOIN materialsCitations ON treatments.treatmentId = materialsCitations.treatmentId JOIN vtreatments ON treatments.treatmentId = vtreatments.treatmentId WHERE vtreatments MATCH 'meshram' AND latitude BETWEEN 25.6530 AND 25.6534 AND longitude BETWEEN 3.47 AND 3.49;
2
Run Time: real 0.002 user 0.001776 sys 0.000296

Summary: the VIRTUAL locations queries are a few orders of magnitude slower than FTS5 queries, but combining VIRTUAL location with VIRTUAL text really slows down to a crawl. Seems like I am better off completely abandoning RTree index and geopoly. I am using the latter only because it enables me to use geopoly_within, thereby allowing me to do radius searches.

Question: Am I doing something wrong? Can I improve the performance of geopoly, esp when joining a geopoly table to other tables?

(2) By punkish on 2020-09-15 09:24:06 in reply to 1 [link] [source]

I can't figure out why my geopoly code is slow and how to make it fast. It also seems that either I am asking the question in a wrong way or there is not much interest or insight in the workings of geopoly (RTree) on this forum since three of my geopoly related questions have gone without a reply. Consequently, I have decided to stop using geopoly esp since a standard BETWEEN clause seems to work very well (as detailed in my last example above).

In case someone else stumbles upon this with a similar problem – I am using turf.js library to create a bbox around my point with the provided radius (actually I first create a buffer and then create a bbox for the buffer). That gives me the min and max lat and lng which I use in my latitude BETWEEN min_lat AND max_lat and longitude BETWEEN min_lng AND max_lng SQL. Hope this is helpful.

(4) By anonymous on 2020-09-15 13:03:45 in reply to 2 [source]

I have a similar use case which selects lines within proximity of a point. I used RTrees for this and when GeoPoly became available I tried that but as you did, found it far too slow. However, I didn't need it because I wasn't dealing with polygons as you aren't.

Although GeoPoly and RTrees are related and one is dependent on the other, there is a difference and you should ask yourself if you really need GeoPoly and maybe a straight RTree would be better and it would certainly be faster. Polygons after all, are more complex and require much more processing.

It's likely that neither of these functions will perform all the functions you need and only serve as an index to narrow the search so that your application can process the results. A straight RTree query will be more approximate than GeoPoly function because it is all rectangles, but again it depends on your needs if that will suffice.

Constructing the index is as simple as providing max/min coordinates half the required size less and more than the point coords, and the rest is done with the query.

I would encourage you to have a good read of the RTree page to see if that will work for you, in particular the simplified new Auxiliary Columns section so that subqueries are not required in the WHERE clause for the index.

If you need more then perhaps something like Spatialite might be useful, which is purpose built for the task and provides a very quick KNN function.

(6) By punkish on 2020-09-15 14:29:26 in reply to 4 [link] [source]

the reason I was using geopoly was because I wanted to use the geopoly_within function which, as the docs imply, is supposed to be very fast because it uses the RTree indexes. I must say, this is the first instance of being disappointed at SQLite's performance. I am hardly better than an intermediate user, but I have been using SQLite forever. And I don't ever recall throwing anything at it and not being amazed at how wonderfully it performs. Until now, with geopoly. Given the additional overhead of having to create and maintain a VIRTUAL locations table, it hardly seems worth the effort given its lackluster performance.

In my case, using RTree alone makes no sense as I can get very fast (though slightly less precise) results doing just a naive bbox query using min and max lats and lngs.

And no, Spatialite is not an option. I would much rather just jump to PostGIS if it ever came to that. Though I seriously hope it won't. I do love the simplicity and agility of SQLite too much to give it up (easily).

(3) By Keith Medcalf (kmedcalf) on 2020-09-15 10:20:42 in reply to 1 [link] [source]

Have you tried something like:

select count(*) 
  from treatments
 where treatmentId in (select treatmentid from vtreatments where vtreatments MATCH 'meshram'
                       INTERSECT
                       select treatmentid from vlocations where geopoly_within(_shape, poly)
;

By the way, you can use .eqp on so that you do not need to issue the same SQL statement twice to get the EXPLAIN QUERY PLAN output together with the results.

(5) By punkish on 2020-09-15 14:23:38 in reply to 3 [link] [source]

Thanks Keith. I tried your suggestion, and yes, it is much better, much more performant. It is as fast as doing a simple geopoly query in the examples in my OP. Here it is

SQLite version 3.32.3 2020-06-18 14:00:33
Enter ".help" for usage hints.
sqlite> .eqp on
sqlite> .timer on
sqlite> SELECT Count(*) FROM treatments WHERE treatmentId IN (SELECT treatmentid FROM vtreatments WHERE vtreatments MATCH 'meshram' INTERSECT SELECT treatmentid FROM vlocations WHERE geopoly_within(_shape, '[[3.9291576420597614,25.653199999999963],[3.9205272039128367,25.57418480393311],[3.8949675523700438,25.498156902884112],[3.853460930506155,25.428046553409356],[3.797602414522217,25.366560752342917],[3.7295386158616766,25.316077273341982],[3.6518851881364403,25.278551030320266],[3.567626309025444,25.25543670701288],[3.4800000000000004,25.24763096138966],[3.3923736909745563,25.25543670701288],[3.30811481186356,25.278551030320266],[3.230461384138324,25.316077273341982],[3.1623975854777835,25.366560752342917],[3.106539069493846,25.428046553409356],[3.0650324476299566,25.498156902884112],[3.0394727960871637,25.57418480393311],[3.03084235794024,25.653199999999963],[3.0394727960871637,25.73216289757996],[3.0650324476299566,25.808041865571695],[3.106539069493846,25.877929322171752],[3.162397585477784,25.93915220501075],[3.2304613841383247,25.989372767781138],[3.3081148118635615,26.026676122749052],[3.3923736909745577,26.049641517820053],[3.4800000000000013,26.05739496695326],[3.5676263090254454,26.049641517820053],[3.6518851881364416,26.026676122749052],[3.7295386158616783,25.989372767781138],[3.7976024145222183,25.93915220501075],[3.853460930506156,25.877929322171752],[3.894967552370044,25.808041865571695],[3.920527203912837,25.73216289757995],[3.9291576420597614,25.653199999999963]]') != 0);
QUERY PLAN
|--SEARCH TABLE treatments USING COVERING INDEX sqlite_autoindex_treatments_1 (treatmentId=?)
`--LIST SUBQUERY 2
   `--COMPOUND QUERY
      |--LEFT-MOST SUBQUERY
      |  `--SCAN TABLE vtreatments VIRTUAL TABLE INDEX 0:m
      `--INTERSECT USING TEMP B-TREE
         `--SCAN TABLE vlocations VIRTUAL TABLE INDEX 4:fullscan
1
Run Time: real 1.813 user 1.594897 sys 0.087140

It also give a better result because of the tighter selection (I get a count = 1 as opposed to a count = 2 for my bbox query where I search for latitude BETWEEN min_lat AND max_lat AND longitude BETWEEN min_lng AND max_lng. But the bbox query is much faster. This INTERSECT approach is also much more complicated for me to construct programmatically. So, I think I will stick to my bbox methodology. But how on earth would I have discovered this if you hadn't been around? I ask this seriously in a quest to understand what is going on?

And why are RTree queries (geopoly uses RTree in the background) so slow as compared to, say, FTS5 which is like magic.