SQLite Forum

Improve initial bulk insert performance of rtree index

(1.3) By Pieter (theroggy) on 2023-05-03 08:36:53 edited from 1.2 [source]

As you are probably aware, Geopackage is a popular geo data format based on SQLite.

It is a great format, but the creation of the spatial (rtree) index on larger files seems significantly slower than for some other formats. To illustrate the difference, I ran some tests using the GDAL GIS library to write a test dataset in several file formats. The test file is as simple as possible: the data table has only one column (the geometry) and contains a grid (= rectangles) with ~1 million rows:

  • Shapefile: write data: 15.05 secs, create + fill spatial index: 5.07 secs
  • FlatGeobuf: write data: 6.25 secs, create + fill spatial index: 7.15 secs
  • Geopackage: write data: 5.65 secs, create + fill spatial index: 14.26 secs

I also did a small test creating roughly the same rtree index as in the tests above in pure SQLite, without any dependency on GDAL. Filling up the rtree index that way took about 12 seconds, so comparable. The SQL statements used for this test can be found below in the post.

While looking up some more info about rtree indexes, I noticed that specific bulk-loading functions are common and can be significantly faster than a row-by-row insert for the initial loading of many rows.

I did a small test using the Python "rtree" library: loading the same 1 million rows in an rtree using that library took > 3 minutes using the row-by-row insert, but 9 seconds using the bulk/stream option. So, even though that library's row-by-row insert is a lot slower than the SQLite rtree implementation, its bulk option is (slightly) faster.

I'm not sure, but having a look at the SQLite rtree source code I didn't immediately find a specific code path for the initial/bulk loading of the rtree. If that is indeed the case, I wonder whether it would be possible to speed up the initial load of the index using, for example, such an approach.

SQL statements used:

-- Create table for bboxes
CREATE TABLE bboxes (
  id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
  minx FLOAT,
  maxx FLOAT,
  miny FLOAT,
  maxy FLOAT
);

-- Fill the table with 1 million bboxes
INSERT INTO bboxes(minx, maxx, miny, maxy)
  WITH RECURSIVE
    cnt(x) AS (
       SELECT 0
       UNION ALL
       SELECT x+10 FROM cnt
        LIMIT 1000
    )
  SELECT minx, maxx, miny, maxy 
    FROM (SELECT x AS minx, x+10 AS maxx FROM cnt) x,
         (SELECT x AS miny, x+10 AS maxy FROM cnt) y;

-- Create and fill up rtree.
CREATE VIRTUAL TABLE bboxes_rtree USING rtree(id, minx, maxx, miny, maxy);
INSERT INTO bboxes_rtree
  SELECT id, minx, maxx, miny, maxy FROM bboxes;

(2) By Richard Hipp (drh) on 2023-04-24 15:59:44 in reply to 1.1 [link] [source]

A percentage improvement in both performance and in the size of the resulting index can be obtained if you insert the boxes in a random order:

INSERT INTO bboxes_rtree
  SELECT id, minx, maxx, miny, maxy FROM bboxes ORDER BY random();

(3) By Richard Hipp (drh) on 2023-04-24 19:42:36 in reply to 1.1 [link] [source]

did a small test using the "rtree" library, and loading the same 1 million rows took > 3 minutes using the row-per-row insert,

What platform are you running on, and what version of SQLite are you using? I ask because when I build using "./configure --enable-rtree CFLAGS=-Os" on Linux or MacOS/m1, I'm seeing about 14 seconds to run the row-by-row insert test that you show in your original post - 10 seconds if I add the "ORDER BY random()" hack shown in my previous reply. What are you doing to get it to run for more than 3 minutes?

(4.4) By Pieter (theroggy) on 2023-04-25 07:15:43 edited from 4.3 in reply to 3 [link] [source]

With the "rtree" library I mean the python library called "rtree" I linked to two lines higher, not sqlite. So the library linked to with this link: bulk/stream loading in the (python) rtree library.

So the 3 minutes is not at all related to SQLite performance; it only illustrates that a bulk insert strategy can apparently improve initial insert performance dramatically (for that "rtree" library, from 3 minutes to 9 seconds). Also, in the case of this specific (non-SQLite) "rtree" library, they use Sort-Tile-Recursive (STR), the algorithm explained in the second link. For completeness: according to Google there are several options for rtree bulk loading algorithms.

(5) By Pieter (theroggy) on 2023-04-24 21:17:00 in reply to 3 [link] [source]

The timing of my run of the script was mentioned in this section:

I also did a small test creating roughly the same rtree index as in the tests above in pure SQLite, without any dependency on GDAL. Filling up the rtree index that way took about 12 seconds, so comparable. The SQL statements used for this test can be found below in the post.

So I got 12 seconds, tested on Windows 10, using the latest conda-forge build (SQLite 3.40.0). To be clear, this is the build from after this issue in the conda-forge packaging was fixed: sqlite-feedstock/issues/89.

(6.1) By Pieter (theroggy) on 2023-04-25 06:51:07 edited from 6.0 in reply to 3 [link] [source]

Not sure why, but I cannot reproduce the performance improvement using ORDER BY random().

For me, the INSERT into the rtree index with ORDER BY random() takes about 35 secs versus 12 secs without. Tested on the same environment as before:

  • on Windows 10,
  • using the latest conda-forge build (SQLite 3.40.0). To be clear, this is the build from after this issue in the conda-forge packaging was fixed: sqlite-feedstock/issues/89.

(7) By anonymous on 2023-04-26 12:24:36 in reply to 6.1 [link] [source]

Not sure why, but I cannot reproduce the performance improvement using ORDER BY random().

I made a test with a SpatiaLite Windows binary (spatialite.exe) from http://www.gaia-gis.it/gaia-sins/windows-bin-amd64/spatialite-tools-5.0.1-win-amd64.7z. The SQLite version seems to be 3.34.1.

I ran your commands in a memory database, and for me ORDER BY random() is faster. The timings are roughly 18 seconds vs 12 seconds. Filling the test table with one million rows takes about a second, so maybe the index creation feels a bit slow, but perhaps it is the best that the implementation of the R*Tree module can do.

-Jukka Rahkonen-

(8) By Richard Hipp (drh) on 2023-04-26 12:45:38 in reply to 7 [link] [source]

I have a change on the rtree-batch-insert branch that makes bulk insert much faster. However, this branch still contains bugs. I have some higher priority projects to work on at the moment. I'll try to get back to this branch and fix the problems as soon as I can.

If you would like to suggest bug fixes to this branch, your help will be welcomed. :-)

(9.1) By Pieter (theroggy) on 2023-04-26 15:38:26 edited from 9.0 in reply to 8 [link] [source]

Wow, a 20x performance improvement would be great.

I would love to help, but I'm afraid I haven't done any decent C/C++ programming in 15 years, so just getting a development/debug C environment up and running and getting used to the tooling again would take me quite some time...

I'll do some further testing, though, into why "ORDER BY random()" behaves so strangely in my tests.

(10.1) By Pieter (theroggy) on 2023-04-27 08:02:49 edited from 10.0 in reply to 6.1 [link] [source]

I had another look, and I can get the ~20% speed improvement of "ORDER BY random()" instead of the decreased performance if I increase the SQLite cache size. The -50000 (50 MB) cache size was chosen arbitrarily.

PRAGMA cache_size=-50000;

I tested this on Windows, using the conda-forge build of SQLite as well as the sqlite.exe downloaded from sqlite.org, and both gave the same results.
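
For reference, this is the combined sequence that gives the fast timing in my environment (the cache size value is arbitrary, as noted above):

PRAGMA cache_size=-50000;
INSERT INTO bboxes_rtree
  SELECT id, minx, maxx, miny, maxy FROM bboxes ORDER BY random();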

Some additional observations:

  • Using the increased cache size without ORDER BY random() doesn't give a speed difference.
  • If I insert the data into the data table with ORDER BY random() and then INSERT into the rtree virtual table without ORDER BY random(), I get exactly the same behaviour: 35 secs without the increased cache size, ~9 secs with it (see the sketch below this list). So it seems to be specifically the rtree building on data in random order that likes to have more cache in my environment, not the sorting.
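
A minimal sketch of that second experiment (the intermediate table name is made up for illustration):

-- Materialize the bboxes in random order first...
CREATE TABLE bboxes_random AS
  SELECT id, minx, maxx, miny, maxy FROM bboxes ORDER BY random();

-- ...then fill the rtree without any ORDER BY; the cache size effect is the same.
INSERT INTO bboxes_rtree
  SELECT id, minx, maxx, miny, maxy FROM bboxes_random;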

I also gave it a try with the spatialite.exe Jukka Rahkonen tested with, and I also got the faster timings, even though PRAGMA cache_size in spatialite.exe returns the default -2000 (2 MB cache). I did notice, though, that spatialite.exe on Windows reports being compiled using "x86_64-w64-mingw32". So apparently it is also compiled with a gcc compiler rather than msvc?

So I'm not sure at all, but based on this last test, the difference in ORDER BY random() performance might be due to the compiler used (gcc vs. msvc)?

(12) By Keith Medcalf (kmedcalf) on 2023-06-28 21:45:47 in reply to 10.1 [link] [source]

Yes. Setting an adequate page cache is always useful. If you are getting a hit ratio of less than 95%, then your cache size is insufficient.
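
For example (a sketch using the sqlite3 command-line shell; the counters come from the sqlite3_db_status() interface):

.stats on
INSERT INTO bboxes_rtree SELECT id, minx, maxx, miny, maxy FROM bboxes;
-- The per-statement output includes "Page cache hits" and "Page cache misses";
-- the hit ratio is hits / (hits + misses) and should stay at or above 0.95.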

This is especially important when inserting simultaneously into packed-too-full indexes. You need sufficient page cache to hold the entire parent tree(s) plus the entire changeset.

I/O takes time, and the fastest way to do I/O is not to do it -- that is, make sure the data is already/stays available in the application (user mode) cache.

The "filesystem or OS cache" is nice, but do not forget that it sits on the supervisor side of the computer, not the user side where your user programs are running. It is quite common to "thrash the cache" if you do not know how to tune the cache size properly. And every time you have to transition to supervisor mode takes time (and may deschedule your user process) so the fastest way to do I/O is to not do it and have the data you require already loaded in a user mode cache (ie, the application cache).

Doing an INSERT in fully random() order will ensure balanced behaviour for the multiple indexes involved; however, this will quickly devolve into massive slowness if your application (USER mode) cache is insufficient. You can only optimize for one insert order at a time. Where multiple indexes are involved, there are ways to determine an optimal insert order -- though usually random inserts are the fastest (or rather the most balanced expense).

(11.1) By Pieter (theroggy) on 2023-06-28 21:08:24 edited from 11.0 in reply to 1.3 [link] [source]

I found another open source rtree implementation in C: https://github.com/tidwall/rtree.c.

Based on its own performance benchmarks it seems very fast for both searching and inserting, even without a specific implementation for bulk insert.

In the readme there is some high level documentation on the differences in implementation that lead to the improved performance compared to the algorithms described in the original paper by Guttman.

Not sure if these performance optimizations could be applied in the sqlite rtree implementation... but it might be worth a look?

(13.3) By Pieter (theroggy) on 2023-09-13 09:28:07 edited from 13.2 in reply to 11.1 [link] [source]

The developer of the above implementation was so kind as to do some real-world testing of the possible performance improvements in SQLite. Based on the SQLite amalgamation, he integrated the optimizations from his rtree implementation into it and ran some tests.

In short: for larger numbers of rtree inserts (tested with 1 million), the optimizations give a ~50% throughput improvement for random inserts and ~100% when the rows are inserted in a spatially sorted order (Hilbert order). Also interesting to note: inserting in a spatially sorted/clustered order is significantly faster in general.
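
As an aside, a spatially sorted insert can be approximated in plain SQL with a space-filling curve. A minimal sketch, assuming the zorder extension used further down in this thread is loaded:

INSERT INTO bboxes_rtree
  SELECT id, minx, maxx, miny, maxy
    FROM bboxes
   ORDER BY zorder(minx+(maxx-minx)/2, miny+(maxy-miny)/2);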

Searching the index gets a ~5% improvement.

He put both the (diff of the) changes and the test results in the following github repo: tidwall/sqlite-rtree-optz

It would be great if you could give it a look.

For information, in the following github issue we exchanged some more ideas about this topic: tidwall/rtree.c/issues/2

(14) By Richard Hipp (drh) on 2023-09-13 16:08:01 in reply to 13.3 [link] [source]

Tidwall implements three changes:

  1. When choosing a candidate rect for insertion, the first rect that completely contains the new entry without needing to be enlarged is chosen, without checking any others.

  2. Implement a cache to help find (1) quickly if it exists.

  3. Omit the reinsert algorithm.

I tried implementing both (1) and (3), separately and together. Both result in indexes that are larger and slower to query. I don't know how you got a 5% search improvement - I always see a search degradation, in addition to requiring more storage space. And without (1), (2) is moot.

Nevertheless, change (1) is a hint that helped me to optimize INSERTs into RTree. When looking for a candidate rect, RTree now first looks for rects that completely enclose the new entry. If multiple existing rects completely enclose the new entry, the one with the smallest area is chosen. The previous sentence is the key difference from Tidwall's item (1) - all rects are still checked. There are now two passes over the rects. The first pass looks for rects that completely enclose the new entry. Only if none are found is the second (slower) pass performed, looking for the rect that needs the least amount of enlargement to accept the new entry. This enhancement is about 10% faster (12% better throughput). I have not yet found a way to increase INSERT performance that does not degrade SELECT performance.
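
Purely as an illustration of that decision logic (a model in SQL, not the actual C implementation), with a hypothetical table rects(nodeno, minx, maxx, miny, maxy) holding the candidate rects of one node, and the new entry's bounds supplied as :minx/:maxx/:miny/:maxy parameters:

-- Pass 1: among rects that completely enclose the new entry, take the smallest area.
SELECT nodeno
  FROM rects
 WHERE minx <= :minx AND maxx >= :maxx AND miny <= :miny AND maxy >= :maxy
 ORDER BY (maxx - minx) * (maxy - miny)
 LIMIT 1;

-- Pass 2, only when pass 1 finds nothing: take the rect needing the least enlargement.
SELECT nodeno
  FROM rects
 ORDER BY (max(maxx, :maxx) - min(minx, :minx)) * (max(maxy, :maxy) - min(miny, :miny))
          - (maxx - minx) * (maxy - miny)
 LIMIT 1;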

The enhancement is still on a branch, but you can expect it to land on trunk soon.

(15) By Richard Hipp (drh) on 2023-09-13 17:21:57 in reply to 14 [link] [source]

I must have made some measurement errors. I do sometimes see a performance reduction when Reinsert is disabled, but sometimes I see a performance increase. It depends on the query and on the input data. I did almost always see an increase in size of the index, but it isn't huge.

So I'm still running experiments. I might yet apply Tidwall's item (3).

(16) By Pieter (theroggy) on 2023-09-14 09:48:33 in reply to 14 [link] [source]

Thanks for looking into this!

The tests that were run by Tidwall were all based on points being inserted/queried. Maybe some of the differences in performance test results can be attributed to that?

The 5% search improvement is based on the test results Tidwall posted in tidwall/sqlite-rtree-optz

Current implementation:

  • RANDOM ORDER: search-1% 1,000 ops in 0.020 secs 19825.0 ns/op 50,441 op/sec
  • HILBERT ORDER: search-1% 1,000 ops in 0.019 secs 18665.0 ns/op 53,576 op/sec

Changed implementation:

  • RANDOM ORDER: search-1% 1,000 ops in 0.019 secs 19014.0 ns/op 52,592 op/sec
  • HILBERT ORDER: search-1% 1,000 ops in 0.018 secs 17624.0 ns/op 56,740 op/sec

For both search-1% tests the op/sec increased by ~5%.

Obviously this test is not thorough enough to draw big conclusions from, but at least for this specific test case it doesn't seem to give a search performance degradation.

(17.1) By Pieter (theroggy) on 2023-09-14 10:09:02 edited from 17.0 in reply to 14 [link] [source]

Based on his rationale for implementing (2), described in more detail here, I suppose this change will mainly make a (positive) difference if the data is inserted in an (at least somewhat) spatially ordered way.

Possibly the 100% throughput increase for inserting Hilbert-ordered data versus the 50% increase for inserting random data - based on his specific tests - can be attributed to this.

(18.3) By Pieter (theroggy) on 2023-09-20 18:17:48 edited from 18.2 in reply to 1.3 [link] [source]

Great that some significant optimizations were found to increase general insert performance. Thanks for looking into that.

To get back to the initial topic of this thread: this past weekend I encountered a bulk loading algorithm in a paper (algorithm on page 10) that is quite easy to implement, even using only SQL. According to the paper, the rtree it creates is very fast, both to create and to query. Actually filling the binary SQLite rtree data structures from SQL isn't possible as far as I know, but as a proof of concept I saved the bulk-insert rtree in tables with a similar layout.

I used the above test case again with a grid of 1 million rectangles. I ran the test on a Windows laptop:

  • Inserting them into the rtree virtual table with the current rtree implementation, in the order they are in, takes 19.4 s.
  • Inserting them into the rtree virtual table with the current rtree implementation in randomized order takes 12.9 s, if I use a 50 MB cache_size (= necessary on Windows).
  • The SQL proof of concept of a bulk insert takes 2.7 s, so 5 times faster than the random insert. A dedicated C implementation would probably be even faster.

This is a visualisation of the rtree created: [image]

SQL proof of concept. The zorder extension was downloaded here:

-- Create test data table
-- ----------------------
-- Load zorder extension
SELECT load_extension('zorder.dll');

--DROP TABLE bboxes;
--DROP TABLE bboxes_rtree;

-- Create table for bboxes
CREATE TABLE bboxes (
  id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
  minx FLOAT,
  maxx FLOAT,
  miny FLOAT,
  maxy FLOAT
);

-- Fill the table with 1 million bboxes
INSERT INTO bboxes(minx, maxx, miny, maxy)
  WITH RECURSIVE
    cnt(x) AS (
       SELECT 0
       UNION ALL
       SELECT x+10 FROM cnt
        LIMIT 1000
    )
  SELECT minx, maxx, miny, maxy
    FROM (SELECT x AS minx, x+10 AS maxx FROM cnt) x,
         (SELECT x AS miny, x+10 AS maxy FROM cnt) y;

--PRAGMA cache_size=-2000;
--PRAGMA cache_size=-50000;
--PRAGMA cache_size;

-- Create rtree index
CREATE VIRTUAL TABLE bboxes_rtree USING rtree(id, minx, maxx, miny, maxy);

-- Fill up in current order
-- 19.4 s (timings on my laptop on windows)
--INSERT INTO bboxes_rtree SELECT id, minx, maxx, miny, maxy FROM bboxes;

-- Fill up in random order
-- 31.886 s with default cache_size = 2 MB
-- 12.9 s with cache_size = 50 MB
INSERT INTO bboxes_rtree
  SELECT id, minx, maxx, miny, maxy FROM bboxes order by random();

-- Fill up in zorder/morton code order
-- Remark: the DESC sorted insert is faster than random, but I cannot explain why it is
-- that much slower to insert sorted ASC.
-- 15.5 s if ordered ASC
-- 11.6 s if ordered DESC
--INSERT INTO bboxes_rtree
--  SELECT id, minx, maxx, miny, maxy 
--  FROM bboxes order by zorder(minx+(maxx-minx)/2, miny+(maxy-miny)/2) DESC;

-- BULK RTREE TEST
-- ---------------
--DROP TABLE bboxes_rtree_node_bulk;
--DROP TABLE bboxes_rtree_parent_bulk;
--DROP TABLE bboxes_rtree_rowid_bulk;

CREATE TABLE bboxes_rtree_rowid_bulk (
    rowid  INTEGER PRIMARY KEY,
    nodeno INTEGER,
    minx   FLOAT,
    maxx   FLOAT,
    miny   FLOAT,
    maxy   FLOAT
);

CREATE TABLE bboxes_rtree_node_bulk (
    nodeno INTEGER PRIMARY KEY,
    level INTEGER,
    minx   FLOAT,
    maxx   FLOAT,
    miny   FLOAT,
    maxy   FLOAT
);

CREATE TABLE bboxes_rtree_parent_bulk (
    nodeno     INTEGER PRIMARY KEY,
    parentnode INTEGER
);

-- Group leaves per 50 elements, based on zorder sorting (leaf nodenos start at 2,
-- because nodeno 1 is reserved for the root node)
-- 2.053 s
INSERT INTO bboxes_rtree_rowid_bulk
  SELECT rowid, (row_number() over ()-1)/50+2 as nodeno, minx, maxx, miny, maxy
    FROM (SELECT rowid, * FROM bboxes 
          ORDER BY zorder(minx+(maxx-minx)/2, miny+(maxy-miny)/2)
         );

--DELETE FROM bboxes_rtree_rowid_bulk;
--SELECT nodeno, count(*) nb FROM bboxes_rtree_rowid_bulk group by nodeno order by nb asc;

-- Determine the MBR of the level 0 nodes based on the leaves + insert in 
-- bboxes_rtree_node_bulk
-- 0.537 s
INSERT INTO bboxes_rtree_node_bulk
  SELECT nodeno, 0, min(minx), max(maxx), min(miny), max(maxy)
    FROM bboxes_rtree_rowid_bulk 
   GROUP BY nodeno;

--DELETE FROM bboxes_rtree_node_bulk;

-- Group level 0 nodes per 50 elements ordered by nodeno (new parent nodenos
-- continue after the highest nodeno used so far)
-- 0.036 s
INSERT INTO bboxes_rtree_parent_bulk
  SELECT nodeno
        ,(SELECT MAX(nodeno) 
          FROM bboxes_rtree_node_bulk
         ) + (row_number() over ()-1)/50+1 AS parentnode
    FROM (SELECT rowid, * FROM bboxes_rtree_node_bulk
          WHERE level = 0
          ORDER BY nodeno
         );

--DELETE FROM bboxes_rtree_parent_bulk;

-- Determine the MBR of the level 1 nodes based on the level 0 nodes + insert in
-- bboxes_rtree_node_bulk
-- 0.019 s
INSERT INTO bboxes_rtree_node_bulk
  SELECT parentnode, 1, min(minx), max(maxx), min(miny), max(maxy)
    FROM bboxes_rtree_parent_bulk rtree_parent
    JOIN bboxes_rtree_node_bulk rtree_node ON rtree_parent.nodeno = rtree_node.nodeno
   WHERE rtree_node.level = 0
   GROUP BY rtree_parent.parentnode;

--SELECT * FROM bboxes_rtree_node_bulk WHERE level = 1;

-- Group level 1 nodes per 50 elements ordered by nodeno
-- 0.017 s
INSERT INTO bboxes_rtree_parent_bulk
  SELECT nodeno
        ,(SELECT MAX(nodeno) 
          FROM bboxes_rtree_node_bulk
         ) + (row_number() over ()-1)/50+1 as parentnode
    FROM (SELECT rowid, * FROM bboxes_rtree_node_bulk
          WHERE level = 1
          ORDER BY nodeno
         );

-- Determine the MBR of the level 2 nodes based on the level 1 nodes + insert in
-- bboxes_rtree_node_bulk
-- 0.018 s
INSERT INTO bboxes_rtree_node_bulk
  SELECT parentnode, 2, min(minx), max(maxx), min(miny), max(maxy)
    FROM bboxes_rtree_parent_bulk rtree_parent
    JOIN bboxes_rtree_node_bulk rtree_node ON rtree_parent.nodeno = rtree_node.nodeno
   WHERE rtree_node.level = 1
   GROUP BY rtree_parent.parentnode;

-- Determine the MBR of the root node (nodeno 1) + insert in bboxes_rtree_node_bulk
-- 0.011
INSERT INTO bboxes_rtree_node_bulk
  SELECT 1, 3, min(minx), max(maxx), min(miny), max(maxy)
    FROM bboxes_rtree_node_bulk 
   WHERE level = 2;
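
-- As a quick sanity check that the proof-of-concept tables can answer queries,
-- a window query can be written in plain SQL. This sketch prunes on the level 0
-- node MBRs instead of descending from the root, and the search window values
-- are made up:
SELECT r.rowid
  FROM bboxes_rtree_rowid_bulk r
  JOIN (SELECT nodeno
          FROM bboxes_rtree_node_bulk
         WHERE level = 0
           AND maxx >= 100 AND minx <= 200
           AND maxy >= 100 AND miny <= 200) n
    ON r.nodeno = n.nodeno
 WHERE r.maxx >= 100 AND r.minx <= 200
   AND r.maxy >= 100 AND r.miny <= 200;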

(19) By Even Rouault (rouault) on 2023-10-22 23:08:07 in reply to 1.3 [link] [source]

Hi,

I wanted to share some very recent developments done to speed up RTree creation. I've created a C library, sqlite_rtree_bulk_load, used by GDAL, which creates an in-memory R*Tree and then populates the SQLite _node, _parent and _rowid tables from that. It borrows massively from tidwall/rtree.c, and a bit from the SQLite rtree.c itself, to bring in the R*Tree specificities (in particular the enhanced node splitting).

On a 3.2 million row table, this decreases the R*Tree creation (using sqlite master, which already has the optimization to disable forced reinsertion) from 26 seconds to 6 seconds. Of course, this comes at the expense of RAM usage (~124 MB in that example).

(20) By JT Ar (jtarchie) on 2023-11-06 11:48:07 in reply to 19 [link] [source]

I am going to give this a shot. I've been trying to convert OSM PBF files into SQLite databases using rtree, and I believe I'm hitting the ceiling of the insertion size (100m). It looks like it just sits there on the insert: I've let it run for a day, and nothing, the INSERT INTO is still executing.

(21) By Pieter (theroggy) on 2023-11-06 13:11:33 in reply to 20 [link] [source]

I don't know how you are doing this conversion, but if you are using GDAL for it... the version of GDAL that uses this new way of creating the index, version 3.8, is being released as we speak. So, if you wait a few more hours or days...

A second remark, to be clear: this new way to create the index technically does not use a "real" rtree bulk-loading algorithm, so most likely there is still room for significant improvement.

(22.1) By JT Ar (jtarchie) on 2023-11-07 00:17:10 edited from 22.0 in reply to 21 [link] [source]

I am doing it directly in SQL with an INSERT INTO. I am building my own database, for proprietary reasons.

CREATE VIRTUAL TABLE entry_rtrees USING rtree(
  id,
  -- Corresponds to the id in the entries table
  minLat,
  maxLat,
  -- Minimum and maximum latitude
  minLon,
  maxLon
  -- Minimum and maximum longitude
);

INSERT INTO
  entry_rtrees (id, minLat, maxLat, minLon, maxLon)
  SELECT
    id,
    minLat,
    maxLat,
    minLon,
    maxLon
  FROM
    entries;

This is the simplest form I have. The entries table holds pre-calculated bounding boxes of the things I need, and the rtree is generated based on that.

With GDAL, is it using the SQLite library under the hood? That might be the best way.

Edit: I was looking at the code for the rtree bulk insert. It looks like it was built specifically for your database. I wonder if I could pull it in for my work, however; it would require some changes.

(23.5) By Pieter (theroggy) on 2023-11-07 16:01:49 edited from 23.4 in reply to 22.1 [link] [source]

The GDAL library mainly offers support to read/write/convert many GIS data formats using one API.

Some of these GIS formats are based on SQLite, so for those formats/drivers GDAL uses SQLite under the hood. E.g. Geopackage, an ISO-certified standard for geospatial data, and Spatialite. Those formats typically use the SQLite rtree mechanism to store their spatial index...

For information, OSM PBF (also based on sqlite, if I remember correctly) is also supported.

It depends a bit on how specific your needs are whether it is a good idea to use GDAL, to use the "new" code directly, or (if you need/like even better performance) to adapt a real bulk insert algorithm to create the rtree index faster.