Looking for cause of slowdown in RTree app
Would anyone expect that adding two non-indexed columns to one of three tables, two regular and one RTree, comprising a couple dozen columns over two DB files, cause ingest to slow down by a factor of four or more?
The app repo sub-directory: https://github.com/drbitboy/Tycho2_SQLite_RTree/tree/master/gaia
The diff: https://github.com/drbitboy/Tycho2_SQLite_RTree/commit/8ded77765d76794f8ceaf91a4d58cef0639d0fd8#diff-fa44835d0ed33e8aa3b6b8964ec77bfd
The .schema is below.
I have a Python app (see URL above and .schema below) that puts a subset of the Gaia star catalog into two SQLite databases, light and heavy. The light DB has two tables. The first, gaiartree, is a 3D RTree table: RA (Right Ascension); Dec(lination); Mag(nitude); it also has an integer index [idoffset]. The second table, gaialight, in the light DB has associated data (RA, Dec, parallax, proper motion; photometry) indexed with the same [idoffset] column for JOINs*. The heavy DB has one table. gaiaheavy, with additional data (uncertainties of the values from the light DB) and again with [idoffset] for JOINs. Each table is around 100GB, which is why I separate out the heavy and light: not all users will need the second DB so there are less data for them to move around.
* I know I can put those in the RTree table with a leading +, but I need them to be double precision.
The code I wrote works in two steps:
- download compressed CSV data from the Gaia website, about 570GB, filter those data for max magnitude and a subset of the avaialable columns, and put them in pickle files; this took about 12h to complete a month ago.
- open the pickle files and ingest the data into the light and heavy SQLite3 DB files; this took about 7h to complete when I ran it a month ago; I limited the DB to stars with magnitudes of 18 or less (i.e. more bright), which is about a third of the 1.6Gstars+ in the full Gaia catalog.
Yesterday I (d'Oh!) finally noticed that the RTree table uses 32-bit floats, so I added two columns to gaialight to ensure that the full resolution of the data was available. See the second URL above for a diff of the code changes: there is code for the new columns around lines 750 and 800; I also changed the index name to [idoffset], because [offset] is a reserved word that worked, but that I did not need to use.
I deleted the DB files and started another ingest last night, expecting it to be done sometime this morning like last time. However, it is still running and looks like it will take a couple of days or more.
That's not the end of the world per se, but it did surprise me.
So the question is, is it likely that the code change (cf. the commit diff URL above) would cause a slowdown of 75% or more?
I am suspicious that something else has gone south in my development system e.g. disk or memory faults, or maybe summat in the OS (Linux Mint) is broken, but need to eliminate this as an issue. Of course I could revert to, and re-run, the previous code to see if it would go as fast as it did last time, but that would take several hours to see if the difference persists, and I'd prefer to have the current 200GB of DBs complete first.
A preliminary browsing of [dmesg] and /var/log/syslog did not show any obvious deterioration in hardware or software in the Linux system logs.
I don't know if it's related, but in the last week or so, all web browsers have become unusable on my development laptop; the MainThread and Web Content processes jump to 100-300%, and total over 400%, of CPU per the [top] utility whenever I start Chrome or Firefox. I would not think that is related but should mention it anyway.
-- Gaia Light DB, gaia.sqlite3 CREATE VIRTUAL TABLE gaiartree using rtree(idoffset,ralo,rahi,declo,dechi,lomag,himag) /* gaiartree(idoffset,ralo,rahi,declo,dechi,lomag,himag) */; CREATE TABLE IF NOT EXISTS "gaiartree_rowid"(rowid INTEGER PRIMARY KEY,nodeno); CREATE TABLE IF NOT EXISTS "gaiartree_node"(nodeno INTEGER PRIMARY KEY,data); CREATE TABLE IF NOT EXISTS "gaiartree_parent"(nodeno INTEGER PRIMARY KEY,parentnode); CREATE TABLE gaialight (idoffset INT PRIMARY KEY ,ra REAL NOT NULL <== New column ,dec REAL NOT NULL <== New column ,parallax REAL DEFAULT NULL ,pmra REAL DEFAULT NULL ,pmdec REAL DEFAULT NULL ,phot_g_mean_mag REAL NOT NULL ,phot_bp_mean_mag REAL DEFAULT NULL ,phot_rp_mean_mag REAL DEFAULT NULL ); -- Gaia Heavy DB, gaia_heavy.sqlite3 CREATE TABLE gaiaheavy (idoffset INT PRIMARY KEY ,source_id BIGINT NOT NULL ,ra_error REAL DEFAULT NULL ,dec_error REAL DEFAULT NULL ,parallax_error REAL DEFAULT NULL ,pmra_error REAL DEFAULT NULL ,pmdec_error REAL DEFAULT NULL ,ra_dec_corr REAL DEFAULT NULL ,ra_parallax_corr REAL DEFAULT NULL ,ra_pmra_corr REAL DEFAULT NULL ,ra_pmdec_corr REAL DEFAULT NULL ,dec_parallax_corr REAL DEFAULT NULL ,dec_pmra_corr REAL DEFAULT NULL ,dec_pmdec_corr REAL DEFAULT NULL ,parallax_pmra_corr REAL DEFAULT NULL ,parallax_pmdec_corr REAL DEFAULT NULL ,pmra_pmdec_corr REAL DEFAULT NULL );