Does row header metadata leak INSERT order ?

(1) By PyHead (PyHead979) on 2021-01-13 16:35:25 [link] [source]

Hello,

I have a simple table made of 2 columns :

a random unique INT rowid
a text value

Let's say I INSERT the following 2 rows in order (first then second):

rowid=21 ; text=abc
rowid=11 ; text=def

My goal is to make sure that there is no way (even with full access) to know the original INSERT order of the rows.

I guess I must set the pragma "journal_mode = OFF" but is it enough ?

Is there any metadata that can leak the information, like a transaction id, an insert time, ... ? I tried reading the specs and I did not find any obvious reference to such thing for the rows headers, but I am not an expert.

Should I prefer a specific hardware storage : HDD, SSD, in memory ? Maybe HDD should be avoided due to the risk of sectors positions leak ?

Thank you very much in advance for your insight.

(2) By Richard Hipp (drh) on 2021-01-13 16:47:39 in reply to 1 [link] [source]

VACUUM should erase any traces that might be used to help guess the original insertion order.

If you are in WAL mode and are hyper-paranoid, then you might also consider doing:

PRAGMA journal_mode=DELETE;
PRAGMA journal_mode=WAL;

In other words, cycle the write-ahead log to ensure that it also is free from any traces.

(3) By PyHead (PyHead979) on 2021-01-13 17:04:01 in reply to 2 [link] [source]

VACUUM will probably erase the traces indeed, but if I am not mistaken it requires a lock and could take some time on a big DB so this is a command I should avoid spamming too much.

Thus my question : is there really any "trace" to worry about to begin with ?

(4) By Richard Hipp (drh) on 2021-01-13 17:12:43 in reply to 3 [link] [source]

Yes there are. There are no definitive traces - in other words, you cannot prove a particular insertion order based on meta data, but you can make an informed guess. That is leaking information, technically.

Run the experiment. Create two databases with the same schema and content, but insert the rows in different orders. Do a binary diff of the two files and see that they are not byte-for-byte identical.

(5) By PyHead (PyHead979) on 2021-01-13 17:46:09 in reply to 4 [link] [source]

Thank you very much, I think I am getting closer to the information I am looking for.

So, if I understand correctly :

there are no definitive row traces in SQLite, contrary to some other DB engines that store an incrementing transaction id per row for example
but it can still (more or less) leak through the DB data position structure

My conclusion is that :

in case of a data breach (ie : SQL injection), there is no risk involved
the only problem would be a hack involving full root or physical access to analyse the DB storage structure

Am I correct ?

(6) By Richard Hipp (drh) on 2021-01-13 18:10:54 in reply to 5 [source]

I think so, yes. There are some corner cases to watch. For example, if a hacker can use SQL injection to gain access to the sqlite_dbpage virtual table, then that is basically the same as getting access to the raw SQL file. Apart from that, I don't think there are any traces about insertion order that are accessible via SQL.

The above assumes that you are setting the rowid manually, perhaps by having an INTEGER PRIMARY KEY column that you initialize yourself, rather than letting SQLite fill in automatically. Obviously if you let SQLite fill in the rowid itself, those rowids will be sequential.

(7) By Simon Slavin (slavin) on 2021-01-14 12:20:51 in reply to 5 [link] [source]

Worth making the distinction between SQLite API access and disk access here.

VACUUM prevents information from leaking out if you have access to the drive the database is stored on. It rearranges things on disk. But if your hacker has access to the drive the database is stored on, you have bigger problems than the SQLite API, because the file format for SQLite is published and they can attack it many. other ways. One can assume anyone who actually cares about this is using

PRAGMA secure_delete = ON

or setting SQLITE_SECURE_DELETE for compilation.

If your attacker is restricted to using SQLite API calls already in use by the program they're hacking, they have a harder job. They can check autoincrement integer keys where your program uses them. They may be able to tell what data was entered together, by doing fussy timing checks to see whether getting one piece of data into the cache also got another piece fo data into a cache. They can exploit any injection vulnerabilities a bad programmer provided. But there's really not much they can do because SQLite (excellently) lacks ways to look at and overwrite raw data. My fears about this revolve around badly written extensions, Virtual File Systems, external functions that use system calls, etc..

(8) By Gunter Hick (gunter_hick) on 2021-01-14 13:52:26 in reply to 1 [link] [source]

AFAIK the BTree structure keeps the records themselves in sorted order in the leaf nodes, whereas internal nodes contain a sorted list of child nodes and their respective maximum key values. BTrees also keep track of sorted via unsorted insert order and adapt the node split when it fills up, and will also attempt to keep the tree balanced. This is enough to distinguish between sorted (nodes mostly more then 50% full) and unsorted (nodes mostly less than 50% full) loads.

I expect VACCUM to make the BTree look like a sorted load, thus disguising any clues hidden in the "shape" of the tree.

Of course this works only if your actual data does not contain any hints of a time order too.