Table and index in separate files

(1) By anonymous on 2021-02-19 14:42:08 [link]

Is it possible to have a table and index in separate files, like in the old dbase days...

I tried something like this with attach which doesn't work:

.open data.sqlite as datafile;
attach index.sqlite as indexfile;
create index indexfile.indexname on datafile.table(field);

Any other way or not at all possible?

CLI usage.

Philip

(2) By Ryan Smith (cuz) on 2021-02-19 14:49:07 in reply to 1 [link]

Both not possible and completely unnecessary.

What advantage could you possibly imagine to gain from it over the single file solution?

(4) By JayKreibich (jkreibich) on 2021-02-19 17:10:26 in reply to 2 [link]

Presumably, putting the files on different storage devices.

(5) By Keith Medcalf (kmedcalf) on 2021-02-19 18:13:45 in reply to 4 [link]

Putting the journal on a different storage device and channel (not a different partition on the same device) is useful.  

Separating the index and the data, even on different storage devices and channels, provides almost no utility.

The only reason to separate data/index/journal files to separate devices (on separate channels) is to permit overlapped physical I/O.  Logical separation is not particularly useful.

Bitty-boxes typically do no have I/O facilities which have separate communication channels to independent devices so there is no benefit to be achieved by any separation of journal, data, or index.

(6) By Larry Brasfield (larrybr) on 2021-02-19 18:39:05 in reply to 5

> Separating the index and the data, even on different storage devices and channels, provides almost no utility.

I was thinking there could be some utility, but it would require care.

If a fast, volatile device was claimed to be where an index is kept, it could be rebuilt upon opening a DB that claims to have such an index. [a] Then a speed advantage might be obtained relative to what page caching does already. This would require feature change to SQLite to be reliable.

[a. Alternatively, timestamping or hash-keeping could be used to verify that a persistent index still corresponds to the table it is supposed to index into. Messy to be sure. ]

(7) By Keith Medcalf (kmedcalf) on 2021-02-19 18:59:18 in reply to 6 [link]

> I was thinking there could be some utility, but it would require care.

No, there can be no utility.  

Index -vs- Index + Table access is inherently one-after-each (sequential).  First you access the index in order to find out what table row you want to access, and only after that access to the index is complete do you access the table row itself.

This means that no advantage can be gained by separating the channel access to the index & data -- the only thing that can be eliminated is the latency incurred between access to the index and access to the data -- a problem more efficiently and cost effectively satisfied by judicious use of caching (which you would need anyway even if the index and data were located on separate channels).

In fact, even separation of data (and index) from the journal onto separate channels provides the most advantage only when using traditional rollback journals and even then only when "rollback" is an infrequent occurrence.

Write-Ahead journals will gain no advantage from separate channels.  In fact, they gain the most utility from a single-channel with data-phase disconnect, which is currently the predominant storage implementation on todays bitty-boxes.

(9) By Larry Brasfield (larrybr) on 2021-02-19 19:22:31 in reply to 7 [link]

> > I was thinking there could be some utility, but it would require care.

> No, there can be no utility.

I was not so sure of that, which is why I said, "a speed advantage might be obtained relative to what page caching does already." And your case against this possibility is not convincing. More on this below.

> Index -vs- Index + Table access is inherently one-after-each (sequential). First you access the index in order to find out what table row you want to access, and only after that access to the index is complete do you access the table row itself.

I agree as to the access time ordering (and understood it going in.)

> This means that no advantage can be gained by separating the channel access to the index & data -- the only thing that can be eliminated is the latency incurred between access to the index and access to the data -- a problem more efficiently and cost effectively satisfied by judicious use of caching (which you would need anyway even if the index and data were located on separate channels).

Here, I believe you have overlooked an advantage that a whole-index-in-memory scheme would have over "judicious use of caching" or even a well written caching implementation. The difference would be in latency.

By explicitly declaring that a whole index should be in "fast" storage, instead of waiting for a caching scheme to get portions of that index into "fast" in a piecemeal fashion, the total latency hit may be reduced and it can also be moved in time to be prior to demand for the index. Both would be a win in many circumstances, assuming the memory/speed trade was a win.

Mind you, I do not claim it would always be a win, but I think it easily could be for spinning iron oxide storage.

(3) By Larry Brasfield (larrybr) on 2021-02-19 14:50:58 in reply to 1 [link]

The [CREATE INDEX syntax](https://sqlite.org/lang_createindex.html) provides no way to specify such creation. Your attempt failed to compile.

It is not possible and I submit that it never should be. If that was possible, the index would easily become out of date and lead to bad results when used.

In general, SQLite allows queries involving multiple schema (or databases), but does not support ostensibly permanent linkages such as you want(ed.)

As for the old DBase days: Did its authors ever add referential integrity enforcement to it? Before it entered the dustbin of history?

(8) By Kevin Youren (KevinYouren) on 2021-02-19 19:18:10 in reply to 3 [link]

Yes, checking and occasionally rebuilding indexes was common.

Moving from disks that spin around, and hence have rotational delay, to solid state drives is probably the "newest" innovation. I read an article in "Popular Mechanics", in 1967 or 68. 

My SQLITE apps are about 3 times faster on the removable devices I bought last year.

In 1990, for an IMS database, I suggested moving the indexes to separate disks, but my manager started shaking and trembling, so I dropped the suggestion.

(10) By anonymous on 2021-02-19 22:07:44 in reply to 1 [link]

It might be possible by the use of a custom VFS, although the VFS code would have to figure out which pages the index uses and which pages the table uses (by parsing and traversing all of the links between pages), and there would be other problems as mentioned in this thread.

(11.1) By Keith Medcalf (kmedcalf) on 2021-02-20 02:49:18 edited from 11.0 in reply to 10 [link]

Well no.  The index b-tree has a different root page than the table b-tree.  They are not intermixed.  A page is EITHER an index page OR a table page.  Not a mishmash of both.

Also, in a relational database the pointers from one "thingimy" to another "thingimy" (whether that thingimy is a table or an index) is based on duplication of data, only the duplication of data, and nothing but the duplication of the data.  

Only more efficient CODASYL style databases store "pointers" between BDAM records.

While CODASYL style storage of record pointers is far more efficient and far more useful (and performant) than storing duplicated data, CODASYL style databases are pretty much unrecoverable if corrupted.

Data duplication defining relationships (rather than pointers) is the whole point of a relational database and the reason that it was invented.

You can emulate each format in the other.

(12) By anonymous on 2021-02-20 04:55:38 in reply to 11.1 [link]

We have a fairly widely distributed system with parts postrgresql and parts dbase.
Parts of dbase are updated in remote areas, some remain static for long periods, and some are updated from other locations.

The nice thing about some of the dbase parts are the separate index files which can contain all sorts of formulae to create sort of a pre-populated view, and which can be used stand alone, ie. like a index only 
database/table. This is very fast to change and distribute without remote areas/systems having to do re-indexing or other processing.

(I have a vfs with access to some of the dbase functionality.)

Thanks All!!
Philip.

(13) By anonymous on 2021-02-20 10:23:15 in reply to 12 [link]

Came accross

https://www.sqlite.org/withoutrowid.html

which I thinnk is a way SQLite can closely similate the separate db/index setup:

I'll create a SQLite database containing all `WITHOUT ROWID` tables representing the dbase calculated index entries:

For example:

CREATE TABLE IF NOT EXISTS Index01(
  DBaseFieldToIndex TEXT PRIMARY KEY,
  DBaseRecordNr INTEGER
) WITHOUT ROWID;
CREATE TABLE IF NOT EXISTS Index02(
  DBaseCalculatedFieldToIndex TEXT PRIMARY KEY,
  DBaseRecordNr INTEGER
) WITHOUT ROWID;

And populate them by selecting needed subsets from dbase files

I can actually use it to separate SQLite database and index files also!!

Philip
z

(14) By Ryan Smith (cuz) on 2021-02-20 12:31:09 in reply to 13 [link]

Well I was going to suggest exactly this after your original post (assuming you are the same anonymous), but then convinced myself that this could not be what you intended.

This scheme will work just fine if all you want is an external reference to be used like an index for rows in another table.

The caveats to this scheme is important to note (should you still want to do this):

1. It will be a LOT (orders of magnitude) slower than a normal index because it isn't an Index, it is an external indexed lookup which requires subsequent re-lookup into the real index (the rowid of the data table).
2. It becomes outdated the moment any CRUD operation happens on the data table and then has to be rebuilt (and perhaps re-dstributed)
3. Even if re-making and distributing the new index is fast and efficient, there will always be a time-frame within the index and data will be mismatched. If a lookup is done during this phase, a client system can request data for say "John Smith" and end up getting the record for "Joan Smythe" which requires internal revalidation of obtained data, and even then, if the client can "know" it is not the right data it received, what is it to do in that case? Just wait x amount of time? Report non-function to the user?

Perhaps you could "version-control" your index files.
Either way, you will need to add so much work in checking mechanisms around all this to ensure system integrity. Is it really worth the saving of not having the index included in the data?

(15) By anonymous on 2021-02-20 13:16:47 in reply to 14 [link]

Same anonymous

What I am after for now is just to be able to send some users an `update` to the `view` for some of the databases via text message, which can now be done easily with the SQLite CLI interface.

The only problem encountered so far is that it can only be created with unique keys and there are a few dbase fields where the result will contain duplicates.

Other than that it looks like it is going to be perfect.

At the moment I don't have concerns regarding speed, just result set/view and ease of updating.

Thanks for the much appreciated words of wisdom!

Philip

(16) By anonymous on 2021-02-21 05:38:02 in reply to 11.1 [link]

> Well no. The index b-tree has a different root page than the table b-tree. They are not intermixed. A page is EITHER an index page OR a table page. Not a mishmash of both.

Yes, I know that. A custom VFS would have to keep track of which pages are which, put them into separate files, and then read back those files and reassign them the correct page number in the combined database. That isn't very useful of course, since if you update the database and then use the old version of one file and the new version of the other one, the pages will get mixed up and it won't work; also it would have be complicated to implement since it would have to parse the database file format.

There is no reason that this would be useful, I think. Even if you managed to implement this, the separate table and index files could not be used individually, anyways. And it would be too less efficient than the normal way, anyways.

(17) By tom (younique) on 2021-02-21 12:02:29 in reply to 1 [link]

I have never understood why one might want to have separate files for that. In fact, I still hate Shapefiles for being a collection of about ~10 different files which is horror to me when having to copy them.

Using a single file for all is one of the most lovely things of SQLite!👍

(18) By Gunter Hick (gunter_hick) on 2021-02-22 10:42:48 in reply to 1 [link]

If you insist on having data and index in separate files, then you need to implement a virtual table module that uses this kínd of storage model. Faircom CTree comes to mind - but that would require a separate file for each table too.