SQLite Forum

General question concerning database structure and number of databases to use.

(1) By Gary (1codedebugger) on 2020-11-06 02:51:44 [link] [source]

I apologize for asking another rather novice question.

I'm attempting to move an indexedDB web extension database structure to a native C application compiled with SQLite3. In the extension, users build portfolios of study modules. In indexedDB, each portfolio was a separate database of four object stores, because it was considered poor practice to perform a database version change each time a user added a new portfolio requiring another set of object stores. Data for all modules within a portfolio was stored in each of the four object stores, using a three-element array as a primary key.

In SQLite3, it appears that adding new tables is rather simple. So, my question is what should be considered in determining whether to create a separate database for each portfolio or one database for all possible portfolios, and whether to create a set of four tables for each module or include all modules in one set of four tables?

I'd appreciate any guidance you could provide. Below are some characteristics of the database needs of this extension tool. I'm assuming it is a rather simple little tool from the perspective of its database needs at least.

Thank you for your assistance.

  1. All the portfolios a user could ever build, plus any media files they choose to include, would never exceed a small fraction of the maximum SQLite3 database size.

  2. Users could choose to open multiple portfolios with multiple modules open within each portfolio, either in separate browser tabs or within the same tab. Only one user can ever have a database open because it all runs on the local machine like a desktop application.

  3. There will be frequent database write requests (to capture the current state), almost all of which will involve very small amounts of data in a single row of each of two tables, or the readonly selection of ten to thirty rows from each of two tables in a single request. The only potential for operations involving larger amounts of data is when users store images, PDFs, audio, and/or video files to include in their modules and request the data to display a small number of them at any one time.

  4. Lastly, I'd like it to be easy for users to share the portfolios they've built with others, and to incorporate those portfolios into the user's database or set of databases.

(2) By Gunter Hick (gunter_hick) on 2020-11-06 07:02:55 in reply to 1 [source]

Using database or table names as data is generally a bad idea, unless there are compelling reasons to do so.

One of the reasons is that the table name must already be known at prepare time, and you will waste a lot of time building nearly identical SQL queries with just the table name changing, instead of preparing just one statement and binding the portfolio name.

Another is that preparing a statement includes searching for the definitions of the tables and that takes longer for a large number of tables.
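
For illustration, a minimal sketch of binding the portfolio name instead of baking it into the SQL (assuming a single hypothetical "modules" table with a "portfolio" column):

#include <sqlite3.h>

/* One prepared statement, reused for every portfolio, instead of
   building a "SELECT ... FROM <portfolio_table>" string per portfolio. */
int count_modules(sqlite3 *db, const char *portfolio, int *out_count)
{
  sqlite3_stmt *stmt;
  int rc = sqlite3_prepare_v2(db,
      "SELECT count(*) FROM modules WHERE portfolio = ?1",
      -1, &stmt, 0);
  if (rc != SQLITE_OK) return rc;

  sqlite3_bind_text(stmt, 1, portfolio, -1, SQLITE_TRANSIENT);
  if (sqlite3_step(stmt) == SQLITE_ROW)
    *out_count = sqlite3_column_int(stmt, 0);

  return sqlite3_finalize(stmt);
}

(In a long-running application the statement could be prepared once and reused with sqlite3_reset() and fresh bindings, so the prepare cost is paid only once.)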

(15) By Gary (1codedebugger) on 2020-11-08 02:12:37 in reply to 2 [link] [source]

Thank you for pointing this out. Even if this were the only argument against adding a new set of tables for each portfolio, it is enough in itself.

(3) By David Butler (gdavidbutler) on 2020-11-06 16:10:02 in reply to 1 [link] [source]

Yours is not a novice question. SQLite is my C programming secret weapon. But knowing when to use it is key.

Yes, it is easy to add tables, but there is a cost. A schema is not compiled; it has to be parsed. For an extreme example, I have a schema for TR-181 (machine generated from the XML schema):

sqlite3 tr-181.db
sqlite> select type,count(*) from sqlite_schema group by type;
type|count(*)
index|2644
table|1637
trigger|10993

It is 41,230,336 bytes on disk with no data loaded. It takes a moment to open it. But the advantage of a completely contained representation of a single device is worth every cost.

Points 1, 2, and 3 fit perfectly with SQLite.

Point 4 leads to putting things in separate databases and using ATTACH. This, too, fits perfectly with SQLite.
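
For instance, importing a portfolio someone shared as a standalone database file could look roughly like this (only a sketch; the file, table, and column names are hypothetical, and a real import would need to handle key collisions):

#include <sqlite3.h>

/* Attach the shared file, copy its rows into the user's main database,
   then detach. Assumes both databases use the same schema. */
int import_portfolio(sqlite3 *db, char **errmsg)
{
  return sqlite3_exec(db,
      "ATTACH DATABASE 'shared_portfolio.db' AS src;"
      "INSERT INTO main.modules SELECT * FROM src.modules;"
      "DETACH DATABASE src;",
      0, 0, errmsg);
}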

I think you will find SQLite a good fit.

(5) By ddevienne on 2020-11-07 01:04:44 in reply to 3 [link] [source]

[...] a schema for TR-181 (machine generated from the XML schema):

Any chance you'd share details on how it was machine generated?
This is an area I'm interested in.

(11) By David Butler (gdavidbutler) on 2020-11-07 22:21:01 in reply to 5 [link] [source]

I use a trivial XML callback parser to capture the structure and details of the data model and spit out DDL. It is rather simple to do. In the case of TR-181, it is hard because of actions that must occur on DML (which is why the number of triggers is so large). I would not recommend something as hard as CWMP for your first project!

The idea is to spit out "CREATE TABLE <XML element name>(" when an XML element starts, then ",<XML sub-element name> <type>" for each sub-element that's a column, then a closing ");" when the first element closes. Depending on other attributes / sub-elements, you can add CHECK constraints, etc. Again, it's trivial, until it's not.
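
To illustrate the shape of such a generator (only a sketch using expat, not necessarily how the author's tool works; a real data model needs type mapping, keys, CHECK constraints, and so on):

#include <stdio.h>
#include <expat.h>

/* Emit "CREATE TABLE <element>(" when the outer element opens, one
   column per first-level sub-element, and ");" when it closes. */
struct gen { int depth; int cols; };

static void XMLCALL start_el(void *ud, const XML_Char *name, const XML_Char **atts)
{
  struct gen *g = ud;
  (void)atts;
  if (g->depth == 0) {
    printf("CREATE TABLE %s(", name);
    g->cols = 0;
  } else if (g->depth == 1) {
    printf("%s%s TEXT", g->cols++ ? "," : "", name);
  }
  g->depth++;
}

static void XMLCALL end_el(void *ud, const XML_Char *name)
{
  struct gen *g = ud;
  (void)name;
  if (--g->depth == 0) printf(");\n");
}

int main(void)
{
  static const char xml[] = "<portfolio><name/><created/></portfolio>";
  struct gen g = {0, 0};
  XML_Parser p = XML_ParserCreate(NULL);
  XML_SetUserData(p, &g);
  XML_SetElementHandler(p, start_el, end_el);
  XML_Parse(p, xml, (int)sizeof xml - 1, 1);  /* prints: CREATE TABLE portfolio(name TEXT,created TEXT); */
  XML_ParserFree(p);
  return 0;
}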

(6) By Gary (1codedebugger) on 2020-11-07 01:21:14 in reply to 3 [link] [source]

Thank you for the information about the cost of adding tables, the example of the large schema, the links, and confirming SQLite is a good fit considering the four points. The information was very helpful.

(4) By Lifepillar (lifepillar) on 2020-11-06 20:09:43 in reply to 1 [link] [source]

I'd appreciate any guidance you could provide.

Review the staples of Relational database technology. In particular, read about data integration and the distinction between schema and instances.

Data integration

Just consider this: usually, DBMSs do not allow you to perform queries across databases. So, if you have one database per portfolio, you cannot (easily) perform queries across all portfolios. Besides, with different databases you will most likely end up storing duplicate and/or inconsistent data. There are other reasons to prefer integrated data, but for “a rather simple tool” these two should suffice.

Schema vs instance

A database schema is not just a bunch of “tables”: it describes a language (in a very formal sense) with which you describe facts (by means of tuples in relations or, informally, as “records” in “tables”). As a rule (with few exceptions), adding a new relation schema (i.e., a “table”, in common parlance) is justified only if you need to describe a new, distinct fact type. Concretely, when you need to add data about, say, a new portfolio, you shouldn't have to create a new table, but should instead add data to one or more existing, properly designed tables that accurately capture all the relevant fact types about “portfolios”. Of course, that requires determining such fact types.

So, your problem essentially amounts to how to properly model your requirements as a (one) well-designed database. There is currently (only) one (ISO/IEC/IEEE) standard for Relational database design, which is ISO 31320-2 (note: only the “key-style” modeling described in the standard is relevant to Relational database design). Unfortunately, the ISO document is not freely available, but the relevant parts are nearly identical to the former US government's Federal Information Processing Standard (FIPS) PUB 184, which can be found here. The latter also contains a detailed description of a sound methodology for database design. Even if you decide not to adopt IDEF1X as your modelling methodology, I'd recommend that you read at least Annex A of the FIPS document, which clearly describes how to approach database design.

Lastly, I'd like it to be easy for users to share the portfolios they've built with others, and to incorporate those portfolios into the user's database

I assume, from your description, that each user has their own SQLite-backed app, and that you want users to share data between their respective databases. If that is the case, when your database is well-designed—in particular, if you have devised good relational keys—that becomes (almost) trivial.

(7) By Gary (1codedebugger) on 2020-11-07 02:08:52 in reply to 4 [link] [source]

Thank you for the explanations and links. It appears that my thinking was rather backward, for I thought a larger number of tables containing fewer records would be more efficient in some manner. The concept of a relation schema as opposed to a table being a generic empty container to hold records is very helpful. In that respect, I don't need four tables to hold the data across all portfolios. Since nearly all the data is being generated by the user from within the web extension, all the tables consist only of some type of compound key (perhaps multiple columns, indicating portfolio, module, tab, page, component) and a JSON string. The only exception is the media data types as blobs. If it's better to keep the blobs separate, then two tables would be sufficient; otherwise, one table would be.

Although the JSON strings contain different collections of information, there is only one distinct fact type and no relationships as long as the strings are never parsed in C. I will never need to perform a join in this particular application; only inserts, replaces, and selections based on the compound keys.

If the JSON strings were parsed into separate columns, then there would be different facts and relationships, but, in this simple case, it just makes more work for the application to stringify in JavaScript, parse in C in order to write to the SQLite database (and almost never use the data in C), and then retrieve and stringify it in C to send back to JavaScript to parse again.

I have an upcoming project in which the scenario is more complex (for my skill level anyway) and the information you provided corrects my view of a number of items that will be helpful to me. Thanks again.

(8) By Simon Slavin (slavin) on 2020-11-07 10:45:45 in reply to 7 [link] [source]

Relating to storing BLOBs …

Given the data structure you describe above, if it would be natural to make a 'mediaData' column in the existing single table you describe then you should do it, and can do it efficiently. Because of the way SQLite stores and retrieves data, the 'mediaData' column should be the last (right-most) column defined for the table.

Here's why: SQLite keeps all the data for a row together in the file: column1, column2, column3, etc. Each of these columns can have a different length for each row of data. When SQLite needs to read some data from a row, it reads from the first column of the row to the last column the SQL command needs. In other words, if you had a six-column table and did

SELECT column3 FROM myTable WHERE column1=200

SQLite would have to read column2, because it needs to know where it ends to find column3. Once it has read column3, SQLite knows it has everything it needs and doesn't bother reading c4, c5 or c6. By putting the column holding BLOBs at the end of the sequence you arrange that SQLite will never read it (or reserve memory for it) unless the SQL command needs it.
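
As a sketch (with hypothetical names), that suggestion amounts to declaring the table with the BLOB column last:

#include <sqlite3.h>

/* Keep the BLOB column right-most so queries that never mention it,
   such as SELECT stateJson FROM component WHERE page = ?1,
   stop reading the row before they reach the media data. */
int create_component_table(sqlite3 *db, char **errmsg)
{
  return sqlite3_exec(db,
      "CREATE TABLE component("
      "  portfolio TEXT,"
      "  page      TEXT,"
      "  stateJson TEXT,"
      "  mediaData BLOB"      /* last on purpose */
      ");",
      0, 0, errmsg);
}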

(The above explanation is simplified for clarity; e.g., it ignores covering indexes.)

(9) By Keith Medcalf (kmedcalf) on 2020-11-07 16:07:52 in reply to 8 [link] [source]

Except, of course, that if you update the row, you have to re-write the blob.

When any part of the row is changed the WHOLE row is re-written. This means that if you put a 4 GB blob as the last column of a table with 4 other data fields before it, whenever you update one of the other 4 fields you will re-write the entire record INCLUDING the 4 GB blob which you will have to both read into memory and write back out to disk, perhaps multiple times.

This is one of the reasons that large blobs (and text fields, which are the same thing, just with overloaded meaning) should be stored in their own table.
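
A sketch of that separation (hypothetical names): keep the frequently updated small fields in one table and give each large blob its own row in a side table, so updating the small fields never rewrites the blob.

#include <sqlite3.h>

int create_split_schema(sqlite3 *db, char **errmsg)
{
  return sqlite3_exec(db,
      "CREATE TABLE component("
      "  id        INTEGER PRIMARY KEY,"
      "  stateJson TEXT"
      ");"
      "CREATE TABLE component_media("
      "  component INTEGER REFERENCES component(id),"
      "  mediaData BLOB"
      ");",
      0, 0, errmsg);
}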

(13) By Gary (1codedebugger) on 2020-11-08 01:48:32 in reply to 8 [link] [source]

Thank you very much; I was unaware of how that worked in SQLite. Even if it may cause additional work for table rows that will be updated, I assume it would be beneficial for tables with static rows containing reference information.

(14) By Gary (1codedebugger) on 2020-11-08 01:50:54 in reply to 13 [link] [source]

Thank you for pointing this out. For my particular use case, it would be quite a problem to place the blob on the same row with the rest of the columns because of the high frequency of user-initiated updates.

(16) By Simon Slavin (slavin) on 2020-11-08 02:29:08 in reply to 14 [link] [source]

I think you have figured out your answer to this part of the problem. If you frequently update a data row without changing the BLOB, then it makes sense to store the BLOBs in a different table.

(17) By tom (younique) on 2020-11-08 10:15:57 in reply to 8 [link] [source]

SQLite would have to read column2, because it needs to know where it ends to find column3. Once it has read column3, SQLite knows it has everything it needs and doesn't bother reading c4, c5 or c6. By putting the column holding BLOBs at the end of the sequence you arrange that SQLite will never read it (or reserve memory for it) unless the SQL command needs it.

Sorry, but I do not understand that. There has to be some length information for the BLOB, because otherwise one couldn't know where it ends. So all SQLite has to do is skip (not read!) n bytes. That shouldn't be any slower than skipping a floating-point number or an integer in column2.

For writing, it should also make no difference whether a 1 MB BLOB is written at the beginning or at the end. The whole row should always be the same size regardless of the order of columns.

(18) By Kees Nuyt (knu) on 2020-11-08 17:48:13 in reply to 17 [link] [source]

The point with BLOBs is, they can easily be larger than a database page, so overflow pages get involved. They have to be read from disk if not cached.

If the BLOB itself is not referenced by the query, reading those overflow pages can be avoided by placing all short columns at the beginning of the row.

-- 
Regards,
Kees Nuyt

(19) By Keith Medcalf (kmedcalf) on 2020-11-08 21:10:31 in reply to 18 [link] [source]

Mutatis mutandis, the same applies to TEXT fields, which can also be larger than a page.

There is no difference between CLOB (TEXT) and BLOB other than the overloading of type information, and the same issues apply to both (or to any record that is longer than the payload size of a page).

It does not matter whether the field is a 2 GB movie or a 2 GB novel written in ASCII prose. What matters is the size.

(20) By Gary (1codedebugger) on 2020-11-09 04:01:12 in reply to 19 [link] [source]

If all text fields were limited to the page size (by nature of the application itself), or they were not limited but at least 95% were expected to be less than the page size, would there be any reason to hold the text data in a separate table from BLOB data that was expected to always exceed the page size?

I don't mean with respect to how columns are arranged or the performing of updates, but only in the case of having a two-column table of key and text, and key and BLOB. Is it bad practice to have tables that mix rows of small size with rows of large size when it is known in advance which rows will be in each size category?

In my particular case, there is no other reason that I know of at this point for placing the user input text and BLOB data in different tables other than a considerable difference in size.

Thank you.

(21) By Keith Medcalf (kmedcalf) on 2020-11-09 04:37:40 in reply to 20 [link] [source]

There is no point in moving "small fields" from the "main" record object to a "secondary" record object if the entire record will fit as payload on one page (and it is not part of the data normalization process).

The only time a difference comes into play is if the "entire record" will not fit on a page and must consume an overflow page (particularly more than one overflow page), because then access to any field after the overflow requires retrieving each overflow page in turn, following the pointer in each to the next, until you have finally found the "column" you are looking for.

Similarly, an "entire record" which cannot fit on a single page will require the I/O of the page on which the initial fragment is located plus all of the other pages comprising the overflow chain when that record is "re-written" (as in INSERT/REPLACE/UPDATE/DELETE).

The reason that this is usually mentioned for BLOBs is that the very name suggests largeness (Binary Large OBject). But it also applies to CLOBs (Character Large OBjects) -- otherwise called TEXT fields. It is just that the common connotation of TEXT is that it is of piddling size, not of "Large" size.

If there were a MPREAL (MultiPrecision REAL) that could consume more than one page of space, then it would have the same considerations.

In other words, consideration applies to the size of the record and not the type of the data. Putting 1 million REALs in a single row would have exactly the same considerations because that would require 8 MEGABYTES of page payload to be I/O'd on every INSERT/REPLACE/UPDATE/DELETE and multiple pages to be read via the overflow chain until the requested data in the record is located.

(22) By Gary (1codedebugger) on 2020-11-12 06:18:18 in reply to 21 [link] [source]

Thank you. I didn't ask the question correctly nor provide sufficient information. I wasn't trying to ask about separating data on the same row into two tables but whether or not CLOBs and BLOBs of different sizes should be stored in the same column of the same table.

There are two categories of data that need to be stored for the application I'm asking about. Users build small documents inside the application by adding UI elements and then inputting text or linking media files (audio, video, PDFs) within those elements.

The position, sequence order, size, and everything else concerning each document element except the actual text or media file itself is stored within a map object in a different table. Each map object can have any number of text blocks and media files associated with it. They are stored in another table and have the primary key of the map object as a column. That table has only three columns: the auto-generated rowid, the primary key of the associated map object, and the text or media data. The only column in that table that will ever be updated/replaced is the text/media column.

My question is: are two tables required for the text and media data, that is, one for CLOBs and one for BLOBs? I ask for two reasons.

One is that, due to the nature/structure/purpose of these small documents, at least 95% of the time each CLOB will never exceed a small paragraph of text and will always be much smaller than any of the BLOB media files. I think the text rows will almost always be less than the default page size and the BLOB rows will always exceed it. The other is that I wasn't sure whether it is safe practice to declare a column as a BLOB and frequently write text to it, relying on the column affinity rules to accept text in the BLOB column.

There will be two general database tasks associated with this portion of the data. One is to retrieve a map object and all CLOBs and BLOBs associated with that map's primary key, and pass them to the application to construct the requested document. The other is to either overwrite a CLOB or add/delete a BLOB record as a user builds/edits a document.

Thank you.

(23) By Keith Medcalf (kmedcalf) on 2020-11-12 22:22:44 in reply to 22 [link] [source]

There is no difference between a BLOB and TEXT provided that the interfaces used to access the text field match the database encoding. That is, if the default encoding is UTF-8 and the APIs you use to access the field are the _text variants (which expect to input and output 8-bit characters), OR the database encoding is UTF-16 and the APIs used to access the data are only ever the _text16 variants, then there will be no "diddling around" with the data you have provided when it goes "into" or "out of" the database record.

The only difference between a BLOB and TEXT is that TEXT is assumed to contain text, meaning that it can be converted between the various API and internal encoding formats (UTF-8, UTF-16BE, and UTF-16LE), depending on which combination of TEXT APIs is used, what the default database encoding is set as, and the particular platform. Functions for dealing with "TEXT" also assume that a 0 codepoint terminates the text (since this is part of the innate definition of what comprises a text string).

Declaring something as "TEXT" when it is not well-formed text may result in various explosions, meltdowns, and proper behaviours which the user feels are improper, because they failed to comply with the requirement that TEXT be properly formed and therefore shot themselves in the foot.

There is nothing which prevents you from storing TEXT in a BLOB field. How your application "interprets" the bytes is entirely a matter for your application.

The attribute TEXT is only important if you want SQLite3 to treat the data as TEXT (for the purposes of diddling about with it and interpreting its contents) in which case it must be "properly formed" text.
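
As a sketch of what that means in practice (hypothetical table "doc_data(map_id, content)"): the same column can hold either a UTF-8 text block or an opaque media blob, and it is the bind call that decides how SQLite tags the value.

#include <sqlite3.h>

int store_content(sqlite3 *db, sqlite3_int64 map_id,
                  const void *data, int nbytes, int is_text)
{
  sqlite3_stmt *stmt;
  int rc = sqlite3_prepare_v2(db,
      "INSERT INTO doc_data(map_id, content) VALUES(?1, ?2)",
      -1, &stmt, 0);
  if (rc != SQLITE_OK) return rc;

  sqlite3_bind_int64(stmt, 1, map_id);
  if (is_text)   /* stored with TEXT type; no conversion if the db encoding is UTF-8 */
    sqlite3_bind_text(stmt, 2, data, nbytes, SQLITE_TRANSIENT);
  else           /* stored as a BLOB; the bytes pass through untouched */
    sqlite3_bind_blob(stmt, 2, data, nbytes, SQLITE_TRANSIENT);

  sqlite3_step(stmt);
  return sqlite3_finalize(stmt);
}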

(10) By Lifepillar (lifepillar) on 2020-11-07 18:26:54 in reply to 7 [link] [source]

It appears that my thinking was rather backward, for I thought a larger number of tables containing fewer records would be more efficient in some manner.

If I may, your thinking is backward in another sense: you are caring about physical details before having resolved all the logical issues. What's more important: that your users' data is correct at all times, or that they get garbage, but fast, responses? “Premature optimization is the root of all evil,” they say, and that applies to databases more than to anything else. My recommendation is: think about your data logically, designing as many “tables” (relation schemas) as needed, no more, no less; then trust SQLite to do its job egregiously. If at some point it fails you, only then investigate why and take corrective actions. You might be surprised at the end to discover how many of your assumptions turned out to be wrong. You might not need to take any corrective action, after all.

The concept of a relation schema as opposed to a table being a generic empty container to hold records is very helpful.

The concept is not only helpful, it is essential to the Relational model (albeit all too often ignored). Tables (more formally, relation schemas) are far from “generic containers” holding a bunch of “records”. They carry meaning: a precise semantics decided by the database designer. A relation schema corresponds to a predicate, which you must define and document; and each tuple corresponds to a fact, whose meaning is given by the predicate associated with the relation schema. I have seen many databases whose data nobody could understand, or about which people held conflicting opinions (!), such as event tables whose timestamps were meaningless because three different applications had three different notions of what an "event" was. That happens precisely because a database is treated as a closet (the engineering term for that being “persistence layer”). Again, treat your database schema as a language instead, and design a proper one for your application domain. Then your queries will be simple, easy to understand, and well performing.

In that respect, I don't need four tables to hold the data across all portfolios.

Well, that depends. You may need fewer, you may need more. From what I can infer about your application (you haven't provided many details), your database schema might be considerably more complex than one or two or four tables.

If the JSON strings were parsed into separate columns, then there would be different facts and relationships,

No. Again, you are thinking in physical terms (the JSON data format). Think in logical terms: what is a portfolio? How do you identify a portfolio and how do you distinguish it from another portfolio? How do you describe it (what are the relevant attributes)? Etc. The database descends from these kinds of considerations, not from the byte format of your data source.

but, in this simple case, it just makes more work for the application to stringify in JavaScript, parse in C in order to write to the SQLite database (and almost never use the data in C), and then retrieve and stringify it in C to send back to JavaScript to parse again.

You may certainly store some data as JSON blobs (SQLite has some support for JSON after all), but consider such a choice carefully. Is the content of such blobs something you won't ever need to query in more or less complex ways? You may corner yourself into a situation where you have to write ad hoc code (in the worst case, at the application level) to extract information from JSON fields, when a simple SQL query would have done the job, had the data been properly organized.
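
For what it's worth, SQLite's JSON functions (when built in) can reach inside such blobs from SQL; a sketch, with a hypothetical maps(data) table:

#include <stdio.h>
#include <sqlite3.h>

static int print_row(void *ud, int ncol, char **val, char **col)
{
  (void)ud; (void)col;
  if (ncol > 0) printf("%s\n", val[0] ? val[0] : "NULL");
  return 0;
}

int print_titles(sqlite3 *db, char **errmsg)
{
  /* Requires the JSON1 functions to be available in the build. */
  return sqlite3_exec(db,
      "SELECT json_extract(data, '$.title') FROM maps;",
      print_row, 0, errmsg);
}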

(12) By Gary (1codedebugger) on 2020-11-08 01:44:34 in reply to 10 [link] [source]

Thank you for taking the time to write this response. I greatly appreciate it.

I think I have the data set up in a fairly logical manner at this point and have been focusing on efficiency, but I have much to consider after reading your advice and warnings. The application works now as a web extension utilizing indexedDB for all the data apart from the media files. I was surprised by how quickly the database requests are processed. The UI is never updated until it is known that the database transaction has succeeded; I was concerned that this would make for a clunky user experience, but it has been very quick.

To build their study modules, users add pages to tabs (custom tabs in a single web page, not browser tabs, and only one page is loaded at any time in a tab), and add UI components to their pages.

Most of the data fields are stored in mapping objects that include items such as sequence arrays, undo-chain "pointers", state data, and "pointers" to other data elements to restore the pages and components. Portfolios, modules, tabs, pages, and some page components each have a mapping object.

Mapping objects comprise the greatest number of data elements but comprise the smallest amount of data overall. Apart from adding and organizing pages and components which are captured in the mapping objects, the only data users really input are text and media files.

Everything that needs to be known about the text and media files is held in the mapping objects, apart from the actual text strings and media blobs themselves. Each media blob is a separate row, and text strings are stored in sets by component category, one set per row.

In indexedDB, I had object stores set up by tab, and text rows mixed in with mapping-object rows. That now seems to be a poor approach; and, perhaps, mapping objects should be a table, text data another, and media blobs a third.

IndexedDB returns sets of data ordered by primary key, so sorts are performed in the JavaScript to order the data by sequence array. Perhaps that can now be handled by the SQL queries directly, such that ordered data can be passed to the JS. However, most operations on the data are easily performed in JS when they act on a single page within a tab, such as validating the user input on a page, because all data for the currently loaded page is held in RAM.

However, to validate all pages in a tab or an entire module through one user click required opening an indexedDB cursor, stepping through each row of data in turn, and validating the relevant values. I think this would be the only time it would be easier to use the data in C and have it in tables in real individual columns rather than in a single JSON string. This became easier when the text and media were completely separated from the mapping objects, because only the maps need to be reviewed for validation. But, as you warned, there may be other operations desired in the future that would be much easier to perform in SQL / C had the data been loaded into real columns rather than JSON strings.

At first, I thought it was going to be an advantage to store the data in columns, until I considered the type and frequency of database requests the application requires. There seems to be a balance between making edits to pages and building the next desired page. When page component data is stored in a larger number of specific rows, edits are simpler because less data is extracted and updated for each edit request. However, to build a new page, more rows of data must be extracted and assembled. Also, it appears that, when the full JSON string is held in RAM, it is less work to replace the full string than to update a portion of it. Of course, even if I am understanding this correctly, the differences in SQLite performance between the approaches may be insignificant to the user experience.

That isn't a design issue, for the design is primarily the mapping objects, but it still needs to be decided, for example, whether every text container should be a row or all text containers within the same page component should be in the same row.

Those are the kinds of things I was thinking about when using indexedDB, after I had a data structure in place. Of course, I changed that structure about three or four times along the way as I learned more and have much more yet to learn.

Thanks again for all the advice and warnings.