General question concerning database structure and number of databases to use.
(1) By Gary (1codedebugger) on 2020-11-06 02:51:44 [link]
I apologize for asking another rather novice question. I'm attempting to move an indexedDB web extension database structure to a native C application compiled with SQLite3. In the extension, users build portfolios of study modules. In indexedDB, each portfolio was a separate database of four object stores, because it was supposed to be a poor setup to perform a database version change each time a user added a new portfolio requiring another set of object stores. Data for all modules within a portfolio was stored in each of the four object stores, using a three-element array as a primary key.

In SQLite3, adding new tables appears to be rather simple. So, my question is: what should be considered in determining whether to create a separate database for each portfolio or one database for all possible portfolios, and whether to create a set of four tables for each module or include all modules in one set of four tables? I'd appreciate any guidance you could provide. Below are some characteristics of the database needs of this extension tool. I'm assuming it is a rather simple little tool, from the perspective of its database needs at least. Thank you for your assistance.

1. All the portfolios a user could ever build, plus any media files they choose to include, would never exceed a small fraction of the maximum SQLite3 database size.
2. Users could choose to open multiple portfolios with multiple modules open within each portfolio, either in separate browser tabs or within the same tab. Only one user can ever have a database open, because it all runs on the local machine like a desktop application.
3. There will be frequent database write requests (to capture the current state), almost all of which will involve very small amounts of data in a single row of each of two tables, or the read-only selection of ten to thirty rows from each of two tables in a single request. The only potential for operations involving larger amounts of data is when users store images, PDFs, audio, and/or video files to include in their modules and request the data to display a small number of them at any one time.
4. Lastly, I'd like it to be easy for users to share the portfolios they've built with others, and to incorporate those portfolios into the user's database or set of databases.
(2) By Gunter Hick (gunter_hick) on 2020-11-06 07:02:55 in reply to 1 [link]
Using database or table names as data is generally a bad idea, unless there are compelling reasons to do so. One of the reasons is that the table name must already be known at prepare time, so you will waste a lot of time building nearly identical SQL queries with just the table name changing, instead of preparing just one statement and binding the portfolio name. Another is that preparing a statement includes searching for the definitions of the tables, and that takes longer with a large number of tables.
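A minimal sketch of the difference, using Python's stdlib `sqlite3` for brevity (the OP's app is in C, but the engine behaves the same; table and column names here are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE notes(portfolio TEXT, body TEXT)")
con.executemany("INSERT INTO notes VALUES(?, ?)",
                [("history", "a"), ("math", "b"), ("math", "c")])

# One statement reused for every portfolio: only the bound value changes,
# so the SQL is parsed and planned once.
sql = "SELECT body FROM notes WHERE portfolio = ?"
for p in ("history", "math"):
    rows = con.execute(sql, (p,)).fetchall()

# The anti-pattern: a table per portfolio forces a distinct SQL string
# (and a fresh prepare) for each one, e.g. f"SELECT body FROM notes_{p}",
# because table names cannot be bound as parameters.
```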
(15) By Gary (1codedebugger) on 2020-11-08 02:12:37 in reply to 2 [link]
Thank you for pointing this out. Even if this were the only argument against adding a new set of tables for each portfolio, it is enough in itself.
(3) By David Butler (gdavidbutler) on 2020-11-06 16:10:02 in reply to 1 [link]
Yours is not a novice question. SQLite is my C programming secret weapon. But [when to use](https://sqlite.org/whentouse.html) is key. Yes, it is easy to add tables, but there is a cost. A [schema](https://sqlite.org/schematab.html) is not compiled. For an extreme example, I have a schema for [TR-181](https://cwmp-data-models.broadband-forum.org/tr-181-2-11-0.html) (machine generated from the XML schema):

```
sqlite3 tr-181.db
sqlite> select type,count(*) from sqlite_schema group by type;
type|count(*)
index|2644
table|1637
trigger|10993
```

It is 41,230,336 bytes on disk with no data loaded. It takes a moment to open it. But the advantage of a completely contained representation of a single device is worth every cost. Points 1, 2, and 3 fit perfectly with SQLite. Point 4 leads to separating things into separate databases and use of [attach](https://sqlite.org/lang_attach.html). This, too, fits perfectly with SQLite. I think you will find SQLite a good fit.
(5) By ddevienne on 2020-11-07 01:04:44 in reply to 3 [link]
> [...] a schema for TR-181 (machine generated from the XML schema):

Any chance you'd share details on how it was machine generated? This is an area I'm interested in.
(11) By David Butler (gdavidbutler) on 2020-11-07 22:21:01 in reply to 5 [link]
I use a trivial XML callback parser to capture the structure and details of the data model and spit out DDL. It is rather simple to do. In the case of TR-181, it is hard because of actions that must occur on DML (which is why the number of triggers is large). I would not recommend something as hard as CWMP for your first project! The idea is to spit out "CREATE TABLE *XML element name*(" when an XML element starts. Then ",*XML sub element name* **type**" for each sub-element that's a column. Then a closing ");" when the element closes. Depending on other attributes / sub-elements, you can add CHECK constraints, etc. Again, it's trivial, until it's not.
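The element-to-table idea above can be sketched in a few lines. This is not the poster's actual generator; it is a toy version in Python (`xml.etree`), with an invented data-model fragment, just to show the shape of the emitted DDL:

```python
import xml.etree.ElementTree as ET

# Toy fragment standing in for a real XML data model;
# element and type names are invented for illustration.
xml = """
<model>
  <Device>
    <Hostname type="TEXT"/>
    <Uptime type="INTEGER"/>
  </Device>
</model>
"""

def ddl_for(elem):
    # "CREATE TABLE <element>(" at element start, one "<sub-element> <type>"
    # per column, and a closing ");" when the element ends.
    cols = ", ".join(f"{c.tag} {c.get('type', 'TEXT')}" for c in elem)
    return f"CREATE TABLE {elem.tag}({cols});"

root = ET.fromstring(xml)
for table in root:
    print(ddl_for(table))
# CREATE TABLE Device(Hostname TEXT, Uptime INTEGER);
```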
(6) By Gary (1codedebugger) on 2020-11-07 01:21:14 in reply to 3
Thank you for the information about the cost of adding tables, the example of the large schema, the links, and confirming SQLite is a good fit considering the four points. The information was very helpful.
(4) By Lifepillar (lifepillar) on 2020-11-06 20:09:43 in reply to 1 [link]
>I'd appreciate any guidance you could provide.

Review the staples of Relational database technology. In particular, read about data integration and the distinction between schema and instances.

**Data integration**

Just consider this: usually, DBMSs do not allow you to perform queries across databases. So, if you have one database per portfolio, you cannot (easily) perform queries across all portfolios. Besides, with different databases you will most likely end up storing duplicate and/or inconsistent data. There are other reasons to prefer integrated data, but for “a rather simple tool” these two should suffice.

**Schema vs instance**

A database schema is not just a bunch of “tables”: it describes a **language** (in a very formal sense) with which you describe **facts** (by means of *tuples* in *relations*, or, informally, as “records” in “tables”). As a rule (with few exceptions), adding a new relation schema (i.e., a “table”, in common parlance) is justified only if you need to describe a new, distinct fact type. Concretely, when you need to add data, say, about a new portfolio, you shouldn't have to create a new table, but add data to one or more existing, properly designed tables that accurately capture all the relevant fact types about “portfolios”. Of course, that requires determining such fact types.

So, your problem essentially amounts to how to properly model your requirements as a (one) well-designed database. There is currently (only) one (ISO/IEC/IEEE) standard for Relational database design, which is [ISO 31320-2](https://www.iso.org/standard/60614.html) (note: only the “key-style” modeling described in the standard is relevant to Relational database design). Unfortunately, the ISO document is not freely available, but the relevant parts are nearly identical to the former US government's Federal Information Processing Standard (FIPS) PUB 184, which can be found [here](https://www.idef.com/wp-content/uploads/2016/02/Idef1x.pdf). The latter also contains a detailed description of a sound **methodology** for database design. Even if you decide not to adopt IDEF1X as your modelling methodology, I'd recommend that you read at least Annex A of the FIPS document, which clearly describes how to approach database design.

>Lastly, I'd like it to be easy for users to share the portfolios they've built with others, and to incorporate those portfolios into the user's database

I assume, from your description, that each user has their own SQLite-backed app, and that you want users to share data between their respective databases. If that is the case, when your database is well-designed—in particular, if you have devised good relational keys—that becomes (almost) trivial.
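The “new portfolio is a new row, not a new table” principle can be sketched as a schema, here via Python's stdlib `sqlite3` (all table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  -- One integrated schema for all portfolios: adding a portfolio adds
  -- rows to existing tables rather than creating new tables.
  CREATE TABLE portfolio(
    portfolio_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE  -- a natural key helps sharing/merging
  );
  CREATE TABLE module(
    module_id    INTEGER PRIMARY KEY,
    portfolio_id INTEGER NOT NULL REFERENCES portfolio,
    title        TEXT NOT NULL,
    UNIQUE(portfolio_id, title)
  );
""")
con.execute("INSERT INTO portfolio(name) VALUES ('greek'), ('latin')")
# Queries across all portfolios are now ordinary joins, e.g.:
#   SELECT p.name, count(*) FROM portfolio p
#   JOIN module m USING (portfolio_id) GROUP BY p.name;
```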
(7) By Gary (1codedebugger) on 2020-11-07 02:08:52 in reply to 4 [link]
(8) By Simon Slavin (slavin) on 2020-11-07 10:45:45 in reply to 7 [link]
Relating to storing BLOBs …

Given the data structure you describe above, if it would be natural to make a 'mediaData' column in the existing single table you describe, then you should do it, and can do it efficiently. Because of the way SQLite stores and retrieves data, the 'mediaData' column should be the last (right-most) column defined for the table.

Here's why: SQLite keeps all the data for a row together in the file: column1, column2, column3, etc. Each of these columns can have a different length for each row of data. When SQLite needs to read some data from a row, it reads from the first column of the row to the last column the SQL command needs. In other words, if you had a six-column table and did <code>SELECT column3 FROM myTable WHERE column1=200</code>, SQLite would have to read column2, because it needs to know where it ends to find column3. Once it has read column3, SQLite knows it has everything it needs and doesn't bother reading c4, c5 or c6. By putting the column holding BLOBs at the end of the sequence, you arrange that SQLite will never read it (or reserve memory for it) unless the SQL command needs it.

(The above explanation is simplified for clarity, e.g. it ignores covering indexes.)
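A minimal sketch of that layout, using Python's stdlib `sqlite3` (table and column names invented): the frequently-read small columns come first and the BLOB last, so a query that stops at an early column never touches the blob.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
  CREATE TABLE item(
    item_id   INTEGER PRIMARY KEY,
    state     TEXT,
    updated   INTEGER,
    mediaData BLOB        -- last (right-most) column, per the advice above
  )
""")
# zeroblob(n) makes an n-byte stand-in for a real media file.
con.execute("INSERT INTO item VALUES (1, 'open', 0, zeroblob(100000))")

# This read only walks the row as far as 'state'; the 100 KB blob
# (and its overflow pages) are never needed.
state = con.execute("SELECT state FROM item WHERE item_id = 1").fetchone()[0]
```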
(9) By Keith Medcalf (kmedcalf) on 2020-11-07 16:07:52 in reply to 8 [link]
Except, of course, that if you update the row, you have to re-write the blob. When any part of the row is changed the WHOLE row is re-written. This means that if you put a 4 GB blob as the last column of a table with 4 other data fields before it, whenever you update one of the other 4 fields you will re-write the entire record INCLUDING the 4 GB blob which you will have to both read into memory and write back out to disk, perhaps multiple times. This is one of the reasons that large blobs (and text fields, which are the same thing, just with overloaded meaning) should be stored in their own table.
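A sketch of that split, again with invented names, assuming the small state columns are the part that changes frequently: the large payload lives in its own table, joined by key, so updating the state never rewrites the blob.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  -- Small, frequently-updated state in one table...
  CREATE TABLE item(item_id INTEGER PRIMARY KEY, state TEXT, updated INTEGER);
  -- ...and the large payload in its own table, so an UPDATE of 'state'
  -- never re-reads or re-writes the blob's pages.
  CREATE TABLE item_media(
    item_id   INTEGER PRIMARY KEY REFERENCES item,
    mediaData BLOB
  );
""")
con.execute("INSERT INTO item VALUES (1, 'open', 0)")
con.execute("INSERT INTO item_media VALUES (1, zeroblob(100000))")

# The frequent small update touches only the small row:
con.execute("UPDATE item SET state = 'closed', updated = 1 WHERE item_id = 1")
state = con.execute("SELECT state FROM item WHERE item_id = 1").fetchone()[0]
```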
(13) By Gary (1codedebugger) on 2020-11-08 01:48:32 in reply to 8 [link]
Thank you very much; I was unaware of how that worked in SQLite. Even though it may cause additional work for table rows that will be updated, I assume it would be beneficial for tables with static rows containing reference information.
(14) By Gary (1codedebugger) on 2020-11-08 01:50:54 in reply to 13 [link]
Thank you for pointing this out. For my particular use case, it would be quite a problem to place the blob on the same row with the rest of the columns because of the high frequency of user-initiated updates.
(16) By Simon Slavin (slavin) on 2020-11-08 02:29:08 in reply to 14 [link]
I think you have figured out your answer to this part of the problem. If you frequently update a data row without changing the BLOB, then it makes sense to store the BLOBs in a different table.
(17) By tom (younique) on 2020-11-08 10:15:57 in reply to 8 [link]
> SQLite would have to read column2, because it needs to know where it ends to find column3. Once it has read column3, SQLite knows it has everything it needs and doesn't bother reading c4, c5 or c6. By putting the column holding BLOBs at the end of the sequence you arrange that SQLite will never read it (or reserve memory for it) unless the SQL command needs it.

Sorry, but I do not understand that. There has to be some length information for the BLOB, because otherwise one couldn't know where it ends. So all SQLite has to do is skip (not read!) n bytes. That shouldn't be any slower than skipping a floating-point number or an integer in column2. For writing, it should also make no difference whether a 1 MB blob is written at the beginning or at the end. The whole row should always be the same size regardless of the order of columns.
(18) By Kees Nuyt (knu) on 2020-11-08 17:48:13 in reply to 17 [link]
The point with BLOBs is that they can easily be larger than a database page, so overflow pages get involved. They have to be read from disk if not cached. If the BLOB itself is not referenced by the query, reading those overflow pages can be avoided by placing all short columns at the beginning of the row.

--
Regards, Kees Nuyt
(19) By Keith Medcalf (kmedcalf) on 2020-11-08 21:10:31 in reply to 18 [link]
Mutatis mutandis for TEXT fields, which can also be larger than a page. There is no difference between a CLOB (TEXT) and a BLOB other than the overloading of type information, and the same issues apply to both (or to any record that is longer than the payload size of a page). It does not matter whether the field is a 2 GB movie or a 2 GB novel written in ASCII prose. What matters is the size.
(20) By Gary (1codedebugger) on 2020-11-09 04:01:12 in reply to 19 [link]
If all text fields were limited to the page size (by nature of the application itself), or they were not limited but at least 95% were expected to be less than the page size, would there be any reason to hold the text data in a separate table from BLOB data that was expected to always exceed the page size? I don't mean with respect to how columns are arranged or the performing of updates, but only in the case of having two two-column tables: key and text, and key and BLOB. Is it bad practice to have tables that mix rows of small size with rows of large size when it is known in advance which rows will be in each size category? In my particular case, there is no other reason that I know of at this point for placing the user input text and BLOB data in different tables other than a considerable difference in size. Thank you.
(21) By Keith Medcalf (kmedcalf) on 2020-11-09 04:37:40 in reply to 20 [link]
There is no point in moving "small fields" from the "main" record object to a "secondary" record object if the entire record will fit as payload on one page (and it is not part of the data normalization process). The only time a difference comes into play is if the "entire record" will not fit on a page and must consume an overflow page (particularly more than one overflow page), because then access to any field after the overflow requires retrieving the overflow page, finding the pointer therein to the next overflow page, and so on and so forth, until you have finally found the "column" you are looking for. Similarly, an "entire record" which cannot fit on a single page will require the I/O of the page on which the initial fragment is located, plus all of the other pages comprising the overflow chain, when that record is "re-written" (as in INSERT/REPLACE/UPDATE/DELETE).

The reason that this is usually mentioned for BLOBs is that the very name suggests largeness (Binary Large OBject). But it also applies to CLOBs (Character Large OBjects) -- otherwise called TEXT fields. It is just that the common connotation of TEXT is that it is of piddling size, not of "Large" size. If there were an MPREAL (MultiPrecision REAL) that could consume more than one page of space, then it would have the same considerations.

In other words, the consideration applies to the size of the record and not the type of the data. Putting 1 million REALs in a single row would have exactly the same considerations, because that would require 8 MEGABYTES of page payload to be I/O'd on every INSERT/REPLACE/UPDATE/DELETE, and multiple pages to be read via the overflow chain until the requested data in the record is located.
(22) By Gary (1codedebugger) on 2020-11-12 06:18:18 in reply to 21 [link]
Thank you. I didn't ask the question correctly nor provide sufficient information. I wasn't trying to ask about separating data on the same row into two tables, but whether or not CLOBs and BLOBs of different sizes should be stored in the same column of the same table.

There are two categories of data that need to be stored for the application I'm asking about. Users build small documents inside the application by adding UI elements and then inputting text or linking media files (audio, video, PDFs) within those elements. The position, sequence order, size, and everything else concerning each document element, except the actual text or media file itself, is stored within a map object in a different table. Each map object can have any number of text blocks and media files associated with it. They are stored in another table and have the primary key of the map object as a column. That table has only three columns: the auto-generated rowid, the primary key of the associated map object, and the text or media data. The only column in that table that will ever be updated/replaced is the text/media column.

My question is: are two tables required for the text and media data, that is, one for CLOBs and one for BLOBs? I ask for two reasons. One is that, due to the nature/structure/purpose of these small documents, at least 95% of the time each CLOB will never exceed more than a small paragraph of text and will always be much smaller than any of the BLOB media files. I think the text rows will almost always be less than the default page size, and the BLOB rows will always exceed it. The other is that I wasn't sure if it is safe practice to declare a column as a BLOB and frequently write text to it also, relying on the column affinity rules to accept text in the BLOB column.

There will be two general database tasks associated with this portion of the data. One is to retrieve a map object and all CLOBs and BLOBs associated with that map's primary key, and pass them to the application to construct the requested document. The other is to either overwrite a CLOB or add/delete a BLOB record as a user builds/edits a document. Thank you.
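For concreteness, a sketch of the structure described above (all names are invented stand-ins, via Python's stdlib `sqlite3`), with one mixed text/media column and the two tasks as queries:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
  CREATE TABLE map(map_id INTEGER PRIMARY KEY, layout TEXT);
  CREATE TABLE content(           -- rowid is the implicit auto-generated key
    map_id INTEGER NOT NULL REFERENCES map,
    data   BLOB                   -- holds both small text and large media
  );
""")
con.execute("INSERT INTO map VALUES (1, 'layout goes here')")
con.execute("INSERT INTO content VALUES (1, 'a short paragraph')")
con.execute("INSERT INTO content VALUES (1, zeroblob(100000))")

# Task 1: fetch everything attached to a map's key to build the document.
blocks = con.execute(
    "SELECT rowid, data FROM content WHERE map_id = 1").fetchall()

# Task 2: overwrite one CLOB in place as the user edits.
con.execute("UPDATE content SET data = 'edited text' WHERE rowid = ?",
            (blocks[0][0],))
```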
(23) By Keith Medcalf (kmedcalf) on 2020-11-12 22:22:44 in reply to 22 [link]
There is no difference between a BLOB and TEXT provided that the interfaces used to access the *text* field are the same as the database encoding. That is, if the default encoding is UTF8 and the APIs you use to access the field are the _text variants (which expect to input 8-bit characters and output 8-bit characters), *OR* the database encoding is UTF16 and the APIs used to access the data are only ever the _text16 variants, then there will be no "diddling around" with the data you have provided when it goes "into" or "out of" the database record.

The *only* difference between a BLOB and TEXT is that TEXT is assumed to contain TEXT, meaning that it can be converted between the various API and internal encoding formats, UTF-8, UTF-16BE, and UTF-16LE, depending on which combination of APIs you use for dealing with TEXT fields, what the default database encoding is set as, and the particular platform. Functions for dealing with "TEXT" also assume that a 0 codepoint terminates the text (since this is part of the innate definition of what comprises a text string). Declaring something as "TEXT" when it is not well-formed text may result in various explosions, meltdowns, and proper behaviours which the user feels are improper, because they failed to comply with the requirement that TEXT be properly formed and therefore shot themselves in the foot.

There is nothing which prevents you from storing TEXT in a BLOB field. How your application "interprets" the bytes is entirely a matter for your application. The attribute TEXT is only important if you want SQLite3 to treat the data as TEXT (for the purposes of diddling about with it and interpreting its contents), in which case it must be "properly formed" text.
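A small demonstration of the affinity point, using Python's stdlib `sqlite3`: a column declared BLOB has BLOB affinity, which performs no conversion, so whatever storage class you bind is what is stored, text included.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(data BLOB)")

# BLOB affinity stores values unchanged: bound text stays text,
# bound bytes stay a blob.
con.execute("INSERT INTO t VALUES (?)", ("plain text",))
con.execute("INSERT INTO t VALUES (?)", (b"\x00\x01binary",))

types = [row[0] for row in con.execute("SELECT typeof(data) FROM t")]
print(types)  # each value keeps the storage class it was bound with
```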
(10) By Lifepillar (lifepillar) on 2020-11-07 18:26:54 in reply to 7 [link]
(12) By Gary (1codedebugger) on 2020-11-08 01:44:34 in reply to 10 [link]