SQLite Forum

What would be the recommended way to hold an array of ~600 double values?
Login

What would be the recommended way to hold an array of ~600 double values?

(1) By PazO (JohnSmith) on 2020-11-18 04:53:08 [link] [source]

For historical reasons my application holds many records in the database where each record holds an array of ~600 double values.
The number of the records is mostly around 15K but theoretically can reach 200K.

The current format of these values is in a string.
The benefit of this string is that all double values that hold short values such as '0' or '1' are taking only 1 byte (plus 1 for delimiter).
The down side is the parsing and recreation of this string everytime any value is modified or read.

What would be the preferred format to hold such data?
I was considering:
1. Use BLOB that holds an array of 'double' values. The downside here is that the size of this array is very large: 600 * 8 (size of double), and this will be multiplied by the number of records.
2. Create a new table with a column for each double value - This is creating a table with ~600 columns... The benefit is to use sqlite optimization for space for such values.
3. Leave things as they are and work with string.

Thank you for any tip,
Pazo

(2) By Keith Medcalf (kmedcalf) on 2020-11-18 05:57:29 in reply to 1 [link] [source]

create table Records
(
   row integer not null,
   col integer not null,
   value numeric,
   primary key (row, col)
) without rowid;

(3) By Warren Young (wyoung) on 2020-11-18 06:07:53 in reply to 1 [link] [source]

the size of this array is very large

On what machine supporting SQLite is 4.6875 kiB "very large?"

I mean, yes, it's very large on an Apple //e, but that won't run SQLite at all, so...

Quoting Donald Knuth:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Here I think you're over-valuing space optimization, paying a cost not only in CPU optimization during parsing, but also CPU time to re-convert doubles back to strings for storage. Worse, I think you're overspending on programmer time, the most precious commodity of all on most projects. How much RAM and disk space does your hourly rate buy?

(4) By PazO (JohnSmith) on 2020-11-18 07:26:47 in reply to 3 [link] [source]

This answer did not really help as it did not suggest anything.

The size does matter because each session saved by our customer is held by a single sqlite database file. It does make a different if avg. file-size for a single session will be ~50MB or ~1GB.

As for performance - our application must save integrity in cases of crashes or power loss thus if every modification our customer is doing has to go through text parsing prior to database save it does make a noticeable difference.

(6.4) By Warren Young (wyoung) on 2020-11-18 08:56:10 edited from 6.3 in reply to 4 [link] [source]

it did not suggest anything.

It suggested that you defend your claim that 4k is a big problem.

I support Keith's basic design, though I think his "col" column is pointless, given the little you've shared of your program's design. That feature of his schema assumes you'll have more than one array of doubles per SQLite row, but I don't see that you've actually said that.

I challenged you on an ROI basis before. Let's say you're not actually a software developer, but you're using SQLite in support of some other job. Or, maybe you're outside the US. Either way, I'm going to say your complete salary and benefits package is $52k to make the math easy. If you spend a week fiddling around with space optimizations this fiscal year, you will have spent $1000 of your organization's money.

The sweet spot of the storage market is about $32/TB these days, so $1000 will currently buy you 10 TB of triply-redundant disk storage. DDR4 ECC UDIMM server RAM is going for about $4/GB, so the same $1000 will buy you 250 GiB of RAM. You can split those two values any way you like: 128 GiB of RAM in the server plus a 3-disk RAID-1 of 4 TB disks to back the DB, for example.

Is your salary + benefits package more than $52k? Okay, then your week of time doing micro-optimizations will buy even more storage.

The question then becomes, how many of these sessions are there?

Let's first take your 50 MiB session size. I'm going to guess this is your ~4k number multiplied by 8, being the 1-byte 0/1 value vs. 8-byte doubles. That's probably insupportable, since as you said earlier, you need some kind of delimeter for ASCII storage, but I'll give you the 8x multiplier anyway, because I am kind. :)

50 MB divides into our 4 TB RAID as 800000 sessions. Do you have that many customers? Oh, and realize that you've now got enough RAM to cache the entire DB for your $1000 in hardware cost.

Now let's take your worst-case value of 1GB: we've got room for 4000 of those in the DB. Do you have more than 4000 active customer sessions? Are they all maximally-sized, or is there a mix of max-sized sessions and lesser ones?

You can work it inversely, too: if each customer costs you $0.25, using the same worst-case numbers, are they each paying you at least a quarter-dollar a year? If so, then why are you spending their money fighting for every last byte of storage?

It does make a different if avg. file-size for a single session will be ~50MB or ~1GB.

Please don't shout. (Mind, this is a moderator speaking to you now.)

if every modification our customer is doing has to go through text parsing prior to database save it does make a noticeable difference.

Yes, thus my point: give up on ASCII storage of doubles and use the platform's actual binary representation. Now you can update single values, if you want, which will not only be good for I/O time, it'll be good for durability, because fewer bytes of the DB will be changing at any one instant.

(9) By PazO (JohnSmith) on 2020-11-18 10:55:11 in reply to 6.4 [link] [source]

Please don't shout...

Sorry, the bold came accidently, not sure how (I did not use '*')

(10) By Stephan Beal (stephan) on 2020-11-18 11:03:29 in reply to 9 [link] [source]

Sorry, the bold came accidently, not sure how (I did not use '*')

Just FYI, you can always see the original source code of a post by following the "source" link next to its header:

https://sqlite.org/forum/forumpost/7230766ac6?raw

(You managed to get a <b> tag in there.)

(14) By PazO (JohnSmith) on 2020-11-18 13:16:12 in reply to 10 [link] [source]

...typo - it meant to be a line-break - <br>

(5) By Gunter Hick (gunter_hick) on 2020-11-18 07:59:48 in reply to 1 [link] [source]

What do those 600 double values actually mean?

And why are they lumped together in one record?

Are they actually related or just a random collection?

Do they represent a sparse array (in which case the suggestion to store them separately would make perfect sense)?

(8) By PazO (JohnSmith) on 2020-11-18 10:51:05 in reply to 5 [link] [source]

I will give more insight about the problem:

The application is high-end product used by top professional people in this particular industry. It is used to control a defined-set of hardware switches, gauges and controls associated with single flow of data. This single flow of data contains ~600 distinctive real values that can be modified by both moving the switches and the controls using the application or by the external hardware controls.

Once all these values are set they are saved as a single setup for this particular single flow of data. This is called a 'preset'. We can save a number of different presets for the same flow of data. On top of this there could be a number of different flows of data for other logical components.

Focusing on the single 'preset' - traditionally we managed this data in an XML string, that was eventually saved to the database. This was not very effective, as it demand that in each modification of any value, this XML-string will be rebuilt re-saved in the database.

My current task is to move the 'preset' data format from XML-string to SQL data types, that can be more easily viewed and edited.

For this I considered the followings:
1. Use BLOB to hold the array of real values,
2. Create a dedicated table that has one column per real value (hence ~600 columns),
3. Leave things as they are today - use a string

Hope this clarifies the motivation

(11.1) By Warren Young (wyoung) on 2020-11-18 12:01:18 edited from 11.0 in reply to 8 [link] [source]

The application is high-end product used by top professional people in this particular industry.

So invest $2 per customer in a conservative 5% return account, taking the resulting $0.10 per year to pay for the storage each customer will use. Now you can store their data indefinitely, assuming you roll over server hardware once every 3 years.

For this I considered the followings:

Why are you leaving out Keith's suggestion? 4. Store each preset vertically, rather than horizontally.

CREATE TABLE Customer(id, ...)

CREATE TABLE Presets(id, customer_id, description, ...)

CREATE TABLE PresetData(preset_id, order, value)

SELECT value FROM PresetData WHERE preset_id = ?1 ORDER BY order

Each value takes at least 2 bytes in ASCII form, one for a single-digit integer and a delimiter. (e.g. Comma, taking little-brother's JSON suggestion.) The ratio is therefore 4:1 for that easy case, not 8:1, coming down to 1:1 by the time you're at 5 decimal digits. 8-byte doubles can encode 15 digits to the right of the decimal point, so the doubles might actually be cheaper, if the preponderance of data with 6+ decimal digitsis high enough.

(12.1) By Warren Young (wyoung) on 2020-11-18 12:03:56 edited from 12.0 in reply to 11.0 [link] [source]

Alternate plan: I assume these measured control values go through an ADC at some point, so they're probably better stored as integers anyway. If it's an 8-bit ADC, the value 12 is better stored as INT(12) rather than as 12.0 / 256.0, whether ASCII or FP decimal.

Now you're storing data in 32-bit values, so the ratio between your present storage method and this one is 2:1 best case, not 4:1. As soon as you get to tenths of precision, it's 1:1 between the two cases: "A.B," is 4 bytes.

You may want to add a "scale" constant to the Presets table so you can divide that into the measured integers to recover the decimal form. This will then let you work with systems with different levels of precision, such as a later 10-bit high-precision system, where the scale would be 1024.

(15) By PazO (JohnSmith) on 2020-11-18 14:05:33 in reply to 11.1 [link] [source]

So invest $2...

Regarding file-size:
Think that you create and save 6 WORD documents.
Looking at your hard-disk now you find that each document size is ~1GB instead of ~20MB.
Now you cannot attach them to email easily, and it is slower to copy them.
So what is the point to talk about cost of storage?
This is missing the point.


CREATE TABLE PresetData(preset_id, order, value)

This is a good idea but keep in mind that in this case we hold extra integers per each real value.
Where as in binary array BLOB we will hold only the values ({0.234; 0.8745; 23.646;, ...}),
in this case per each real value, the database is holding extra two integers: {(0, 0, 0.23423), (0, 1, 0.234234), (0, 2, 34.3443),...}

(18) By Warren Young (wyoung) on 2020-11-18 14:19:23 in reply to 15 [link] [source]

each document size is ~1GB instead of ~20MB.

Where's that 50x multiplier coming from? We've already established that the worst case difference here is 4x, and that only in the case where all values are single-digit integers.

what is the point to talk about cost of storage?

If you spend $1000 in developer time to save $50 in storage, you've wasted someone's money.

(Which is why this is probably my last message in this thread: time to move on to something more productive.)

the database is holding extra two integers

Yes, two 64-bit ints and a 64-bit float per datum = 24 bytes, vs. strings like "1.2345," 8 ASCII characters for 4-decimal digit precision, giving a 3:1 size difference in this case, while running faster because it isn't doing all this ASCII string manipulation.

TANSTAAFL.

(19) By Richard Damon (RichardDamon) on 2020-11-18 15:07:57 in reply to 18 [source]

But remember that SQLite doesn't store integers as 64 bit numbers (unless they are very bit) but uses less bits for smaller numbers, so the cost isn't anywhere near as large.

Yes, the blob of 600 float will be smaller, by.a bit, but at the cost of imposing technical debt on the system if anything needs to change, and if some systems don't need all the values, then could be bigger.

(20.1) By Warren Young (wyoung) on 2020-11-18 15:30:48 edited from 20.0 in reply to 19 [link] [source]

Right, so more like two 16-bit ints and a double, allowing up to 64k presets and 64k values per preset before the rows start getting larger.

If you add in my "store ADC measurements as ints" idea, the double becomes a 1-bit int for 8-bit ADCs or a 2-bit int for any reasonable ADC beyond that, since ADCs beyond 16-bit are exceedingly rare outside of audio-land.

(21) By Richard Damon (RichardDamon) on 2020-11-18 15:50:03 in reply to 20.1 [link] [source]

(Bytes, not bits) And they only get bigger for the rows that use the bigger values, as the number of bits in the value is part of the internal type of the field.

You also get that you can omit values that aren't needed (one setting disables a channel, no need to store all the unused settings for that channel)

(13) By Richard Damon (RichardDamon) on 2020-11-18 12:12:44 in reply to 8 [link] [source]

One other option for storing the results would be a table with a multipart primary key, One part bringing the preset number, the second being the setting number, and then a data column for the value.

Something like:

CREATE TABLE Presets ( preset_no INTEGER, value_no INTEGER, value REAL, PRIMARY KEY(preset_no, value_no) );

You can then get a complete set of values for a given preset with something like

SELECT value_no, value FROM Presets WHERE preset_no = ?;

where you bind the ? to the preset_no you want to fetch.

This has the ability to allow you to omit values from a preset if they don't matter to that one.

(16) By Gunter Hick (gunter_hick) on 2020-11-18 14:17:52 in reply to 8 [link] [source]

Are you by any chance talking about a sound (or light) mixing application? Those tend to have a bunch of controls, but would be structured into channels, buses, etc.

(17.1) By Gunter Hick (gunter_hick) on 2020-11-18 14:18:04 edited from 17.0 in reply to 8 [link] [source]

Deleted

(7) By little-brother on 2020-11-18 09:11:19 in reply to 1 [link] [source]

The current format of these values is in a string.

At least you could use json by adding open and closed brackets [10.23,0,12,...]. It allows you to use json1 function to manipulate data.

How to store data depends on your priorities: CPU, storage size, easy to view/changing and etc. There is no silver bullet.