SQLite Forum

Remaining useful space before spilling

(1) By anonymous on 2023-08-25 21:59:29 [link] [source]

I was reading the file format documentation, and there is talk of a payload value P and a threshold (X = U-35 for table b-tree leaf pages) that determines how large the cell content can be before it is forced to spill onto an overflow page. The calculation seems simple, but how can one know how much space is actually available?
U is said to be the page size minus the reserved bytes. But a page may already be partially filled by a couple of records, and how would one know the exact size of those rows? That seems hard, because of varint-encoded values and because there is no public API to get the size of fields, as far as I know.

Now I am wondering: what is the best way to calculate how much useful storage is available in the current page?
For example, if we have a simple rowid table with only one blob column, what is the maximum blob size for the next insert such that it fills this page without spilling?

Bonus question:
If a value is forced onto an overflow page, what happens to the free space of that overflow page? For example, with a 4 KB page size, a 6 KB value takes one and a half pages; does the remaining 2 KB ever get used, or can another overflow use it?

(2) By anonymous on 2023-08-29 08:46:31 in reply to 1 [link] [source]

As I didn't get any answer or comment, I have to ask: is my question irrelevant or wrong?

(3) By Stephan Beal (stephan) on 2023-08-29 09:04:29 in reply to 2 [link] [source]

As I didn't get any answer or comment, I have to...

There is, to the very best of my knowledge, no public API for asking how much space is left on a given db page.

(4) By anonymous on 2023-08-29 09:09:15 in reply to 3 [link] [source]

Is there any clear way to calculate it?
I have a lot of blobs, and I save them in chunks to limit memory usage and stay under the blob size limit.
But my database gets much larger than I want, as those blobs leave many half-used pages.
So I would prefer to split them dynamically.

(5) By Stephan Beal (stephan) on 2023-08-29 09:22:10 in reply to 4 [link] [source]

Is there any clear way to calculate it?

Not without, perhaps, navigating the sqlite3 internals. The internals are private APIs which can and do change over time, so any reliance on them would be ill-advised.

In client code, if you do:

#include "sqlite3.c"
// instead of:
//#include "sqlite3.h"

the code in that file will have full access to sqlite's internals.

(6) By anonymous on 2023-08-29 09:27:31 in reply to 5 [link] [source]

That is unfortunate.
So there is no better way to store blobs without wasting extra space?

Maybe if there were an official SizeOf() I could calculate how much data I have inserted or deleted, and use that to know which page I am on.

(10) By Ryan Smith (cuz) on 2023-08-29 10:03:32 in reply to 6 [link] [source]

There are many ways to store BLOBs more efficiently, but not in SQLite.

You are trying to make a Database into a Random Access File. Those two things serve different purposes.

A much faster and better way to store BLOBs would be to simply append all BLOB data to a single file and store (in SQLite perhaps) the offset and length of each. This is, however, a terrible idea if you mean to change those blobs over time.
One way to handle changes is to invalidate the current data for a BLOB and overwrite it in place if the new BLOB is smaller, or re-append it and store the new offset and length, à la tape drives, then repack the file once space waste crosses a certain margin. A sketch of the basic append-and-index idea follows below.
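For what it's worth, a minimal sketch of that idea in C (blobstore_append, blob_index and "blobs.dat" are made-up names, purely for illustration; error handling is mostly omitted):

#include <stdio.h>
#include <sqlite3.h>

/* Append one blob to the data file and record its offset and length. */
static int blobstore_append(sqlite3 *db, const void *data, long len){
  sqlite3_stmt *st;
  long off;
  FILE *f = fopen("blobs.dat", "ab");
  if( f==0 ) return SQLITE_ERROR;
  fseek(f, 0, SEEK_END);
  off = ftell(f);                 /* end of file = offset of the new blob */
  fwrite(data, 1, (size_t)len, f);
  fclose(f);
  sqlite3_prepare_v2(db,
      "INSERT INTO blob_index(off, len) VALUES(?1, ?2)", -1, &st, 0);
  sqlite3_bind_int64(st, 1, off);
  sqlite3_bind_int64(st, 2, len);
  sqlite3_step(st);
  return sqlite3_finalize(st);
}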

A better way is to write the blobs into preset block sizes, or pages, if you will. The problem with this method is that you get gaps due to half-filled pages, and you still have to relocate a blob once it outgrows its allocated pages.

SQLite actually does all of the above, but it comes at the problem from the objective of typical data-storage efficiency (as opposed to random-length BLOBs) and a very wide querying/indexing and general RDBMS capability objective, as opposed to a BLOB-specific one. It's a jack-of-all-trades, and while it does most of those trades with efficiencies well honed over time, it will never be the most efficient at the very narrow subset that you are trying to optimize. That is probably also why such specific tuning APIs are not available.

Further to this, you can highly optimize for speed, or space, but not both. SQLite does a fair job at both but probably leans more toward speed than space optimization.

The way I see it is that you (along with all of us) have two options:

  • If your BLOB storage needs are simple enough and you can live with some space wastage in exchange for relative quickness, let SQLite do it.
  • If your BLOB storage is stringent and needs to be highly tuned for either space or speed, do it yourself in a more efficient way, perhaps using SQLite to keep the indexing.

(11) By anonymous on 2023-08-29 10:59:48 in reply to 10 [link] [source]

Your analysis is on point.
- Besides indexing and storing data, SQLite gives you ACID too, and I want that.
- You cannot do that safely and securely with a plain file.
- Also note that for small blobs (up to about 64 KB) SQLite is faster most of the time.
- And copying, backups and similar tasks are much easier, as you have a single file to deal with.

My only problem is reducing that "space wastage" by sizing the inserted chunks of those blobs according to what is left in the page, so I don't trigger SQLite into making half-used overflow pages.
I understand SQLite is a generic database, focused on SQL. But it is far from just that now, and using it as blob storage seems quite a useful task.

(12) By Ryan Smith (cuz) on 2023-08-29 11:33:13 in reply to 11 [link] [source]

I agree with all your points, except the "...but it is far from that now,.." bit. It really still is just that - a generic RDBMS, which is a strength of it, by the way, not a shortcoming.

That said, I don't disagree; it would be lovely to "help" SQLite place those data bytes more efficiently. I'm merely saying that the use case is neither common nor important enough to justify adding such APIs and maintaining them. You're a bit left out in the cold by virtue of your uncommon, specific needs.

---Aside---

If you could find a way to jiggle the SQLite code somewhat to achieve it, using what others suggested in this thread perhaps (and it sounds like it is within your competence), and make your version pass the test suite, then perhaps you could maintain such a fork and maybe enlist help maintaining it. While the need is uncommon, it's certainly not unique; I have seen other posts over the years interested in upping BLOB storage efficiency.

Vanilla SQLite, for now, has no public methods to achieve it.

(15) By anonymous on 2023-08-29 12:58:14 in reply to 12 [link] [source]

Yes, I think I can make it work. But one big no-no for me is messing with SQLite internals. I like to let it do its job while I do mine.
This post is meant to gather thoughts on the best ways of doing this without resorting to hacks.
And you are right, one could make a fork, but I respect @drh's thought process; if he thinks there is no need for this, I will respect that and find a more vanilla way, with a sad face.

(13) By Richard Hipp (drh) on 2023-08-29 12:21:33 in reply to 10 [link] [source]

A much faster and better way to store BLOBs would be to simply append all BLOB data to a single file and store (in SQLite perhaps) the offset and length of each.

You would think that would be the case, wouldn't you? But the experimental evidence is not quite that clear-cut. See, for example, the "35% Faster Than The Filesystem" document (https://www.sqlite.org/fasterthanfs.html):

There are indeed cases where reading directly from the filesystem is faster than reading from SQLite. However, SQLite is surprisingly competitive, and depending on your circumstances might be the faster option.

(14) By anonymous on 2023-08-29 12:55:19 in reply to 13 [link] [source]

Indeed, and I have seen the results.

My problems, as I posted before, are:
- The speed of inserting BLOBs is lower than the filesystem: 1.5 GB/s on a 2.5 GB/s machine. I guess that is the cost of processing pages.
- Second, the topic of this post: the overhead of pages. I am looking for a better way to organize BLOBs so as to waste as little storage as possible.

It would be lovely to have your thoughts on this, especially as you designed sqlar and have suggested storing content as blobs over the years.

(7) By ddevienne on 2023-08-29 09:46:59 in reply to 1 [link] [source]

As stated by others, there's no API for that.

But there's the dbpage extension, which lets you read raw pages that you can then parse on your own (if adventurous).

There's also the new recover code, used in the CLI's .recover command, which uses the dbdata extension; that one gives fine-grained access to what's inside pages. A sketch of both is below.
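For instance (a sketch; assumes a build where the sqlite_dbpage and sqlite_dbdata virtual tables are available, as they are in the CLI):

#include <stdio.h>
#include <sqlite3.h>

/* Dump the size of each raw page, then every field of every cell. */
static void dump_pages(sqlite3 *db){
  sqlite3_stmt *st;
  sqlite3_prepare_v2(db,
      "SELECT pgno, length(data) FROM sqlite_dbpage", -1, &st, 0);
  while( sqlite3_step(st)==SQLITE_ROW ){
    printf("page %d: %d bytes\n",
           sqlite3_column_int(st,0), sqlite3_column_int(st,1));
  }
  sqlite3_finalize(st);
  sqlite3_prepare_v2(db,
      "SELECT pgno, cell, field, quote(value) FROM sqlite_dbdata",
      -1, &st, 0);
  while( sqlite3_step(st)==SQLITE_ROW ){
    printf("page %d cell %d field %d = %s\n",
           sqlite3_column_int(st,0), sqlite3_column_int(st,1),
           sqlite3_column_int(st,2),
           (const char*)sqlite3_column_text(st,3));
  }
  sqlite3_finalize(st);
}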

That's as good as it gets without parsing pages yourself. FWIW. --DD

(9) By anonymous on 2023-08-29 09:49:56 in reply to 7 [link] [source]

Great ideas to explore.
Thank you!

(8) By anonymous on 2023-08-29 09:49:00 in reply to 1 [link] [source]

A question for @drh.
Is this topic useful for sqlar? Or did it come to your mind but seem not useful?

I would be happy to hear any opinions on the topic; to be clear: storing blobs, preferably chunked (to avoid the size limit and excessive memory usage), while keeping the overhead of overflow pages down.

(16) By ddevienne on 2023-08-29 14:13:00 in reply to 8 [source]

I would be happy to hear any opinions on the topic

I've long wished for, and advocated on this list over the years, better blob support.

SQLite blob support works really well for small blobs (a few MBs or less).

But I also need to store very large blobs, and to have random access into (large) blobs.

The latter is slow because seeking into a large blob means walking its chain of overflow pages.
And very large blobs (> 1GB) need to be chunked manually (see the sketch below).
And the usual advice of putting blobs at the end of the row, or storing blobs in separate tables,
does not solve the issue of blob values being owned by a single parent row,
and deletes are only managed properly if FKs are enabled, when they are OFF by default.
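To make "chunked manually" concrete, the workaround usually looks something like this (my own sketch; the chunks table, CHUNK_SIZE and read_chunked are made-up names):

#include <sqlite3.h>

/* One row per fixed-size chunk of a logical blob:
**   CREATE TABLE chunks(blob_id INT, seq INT, data BLOB,
**                       PRIMARY KEY(blob_id, seq));
** Reads nByte bytes at logical offset iOff of blob iBlob into pOut. */
#define CHUNK_SIZE (1024*1024)

static int read_chunked(sqlite3 *db, sqlite3_int64 iBlob,
                        sqlite3_int64 iOff, char *pOut, int nByte){
  while( nByte>0 ){
    sqlite3_int64 seq = iOff / CHUNK_SIZE;
    int ofst = (int)(iOff % CHUNK_SIZE);
    int n = nByte < CHUNK_SIZE-ofst ? nByte : CHUNK_SIZE-ofst;
    sqlite3_stmt *st;
    sqlite3_blob *pBlob;
    sqlite3_int64 rowid;
    int rc;
    /* Locate the chunk row, then use incremental blob I/O to read
    ** only the part of it that is needed. */
    sqlite3_prepare_v2(db,
        "SELECT rowid FROM chunks WHERE blob_id=?1 AND seq=?2",
        -1, &st, 0);
    sqlite3_bind_int64(st, 1, iBlob);
    sqlite3_bind_int64(st, 2, seq);
    if( sqlite3_step(st)!=SQLITE_ROW ){
      sqlite3_finalize(st);
      return SQLITE_ERROR;
    }
    rowid = sqlite3_column_int64(st, 0);
    sqlite3_finalize(st);
    rc = sqlite3_blob_open(db, "main", "chunks", "data", rowid, 0, &pBlob);
    if( rc!=SQLITE_OK ) return rc;
    rc = sqlite3_blob_read(pBlob, pOut, n, ofst);
    sqlite3_blob_close(pBlob);
    if( rc!=SQLITE_OK ) return rc;
    pOut += n;
    iOff += n;
    nByte -= n;
  }
  return SQLITE_OK;
}

This keeps memory bounded, but it is bookkeeping SQLite could in principle do itself, which is the wish above.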

In the past I've pointed to Oracle's two-stage approach, where the top-level in-row blob
is just an array of pointers (page numbers) to its "overflow pages", and those overflow pages
are a new type of page that stores the bytes of a single blob value (minus the pageno and possibly
a back-pointer to the page that owns that (new type of) overflow page).

That "index-of-pages" blob value is stored as usual. Appears as a plain blob to the outside.
But is not the real blob value, just page indexes where the value is actually stored.
With such a scheme, you gain random access. There's no room in SQLite's format for another
variable sized type, but it could be opt-in at the schema level, via a new type in the SQL
that maps to blob in the DB, except that blob value is the page index, not the real bytes.

There's also PostgreSQL's scheme of TOAST tables, which transparently does the chunking
and optional compression (of the whole value, though, not of the 2K chunks), but like SQLite is
limited to 1GB (these are bytea columns, equivalent to SQLite's blob, while PostgreSQL's
blob is something else). It's different from SQLite's overflow approach, because substr(bytea)
is optimized for efficient random access (provided the bytea is NOT compressed), unlike SQLite.

Better blobs (i.e. for large blobs) would IMHO benefit Fossil, SQLar, and users like me.

I still continue to wish / hope for them to arrive one day :). FWIW.

(17) By anonymous on 2023-08-29 15:36:04 in reply to 16 [link] [source]

As you note, there's no space to extend the file format. Also, you'd need to revamp the intermediate levels to add support for substringing without reading the whole value into memory. This adds up to SQLite 5....

(18) By anonymous on 2023-08-29 15:44:34 in reply to 1 [link] [source]

To simplify the problem, I am trying to find the optimal size of a BLOB that:
- Creates just one overflow page and uses all of it
- Allocates as little as possible on the leaf page.


Reading the documentation at https://www.sqlite.org/fileformat2.html#b_tree_pages
shows that for a database with 4096-byte pages and no reserved bytes:
U = PageSize - ReservedBytes = 4096 - 0 = 4096
X = U - 35 = 4096 - 35 = 4061
M = ((U-12)*32/255) - 23 = 489.5019
K = M + ((P-M)%(U-4)) = ? (we do not know P yet)

It is stated that each overflow page has 4 bytes at the beginning (the next-overflow-page pointer), and the rest is content, so the useful bytes per overflow page are:
O = U - 4 = 4096 - 4 = 4092

From the documentation, I think I want this part:
"If P>X and K>X then the first M bytes of P are stored on the btree page and the remaining P-M bytes are stored on overflow pages."

From what I understand, the optimal size of P must be:
P = O + M = 4092 + 489.5019 = 4581.5019 (what fits in one overflow page, plus the minimum that must stay on the leaf)
I can interpret this value as 4581 or 4582 depending on the rounding, as I do not know where the relevant code is. A quick sketch of the arithmetic is below.
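Written out in C with integer arithmetic (my guess at the rounding the actual code uses):

#include <stdio.h>

int main(void){
  int pagesize = 4096, reserved = 0;
  int U = pagesize - reserved;    /* usable bytes per page */
  int X = U - 35;                 /* max payload stored fully on the leaf */
  int M = ((U-12)*32/255) - 23;   /* min payload kept on the leaf */
  int O = U - 4;                  /* content bytes per overflow page */
  printf("U=%d X=%d M=%d O=%d optimal P=M+O=%d\n", U, X, M, O, M+O);
  return 0;
}

With truncating integer division, M comes out as 489, which would make the optimal P equal to 489 + 4092 = 4581 rather than 4581.5.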

I tried inserting just one record with one blob into a table created as:
CREATE TABLE IF NOT EXISTS "Value" ("Data" BLOB);

And got the stats with:
sqlite3_analyzer-32.exe --stats "test.db"

The result for P = 4581 is:

BEGIN;
CREATE TABLE stats(
  name       STRING,           /* Name of table or index */
  path       INTEGER,          /* Path to page from root */
  pageno     INTEGER,          /* Page number */
  pagetype   STRING,           /* 'internal', 'leaf' or 'overflow' */
  ncell      INTEGER,          /* Cells on page (0 for overflow) */
  payload    INTEGER,          /* Bytes of payload on this page */
  unused     INTEGER,          /* Bytes of unused space on this page */
  mx_payload INTEGER,          /* Largest payload size of all cells */
  pgoffset   INTEGER,          /* Offset of page in file */
  pgsize     INTEGER           /* Size of the page */
);
INSERT INTO stats VALUES('sqlite_schema','/',1,'leaf',1,56,3928,56,0,4096);
INSERT INTO stats VALUES('Value','/',2,'leaf',1,492,3587,4584,4096,4096);
INSERT INTO stats VALUES('Value','/000+000000',3,'overflow',0,4092,0,0,4096,4096);
COMMIT;

It shows no unused bytes on page 3, the overflow page.

But out of curiosity, I checked every P from 4577 to 4582; here are the results for pages 2 and 3 of table Value:

P = 4577:
INSERT INTO stats VALUES('Value','/',2,'leaf',1,489,3590,4580,4096,4096);
INSERT INTO stats VALUES('Value','/000+000000',3,'overflow',0,4091,1,0,4096,4096);
There is 1 unused byte on page 3, and 3590 unused bytes on page 2.

P = 4578:
INSERT INTO stats VALUES('Value','/',2,'leaf',1,489,3590,4581,4096,4096);
INSERT INTO stats VALUES('Value','/000+000000',3,'overflow',0,4092,0,0,4096,4096);
There are 0 unused bytes on page 3, and 3590 unused bytes on page 2.

P = 4579:
INSERT INTO stats VALUES('Value','/',2,'leaf',1,490,3589,4582,4096,4096);
INSERT INTO stats VALUES('Value','/000+000000',3,'overflow',0,4092,0,0,4096,4096);
There are 0 unused bytes on page 3, and 3589 unused bytes on page 2.

P = 4580:
INSERT INTO stats VALUES('Value','/',2,'leaf',1,491,3588,4583,4096,4096);
INSERT INTO stats VALUES('Value','/000+000000',3,'overflow',0,4092,0,0,4096,4096);
There are 0 unused bytes on page 3, and 3588 unused bytes on page 2.

P = 4581:
INSERT INTO stats VALUES('Value','/',2,'leaf',1,492,3587,4584,4096,4096);
INSERT INTO stats VALUES('Value','/000+000000',3,'overflow',0,4092,0,0,4096,4096);
There are 0 unused bytes on page 3, and 3587 unused bytes on page 2.

P = 4582:
INSERT INTO stats VALUES('Value','/',2,'leaf',1,493,3586,4585,4096,4096);
INSERT INTO stats VALUES('Value','/000+000000',3,'overflow',0,4092,0,0,4096,4096);
There are 0 unused bytes on page 3, and 3586 unused bytes on page 2.

The optimal value seems to be 4578, as lower values leave unused space on page 3 and higher ones take more from page 2.


Now I am curious: was my math wrong, did I make a mistake somewhere else, or is there undocumented behavior in SQLite's overflow handling?

(19) By Dan Kennedy (dan) on 2023-08-29 20:20:43 in reply to 18 [link] [source]

  P = 4578:
  INSERT INTO stats VALUES('Value','/',2,'leaf',1,489,3590,4581,4096,4096);
  INSERT INTO stats VALUES('Value','/000+000000',3,'overflow',0,4092,0,0,4096,4096);
  There are 0 unused bytes on page 3, and 3590 unused bytes on page 2.

This entry looks like it has a payload of 4581 bytes, making your arithmetic correct. The payload is an SQLite record. If you insert a 4578 byte blob into your table, I think the record size is 4581 bytes - a single byte for the size-of-header field, then a 2-byte serial-type for the blob, then the blob itself:

https://www.sqlite.org/fileformat2.html#record_format

Dan.

(20) By anonymous on 2023-08-30 11:25:03 in reply to 19 [link] [source]

Thank you for pointing that out!
Can you explain how you knew to add 3?

Please correct me if I am wrong, but what I understand from the documentation is that the record format is:
Header:
- Size in bytes, stored as a varint.
- An array of "serial type" varints, one per column.
Body:
- The value of each column.

From my understanding of the documentation and your post, the actual size is "serial type size" + "value size".

For a BLOB, the length is (N-12)/2, N being the serial type.
For this sample, the length is 4578 bytes, so:
L = (N-12)/2 = 4578
Then we can calculate N as:
N = (L*2)+12 = 9168
This number needs 14 bits, and as far as I understand the varint encoding, it stores 7 bits per byte (for the first eight bytes). So this 14-bit number needs two bytes to store. A sketch of the calculation follows.
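For reference, a tiny helper along the lines of how I read the varint rules (my own sketch):

#include <stdint.h>

/* Number of bytes needed to encode v as an SQLite varint:
** 7 bits per byte for the first eight bytes, 8 bits in the ninth. */
static int varint_len(uint64_t v){
  int n = 1;
  while( v > 0x7f && n < 9 ){ v >>= 7; n++; }
  return n;
}
/* varint_len(9168) == 2 */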

Size should be:
4578 + 2 = 4580
But you showed that it is 4581.

Where does the missing byte come from? What is the size-of-header?

(21) By Dan Kennedy (dan) on 2023-08-30 11:45:12 in reply to 20 [link] [source]

Please correct me if I am wrong, but what I understand from the documentation is that the record format is:

Header:

  • Size in bytes, stored as a varint.
  • An array of "serial type" varints, one per column.

Body:

  • The value of each column.

That's correct.

Where does the missing byte come from? What is the size-of-header?

The "size in bytes, stored as a varint" field at the start of the record format is the size of the record header only - up to the end of the array of serial types. In this case 3 bytes (1 for itself and 2 for the serial_type of the blob). 3 as a varint consumes 1 byte.

Dan.

(22) By anonymous on 2023-08-30 11:55:36 in reply to 21 [link] [source]

So the payload is the whole record without the key (in this case ROWID)?
Then if I have multiple columns besides the BLOB one, I should get their sizes and calculate the payload as described to find the actual payload size?

If that is the case, as far as I know there is no SizeOf() function to get the size of each value, correct?
What would be the best way to calculate the payload size? Doing custom math for each value?

(23) By Gunter Hick (gunter_hick) on 2023-08-30 12:29:24 in reply to 22 [link] [source]

Note that the values NULL, 0, 1, the empty string and the empty blob do not take up any space in the record body (only their serial types appear in the header). Changing a value between that set and anything else will affect the length of the payload and invalidate the previous BLOB segmentation calculation.

(24) By anonymous on 2023-08-30 12:32:08 in reply to 23 [link] [source]

Yes.
But I am still curious: what is the best way to calculate the payload size?

(25) By Dan Kennedy (dan) on 2023-08-30 17:17:46 in reply to 22 [link] [source]

So the payload is the whole record without the key (in this case ROWID)?

That's correct. For both table and index b-trees, the "payload size" is the size in bytes of the database record stored in the entry:

https://www.sqlite.org/fileformat2.html#record_format

Dan.

(26) By anonymous on 2023-08-30 20:19:59 in reply to 25 [link] [source]

Thank you.

Is there a good way to calculate the record or payload size exactly, in bytes?
I'm afraid it is not going to be easy, because as far as I know SQLite converts some values when it can (e.g. storing a REAL as an INTEGER).

I really wish there were a simple way to get a value or record size right now.

(27) By anonymous on 2023-08-31 04:29:06 in reply to 26 [link] [source]

Is there a good way to calculate the record or payload size exactly, in bytes?

Duplicating everything SQLite itself does when assembling records. No shortcuts are available.
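To give an idea of what that duplication looks like, a simplified sketch (my own; it handles only one INTEGER column plus one BLOB column of a rowid table, and ignores edge cases such as the REAL-to-INTEGER storage optimization mentioned above):

#include <stdint.h>

/* Bytes needed to encode v as an SQLite varint. */
static int varint_len(uint64_t v){
  int n = 1;
  while( v > 0x7f && n < 9 ){ v >>= 7; n++; }
  return n;
}

/* Record-body bytes for an integer value (serial types 1-6, 8, 9). */
static int int_body_len(int64_t i){
  uint64_t u;
  if( i==0 || i==1 ) return 0;          /* serial types 8 and 9 */
  u = i<0 ? ~(uint64_t)i : (uint64_t)i;
  if( u<=0x7f ) return 1;
  if( u<=0x7fff ) return 2;
  if( u<=0xffffff ) return 3;
  if( u<=0x7fffffff ) return 4;
  if( u<=0xffffffffffff ) return 6;
  return 8;
}

/* Payload size of a record holding (intval, blob of bloblen bytes).
** An integer's serial type is a single digit, so its varint is 1 byte;
** the blob's serial type is 2*bloblen+12. The header-size varint
** itself fits in 1 byte for this few columns. */
static int64_t payload_size(int64_t intval, int64_t bloblen){
  int64_t hdr = 1 + 1 + varint_len((uint64_t)(2*bloblen+12));
  return hdr + int_body_len(intval) + bloblen;
}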

If avoiding half-full pages really is all-important, a general-purpose database is probably the wrong tool for your application.

(28) By Warren Young (wyoung) on 2023-08-31 07:40:26 in reply to 27 [link] [source]

I feel this is a trivially-true reply in a world where our application programs are linked to SQLite itself. Nothing needs to be duplicated. The answer is in there, somewhere.

Whether the SQLite devs choose to expose this information by an API or whatever is another question entirely.

(29) By anonymous on 2023-08-31 08:02:05 in reply to 28 [link] [source]

Exactly.
SQLite has been, and still is, a very suitable file format; and since size is critical, especially for network-intensive projects, wasted space is a problem for such applications and use cases.

One big part and benefit of SQLite is the pager, and finding a way to use it for such needs would be a big win.

(30) By anonymous on 2023-08-31 09:28:22 in reply to 29 [link] [source]

dbstat.c has the calculation for unused bytes, so one can use that to know how much space is available for the next record and split the BLOB to use exactly that space; see the sketch below.
But one question remains: how can one calculate the size of a cell before inserting it?
It is possible to write custom functions for that, but as SQLite does some internal magic, including varints and type conversions (REAL to INTEGER), I guess it will not be an easy task.
Is there any internal function that helps with this?
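For the free-space part, a sketch using the DBSTAT virtual table (it needs a build with SQLITE_ENABLE_DBSTAT_VTAB, as in the CLI; the table name "Value" follows the earlier example):

#include <stdio.h>
#include <sqlite3.h>

/* Print the unused bytes on each leaf page of table "Value". */
static void show_free_space(sqlite3 *db){
  sqlite3_stmt *st;
  sqlite3_prepare_v2(db,
      "SELECT pageno, unused FROM dbstat "
      "WHERE name='Value' AND pagetype='leaf'", -1, &st, 0);
  while( sqlite3_step(st)==SQLITE_ROW ){
    printf("page %d: %d unused bytes\n",
           sqlite3_column_int(st, 0), sqlite3_column_int(st, 1));
  }
  sqlite3_finalize(st);
}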


To be clear, I need these:
- A way to know how much free space is available in the current page, so I can split the blob accordingly.
- A way to calculate the size of a row/cell, to make a chunk of blob that just fits that free space.

I would appreciate any help.

(31) By anonymous on 2023-08-31 11:12:29 in reply to 30 [link] [source]

You seem to be assuming that there's always a single well-defined current page to be filled. That only applies if you never delete anything (except possibly the most recently added blob).

(32) By anonymous on 2023-08-31 12:10:44 in reply to 31 [link] [source]

Yes, I assume SQLite always adds to some page; whatever that page is, it knows where it wants to insert the cell, and it knows the size of the cell too.