SQLite Forum

SQLite3 select (with table scan) not saturating NVMe throughput

(1) By anonymous on 2021-12-11 21:53:25 [link] [source]

Hello,

I have a 132GB SQLite3 database, all data in a single table, sitting on a Sabrent Rocket NVMe device. Using Crystal Disk Mark, I have measured maximum read performance at around 3722 MB/s.

If I select rows sequentially via SQLite3, however, the read speed maxes out at around 370 MB/s. I have verified the following:

  1. This limit is the same whether running SQLite3 from the command line or via the Python 3 API.
  2. The query plan consists of nothing more than --SCAN

I'm not sure what the cause of the bottleneck is. Do I have mistuned pragmas or is SQLite3 not able to take full advantage of NVMe architecture?

My disk throughput is at 10% utilization, and none of my CPU cores appears maxed.
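
For reference, a minimal sketch of one way to reproduce the measurement through the Python 3 API (the path is a placeholder; it simply approximates throughput as database size divided by scan time):

    # Rough sketch only, not the exact script used: time a full table scan
    # and derive MB/s from the size of the database file.
    import os
    import sqlite3
    import time

    db_path = r"c:\path\to\file.sqlite3"   # placeholder path

    conn = sqlite3.connect(db_path)
    start = time.perf_counter()
    conn.execute("SELECT sum(1) FROM subdoc").fetchone()
    elapsed = time.perf_counter() - start

    size_mb = os.path.getsize(db_path) / (1024 * 1024)
    print(f"scanned {size_mb:.0f} MB in {elapsed:.1f} s -> {size_mb / elapsed:.0f} MB/s")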

Any pointers would be appreciated!

(2.1) By Scott Robison (casaderobison) on 2021-12-11 22:37:37 edited from 2.0 in reply to 1 [link] [source]

It would be useful to see the schema and the query to have any hope of offering advice on how to increase performance. I suspect you just need a better index so the query planner can process data more effectively.

Edit: Oh, I see now your concern has more to do with the amount of bandwidth consumed, not the speed of the query. I still think more information will be useful, but I doubt any program that processes data will ever be able to max out the throughput of such a device. It takes time to read one or more pages to find the data, unpacking the rows along the way to do something useful with them.

(4) By anonymous on 2021-12-11 22:46:50 in reply to 2.1 [link] [source]

I doubt any program that processes data will ever be able to max out the throughput of such a device. It takes time to read one or more pages to find the data, unpacking the rows along the way to do something useful with them.

Thanks for the reply. The table is simple (column names changed), and only aaa and ccc are populated:

CREATE TABLE subdoc
    (
        aaa integer,
        bbb integer,
        ccc blob,
        ddd blob,
        eee blob,
        PRIMARY KEY(aaa),
        FOREIGN KEY(bbb) REFERENCES doc(id)
    );

The query itself is very simple:

SELECT sum(1) FROM subdoc;

The plan is about what I expect:

> explain query plan SELECT sum(1) FROM subdoc;
QUERY PLAN
`--SCAN subdoc

Replacing this query with other, more complicated queries doesn't seem to affect the maximum read throughput of 370 MB/s.

(3) By Keith Medcalf (kmedcalf) on 2021-12-11 22:39:59 in reply to 1 [source]

This sounds like a hardware issue (you were sold a "bill of goods" because you did not ask the right questions).

Nor did you mention the Operating System, OS version, or any particulars about the computer system itself (such as the particular CPU and chipset), all of which are relevant to the question asked.

That said, unless you paid a lot of money and purchased your hardware from a very reliable source, it is quite likely that your NVMe device is connected to the low-bandwidth shared PCIe expansion bus and not to CPU-connected PCIe lanes (this is typical of all Intel CPUs until you get to the very tippy toppy end -- they simply do not have the PCIe lanes available to obtain any meaningful performance whatsoever).

This is like in the "old days" when many manufacturers built "gigabit" ethernet adapters that used the USB 1.0 bus interface (which cannot even sustain the bandwidth of StarLAN, let alone 10 megabit ethernet), and sold them as "10/100/1000" network adapters. Yes, they did comply with the gigabit ethernet specification, but could only move a couple of Mbps.

Similarly, hard drives attached to a "super fast ultra whupputy diggity dog" SATA interface were sold as being ultra fast (because of the raw spec of an unencumbered SATA controller) but had an internal transfer rate that was very slow.

Benchmark utilities (and the people who rely on them) usually do not measure real-world performance. They make a very doctored (as in arranged) benchmark in order to part the foolish with their money.

The same game of "fool the foolish" is played by manufacturers in the RAM market in order to sell common very low speed RAM with crappy access characteristics as if it were high performance RAM.

Similarly, utilizing multiple CPU cores requires (a) a CPU that can run multiple cores at once (this is often a super-secret limitation of Intel CPUs that they do not want you to know about), (b) software which can execute on multiple cores/threads simultaneously, and (c) decent I/O handling.

So no, there is no inherent limitation in SQLite3 (nor could there be) which would limit its throughput (there are no "sleep(10000 - versionnumber)" calls scattered about to make each version appear faster than the previous, as some software companies do (I will not mention Microsoft by name)).

When dinosaurs ruled the earth, the usual cause was also defective hardware implementation, often caused by nutbars putting high-throughput devices onto low-capability channels, or mixing channel usage inappropriately.

(5) By Scott Robison (casaderobison) on 2021-12-11 22:47:49 in reply to 3 [link] [source]

This sounds like a hardware issue (you were sold a "bill of goods" because you did not ask the right questions).

The OP writes that the stated performance has been measured by the OP, so it seems unlikely that manufacturer marketing is responsible for this situation. Possible, but unlikely. I think there just isn't enough information surrounding the use case, schema, and query to make any determination.

(6) By anonymous on 2021-12-11 22:54:05 in reply to 5 [link] [source]

Correct, a little more information:

  • Windows 10 Pro 10.0.19043 Build 19043
  • X570 Aorus Master motherboard
  • AMD Ryzen 7 3800X 8-Core, 3900 MHz
  • American Megatrends F3, 6/6/2019 BIOS
  • 64.0 GB RAM, 48.7 GB available during testing

I obtained the speed of 3722.38 MB/s using CrystalDiskMark 9.0.4 x64 with sequential 1MiB reads (8 queues). Sequential 1MiB reads with 1 queue gave me 2539.66 MB/s.

With random 4kb reads (32 queues) I achieve a read speed of 551.71 MB/s. With random 4kb reads (1 queue), I achieve a read speed of only 56.62 MB/s.

Hopefully that helps.

(7) By Keith Medcalf (kmedcalf) on 2021-12-12 00:10:03 in reply to 6 [link] [source]

An AMD Ryzen 7 CPU and X570 chipset will have directly CPU-attached PCIe lanes, eliminating the multiplexed PCIe bottlenecks commonly present in Intel PCIe architecture.

This leaves the Operating System, Microsoft Windows, or the NVMe device or drivers themselves. Which NVMe driver are you using, what storage device, and what are the Windows Device and IO settings?

(8) By Simon Slavin (slavin) on 2021-12-12 00:11:20 in reply to 6 [link] [source]

My guess is that the difference between raw reading speed (what you'd get if you read a long text file) and what SQLite does is locking.

Can you try opening the file as read-only? If that doesn't change anything, can you try opening the file in immutable mode? One of the following may be useful.

https://sqlite.org/c3ref/open.html

https://sqlite.org/uri.html
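
For example, through the Python 3 API mentioned earlier, both modes can be selected with URI filenames (a sketch only; the path is a placeholder):

    import sqlite3

    # Sketch: open the same database read-only, then immutable, via URI filenames.
    ro = sqlite3.connect("file:///c:/path/to/file.sqlite3?mode=ro", uri=True)
    imm = sqlite3.connect("file:///c:/path/to/file.sqlite3?immutable=1", uri=True)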

(10) By Scott Robison (casaderobison) on 2021-12-12 00:35:13 in reply to 6 [link] [source]

Given the numbers you report for various tests, I think we can safely say that SQLite is in line with what you are seeing. SQLite will be doing random reads, even if the pages are contiguous in the file. The fact that it is outperforming 4kb 1 queue random reads demonstrates that it is relatively efficient.

What size pages are you using? I don't remember reading that previously.

It would be interesting to know if vacuuming the database would improve performance.
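
(For reference, a rough sketch of checking and changing the page size through the Python 3 API; a new page size only takes effect when a VACUUM rebuilds the file, and the path is a placeholder.)

    import sqlite3

    # Sketch: check the current page size, then rebuild the database with 64 KiB pages.
    conn = sqlite3.connect(r"c:\path\to\file.sqlite3", isolation_level=None)  # autocommit, so VACUUM is allowed
    print(conn.execute("PRAGMA page_size").fetchone()[0])  # current page size in bytes
    conn.execute("PRAGMA page_size = 65536")                # applied by the next VACUUM
    conn.execute("VACUUM")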

(13) By tensor21 on 2021-12-12 06:17:32 in reply to 10 [link] [source]

Thank you all for the helpful responses. I'm going to collect my replies to all the comments here.

TL;DR increasing the page size to 65536, the maximum allowed according to the docs, improved read speeds to as high as 900 MB/s!

What helped

increase the page size to something huge, like a megabyte or more

What size pages are you using? I don't remember reading that previously.

I did the following:

    PRAGMA page_size = 65536;
    VACUUM;

Other things that didn't seem to help

The other changes didn't seem to make a significant difference, meaning that the read speed stayed in the original range of 350-400 MB/s. Details below:

Can you try opening the file as read-only?

Opening the file as read-only didn't make much difference in top speed.

    sqlite3 "file:///c:/path/to/file.sqlite3?mode=ro"

For reference, this is the SQL I ran:

    SELECT sum(1) FROM subdoc;

I used a simple statement that triggers a full table scan. More complex statements seem to hit the same limits.

If that doesn't change anything, can you try opening the file in immutable mode?

I tried immutable, but there was no change in the speed:

    sqlite3 "file:///c:/path/to/file.sqlite3?immutable=1"

I tried running VACUUM. (Tangentially, during the vacuum process I simultaneously observed reads of 200-250 MB/s and writes of 200-250 MB/s.)

Even after vacuuming and with immutable=1, read speeds remained in the original range.

increase the page cache size so that non-leaf pages are only read once

I tried the following but observed no difference:

    PRAGMA cache_size = -10000;
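    -- a negative value is a size in KiB, so this asks for roughly a 10 MB page cache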

Regarding Keith and Scott's comments:

The fact that it is outperforming 4kb 1 queue random reads demonstrates that it is relatively efficient. Also, without special drivers Windows does not handle scatter-gather, multiple command presentation (queueing), or block devolution/aggregation from a single thread.

If I have time, I might try RocksDB as a point of comparison. It would be interesting to see if they hit the same throughput limits.

Notes on the schema

Scott Robison asked about the schema itself. Here it is (field names changed):

    CREATE TABLE subdoc
        (
            aaa integer,
            bbb integer,
            ccc blob,
            ddd blob,
            eee blob,
            PRIMARY KEY(aaa),
            FOREIGN KEY(bbb) REFERENCES doc(bbb)
        );

Only the fields aaa and ccc are set. ccc contains about 1.2 kb of data per row. The foreign key reference is actually bogus, since no table named doc exists in the database.

As far as the actual query I ran, I began with something a little more complex, and when I noticed the read throughput limit, I tried to simplify the query as far as I could, to see the maximum speed I could achieve on a table scan:

    > explain query plan SELECT sum(1) FROM subdoc;
    QUERY PLAN
    `--SCAN subdoc

(16) By Scott Robison (casaderobison) on 2021-12-12 15:28:15 in reply to 13 [link] [source]

Only the fields aaa and ccc are set. ccc contains about 1.2 kb of data per row.

I believe this probably best explains your speed numbers. Someone will correct me if I'm wrong, but the way SQLite encodes data tries to keep all the row data together, only spilling over to other pages if necessary. A row of data in subdoc is going to be about 1.3k including overhead for empty columns. I believe the default page size is currently 4k, so you're only going to be fitting about 3 rows per page. With 64k pages, that will be improved to about 49 rows per page. More rows in each page means fewer pages and thus fewer page reads.
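
If you want to confirm the packing directly, a sketch along these lines should work, assuming your SQLite build includes the dbstat virtual table (SQLITE_ENABLE_DBSTAT_VTAB); the path is a placeholder:

    import sqlite3

    # Sketch: average number of rows per leaf page of subdoc, via the dbstat virtual table.
    conn = sqlite3.connect(r"c:\path\to\file.sqlite3")   # placeholder path
    avg_rows = conn.execute(
        "SELECT avg(ncell) FROM dbstat WHERE name = 'subdoc' AND pagetype = 'leaf'"
    ).fetchone()[0]
    print(f"average rows per leaf page: {avg_rows:.1f}")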

(9) By JayKreibich (jkreibich) on 2021-12-12 00:12:51 in reply to 1 [link] [source]

SQLite files are not read sequentially, even when doing a table scan, and many SSDs exhibit access times that look like seek times. You want to reduce those as much as possible.

These are kind of random stabs in the dark, but might prove useful:

  1. increase the page size to something huge, like a megabyte or more

  2. VACUUM the database to make sure the records and pages are as in-order as possible

  3. increase the page cache size so that non-leaf pages are only read once

If none of that works, also look at memory-mapped I/O.
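
For the memory-mapped option, something along these lines (a sketch only; the 1 GiB value is arbitrary, and the effective limit is capped by the build's SQLITE_MAX_MMAP_SIZE):

    import sqlite3

    # Sketch: allow SQLite to memory-map up to 1 GiB of the database file.
    conn = sqlite3.connect(r"c:\path\to\file.sqlite3")   # placeholder path
    conn.execute("PRAGMA mmap_size = 1073741824")        # bytes; 0 turns mmap off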

(15) By tensor21 on 2021-12-12 06:22:20 in reply to 9 [link] [source]

Thanks for the suggestions. Point (1) ended up making the most difference and allowed me to hit 900 MB/s reads. I wrote up the details in my reply to Scott.

(11) By Keith Medcalf (kmedcalf) on 2021-12-12 00:36:19 in reply to 1 [link] [source]

This limit is the same whether running SQLite3 from the command line or via the Python 3 API.

This is expected since both are single-threaded and thus cannot use more than one "thread of execution" on the CPU. Also, without special drivers Windows does not handle scatter-gather, multiple command presentation (queueing), or block devolution/aggregation from a single thread.

The benchmark testing does direct I/O and bypasses most of the Windows I/O subsystems so that it can perform queueing and multithread presentation (which I/O originating from the Windows Filesystems will not do).

(12) By Keith Medcalf (kmedcalf) on 2021-12-12 03:32:58 in reply to 11 [link] [source]

I did some testing on Windows 10 Pro for Workstations.

I/O write (continuous append) to a Samsung SSD (NVMe) was about 280 MB/s and read (table scan) was 380 MB/s. This used one entire "CPU thread" worth of CPU at 3.9 GHz.

I/O write (continuous append) to a Samsung SATA III SSD (with RAPID installed so that it can do proper queueing and scatter/gather) was slightly better at 340 MB/s write, and read was about 512 MB/s. This consumed the same CPU (one execution thread at 100% at 3.9 GHz).

Doing the same on a PCIe x4 (Thunderbolt) attached RAID array with "standard spinning rust" performed the write at ~240 MB/s and read at ~340 MB/s.

So your result seems in line with expected CPU and I/O performance on Windows.

(14) By tensor21 on 2021-12-12 06:20:16 in reply to 12 [link] [source]

Thank you for the additional analysis and verification. Please see my reply to Scott for a description of how I hit 900 MB/s on the read.