SQLite User Forum

SQLite and PGO results
Login

SQLite and PGO results

(1) By Alexander Zaitsev (zamazan4ik) on 2023-06-21 17:31:57 [link] [source]

Hi!

I want to share my results of applying Profile-Guided Optimization on SQLite: https://github.com/zamazan4ik/awesome-pgo/blob/main/sqlite.md Since my results do not match the official SQLite post about PGO - https://www.sqlite.org/footprint.html , I want to discuss it here.

More information about other PGO applications on other software can be found here: https://github.com/zamazan4ik/awesome-pgo/

I hope my results would be helpful.

(2.1) By ddevienne on 2023-06-22 07:15:28 edited from 2.0 in reply to 1 [link] [source]

Hi. You write

measured improvement is quite consistent

but you also make it difficult to quantify that statement.

Analyzing a large raw log of one followed by the other, thus trying to compare very distant textual output, is not something (regular) humans are good at.

So I suggest you make that easier to compare somehow, which might lead to more/better engagement here. My $0.02.

(3) By Alexander Zaitsev (zamazan4ik) on 2023-06-22 13:49:11 in reply to 2.1 [link] [source]

I agree. Locally for comparison I usually use VSCode's Compare mode but I understand there are better ways for doing the comparison. I will reformat the results and post them again.

(4) By Alexander Zaitsev (zamazan4ik) on 2023-06-22 14:17:34 in reply to 2.1 [link] [source]

Added formatted results to the original document. The formatted results now show only user time. And wanna post the formatted results here as well:

-O3 (formatted) [0m0,299s,0m0,162s,0m0,163s], [0m7,397s,0m7,194s,0m7,181s], [0m9,199s,0m9,295s,0m9,159s], [0m3,574s,0m2,540s,0m2,643s], [0m21,520s,0m21,529s,0m21,530s], [0m45,183s,0m44,994s,0m45,377s], [0m5,866s,0m5,116s,0m4,898s], [0m7,509s,0m7,493s,0m7,297s], [0m42,145s,0m41,697s,0m41,715s], [0m49,528s,0m49,511s,0m49,381s], [0m10,243s,0m10,051s,0m10,252s], [0m10,908s,0m10,793s,0m10,908s], [0m14,683s,0m14,661s,0m14,730s], [0m16,249s,0m16,021s,0m16,190s], [0m17,775s,0m17,916s,0m17,701s], [0m24,801s,0m23,608s,0m23,423s], [0m36,505s,0m36,934s,0m36,319s], [0m17,858s,0m17,903s,0m18,063s], [1m2,742s,1m2,954s,1m2,051s], [0m3,034s,0m2,257s,0m2,257s], [0m9,373s,0m9,405s,0m9,358s], [0m9,620s,0m9,252s,0m9,554s], [0m7,490s,0m7,283s,0m7,425s], [0m9,629s,0m9,513s,0m9,353s], [0m8,507s,0m8,508s,0m8,556s], [0m8,573s,0m8,358s,0m8,422s], [0m8,679s,0m8,693s,0m8,600s], [0m17,200s,0m17,153s,0m17,568s], [0m0,000s,0m0,001s,0m0,001s], [2m15,972s,2m15,873s,2m16,670s], [0m15,848s,0m15,461s,0m15,509s], [0m14,449s,0m14,539s,0m14,707s], [0m57,652s,0m57,905s,0m56,327s], [0m46,322s,0m44,699s,0m45,869s], [1m16,081s,1m14,553s,1m14,672s], [0m42,548s,0m42,472s,0m42,595s], [0m0,380s,0m0,379s,0m0,356s], [0m0,292s,0m0,289s,0m0,284s], [0m0,117s,0m0,110s,0m0,100s], [0m0,808s,0m0,809s,0m0,824s], [0m0,157s,0m0,153s,0m0,152s], [0m0,160s,0m0,164s,0m0,169s], [0m0,376s,0m0,370s,0m0,371s],

-O3 + PGO (formatted)

[0m0,335s,0m0,163s,0m0,174s], [0m6,173s,0m6,250s,0m6,217s], [0m7,728s,0m7,670s,0m7,499s], [0m2,605s,0m2,123s,0m2,166s], [0m19,036s,0m17,298s,0m17,388s], [0m38,962s,0m39,521s,0m39,284s], [0m5,078s,0m4,100s,0m4,011s], [0m6,053s,0m6,264s,0m6,173s], [0m36,316s,0m36,457s,0m36,582s], [0m42,903s,0m42,164s,0m42,215s], [0m8,798s,0m8,527s,0m8,466s], [0m9,369s,0m8,822s,0m8,888s], [0m12,719s,0m12,907s,0m13,219s], [0m14,202s,0m14,180s,0m14,187s], [0m15,252s,0m14,959s,0m15,039s], [0m23,279s,0m22,127s,0m22,273s], [0m32,414s,0m31,853s,0m31,891s], [0m15,396s,0m15,423s,0m15,502s], [0m52,562s,0m52,666s,0m52,696s], [0m2,441s,0m2,151s,0m2,133s], [0m8,508s,0m8,507s,0m8,416s], [0m8,529s,0m8,523s,0m8,395s], [0m6,737s,0m6,739s,0m6,636s], [0m8,456s,0m8,370s,0m8,481s], [0m7,274s,0m7,165s,0m7,122s], [0m7,229s,0m7,227s,0m7,211s], [0m7,385s,0m7,253s,0m7,225s], [0m15,455s,0m15,426s,0m15,286s], [0m0,000s,0m0,001s,0m0,000s], [1m31,182s,1m29,984s,1m31,346s], [0m12,523s,0m12,547s,0m12,542s], [0m12,807s,0m12,855s,0m12,647s], [0m51,379s,0m51,459s,0m51,183s], [0m42,858s,0m42,228s,0m41,784s], [1m0,514s,0m59,799s,1m0,279s], [0m36,268s,0m37,337s,0m35,963s], [0m0,325s,0m0,330s,0m0,325s], [0m0,262s,0m0,257s,0m0,259s], [0m0,085s,0m0,104s,0m0,093s], [0m0,665s,0m0,677s,0m0,688s], [0m0,137s,0m0,142s,0m0,126s], [0m0,146s,0m0,142s,0m0,148s], [0m0,301s,0m0,300s,0m0,318s],

Not sure why HTML tags render in different ways here and on GitHub but I guess it's not critical now to interpret the results. You can easily grab the results and check them via your favorite diff viewer. All tested queries are located here: https://github.com/ClickHouse/ClickBench/blob/main/sqlite/queries.sql

(5) By Stephan Beal (stephan) on 2023-06-22 14:49:46 in reply to 4 [link] [source]

Not sure why HTML tags render in different ways here and on GitHub

This forum has no CSS specific to the DETAILS and SUMMARY tags. If you'd like to suggest improvements to the CSS, please post them in the fossil forum and we can explore some options.

(6.1) By SeverKetor on 2023-06-23 21:16:41 edited from 6.0 in reply to 4 [link] [source]

To boil this down to a couple totals: 2614.5s for just O3, 2176.1s for PGO. So about 20% faster

(7) By Alexander Zaitsev (zamazan4ik) on 2023-06-23 13:10:08 in reply to 6.0 [link] [source]

(Just to be sure) Formatted results represent only User time, measured 3 times per query (I ran each query three times). If you want to check Real and Sys time as well - you need to extract it from the raw results (probably with some grep/sed/awk magic from ClickBench repo). If your table shows it - sorry for the annoyance.

(8) By Richard Hipp (drh) on 2023-06-23 13:22:29 in reply to 1 [link] [source]

Though I have not reproduced Mr. Zaitsev's results, I did remove the statement about PGO from the footprint.html document. The SQLite documentation now makes no claim one way or the other about the benefits of PGO.

(9.2) By Alexander Zaitsev (zamazan4ik) on 2023-06-23 20:45:16 edited from 9.1 in reply to 8 [source]

Could you please share more information about your setup and the tests that you performed? If I am somewhere wrong or missed something important detail - please let me know. In my benchmarks I did use just these scripts (https://github.com/ClickHouse/ClickBench/tree/main/sqlite) and the only change between runs was SQLite binaries (Release, Instrumented, Release + PGO).

Definitely, I can try to reproduce the results once again.

(10) By Alexander Zaitsev (zamazan4ik) on 2023-06-30 15:45:05 in reply to 8 [link] [source]

Richard, did you try to reproduce my results? If yes and they are confirmed, could you please add a note about PGO improvements for SQLite to the documentation? Thanks in advance.

(11) By Spindrift (spindrift) on 2023-06-30 16:28:27 in reply to 10 [link] [source]

My interpretation would have been that there was no attempt to reproduce your results; however your contention is sufficient evidence to remove the comment that no benefit from such a process exists.

Or in other words, there is now equipoise.

(12) By Alexander Zaitsev (zamazan4ik) on 2023-07-01 19:28:21 in reply to 11 [link] [source]

Probably you are right. So my current question is - what do I need to do to explicitly mention PGO as a useful optimization technique in the SQLite documentation?

(13) By Simon Slavin (slavin) on 2023-07-02 17:05:29 in reply to 12 [link] [source]

A good start would be to apply a PGO procedure yourself, write up the experience and results yourself, and post it all as an article somewhere on the web. If the comments are favourable then you could post a link here.

However, note that static analysis procedures play badly with SQLite because SQLite is a set of separate API calls which can be called in many different orders. For instance, we get many posts here claiming that static analysis calls spotted memory leak possibilities in individual API calls. But they never happen in real life because a real program would never call those routines in that order. The same problem may affect whatever PGO you think you've spotted because the analysis applies to one API call rather than to a sensibly-written program which makes many calls in a specific order.

(14) By Spindrift (spindrift) on 2023-07-02 17:33:32 in reply to 13 [link] [source]

To be fair, that appears to be mostly what the OP believes themselves to have done in the first post of this thread.

(15.1) By Alexander Zaitsev (zamazan4ik) on 2023-07-03 12:00:25 edited from 15.0 in reply to 13 [link] [source]

A good start would be to apply a PGO procedure yourself, write up the experience and results yourself, and post it all as an article somewhere on the web. If the comments are favourable then you could post a link here.

That's why I am here :) I already applied PGO (the results are posted above), wrote an article about that (a dedicated one about PGO and SQLite, and a "generic" one about PGO in other real-life applications). I posted all this stuff here to get feedback from the most experienced SQLite users - SQLite developers :) If you want to get feedback from a broader community of users before investing more resources into testing PGO on your side or mentioning it somewhere in the documentation - well, that's maybe a fair point. In this case, maybe I need a dedicated post on Medium and/or HackerNews.

About your second comment regarding the static analysis. Profile-Guided Optimization is not a static analysis - that's a runtime analysis. You build SQLite in Instrumentation mode, run it on a "real-life" workload, collect some counters, recompile SQLite once again, and get a faster (according to my tests above) SQLite on a real-life load (in my tests - in ClickBench load). There are some omitted here details like AutoFDO, BOLT, Propeller and others - but it's a bit too deep for this discussion yet (of course you can read about all this stuff more here and I would be happy to answer your questions).

(16) By Spindrift (spindrift) on 2023-07-03 06:34:33 in reply to 15.0 [link] [source]

On the footprint.html page that you reference in the sqlite docs, a specific test load is defined.

It is also the basis of the comparisons posted on that page.

Have you considered rerunning that whole page's analysis (useful in itself potentially) and then additionally providing your PGO data?

This could potentially both allow for updated compile option performance information, and give a more solid comparison for your claims.

It may be that PGO is unhelpful on the speed test load referenced at that page... but in that case you may need to reassess if you just happened to have selected a particularly compelling workload.

Or it may prove equally impressive, in which case that is a direct comparison with the architect's chosen benchmarking process (especially if you demonstrate comparable data from the remaining benchmark).

It would also allow you to present the data in a more consumable form, and the whole process might lend itself to the Medium post you seem.tonhave talked yourself into making.

Good luck, I'm interested to see the outcome!

(17) By Simon Slavin (slavin) on 2023-07-03 07:58:43 in reply to 15.0 [link] [source]

Your points are valid, and my post is regretted. You seem to have a legitimate approach.

(18) By Alexander Zaitsev (zamazan4ik) on 2023-07-04 16:29:22 in reply to 16 [link] [source]

Yep. At first, I want to say that PGO is not a silver bullet - there are cases when PGO does not help (e.g. with Catboost or some IO and/or GPU-bounded loads). So that's completely okay to see in real life that PGO "does not work".

About the SQLite speedtest. I completed the benchmark on the same machine as before and compared Release vs Release + PGO in speedtest bench suite. The results are the following:

Release:

./speedtest_release --shrink-memory --reprepare --stats --heap 10000000 64 --size 200
-- Speedtest1 for SQLite 3.42.0 2023-05-16 12:36:15 831d0fb2836b71c9bc51067c49fe
 100 - 100000 INSERTs into table with no index.....................    0.121s
 110 - 100000 ordered INSERTS with one index/PK....................    0.160s
 120 - 100000 unordered INSERTS with one index/PK..................    0.185s
 130 - 25 SELECTS, numeric BETWEEN, unindexed......................    0.096s
 140 - 10 SELECTS, LIKE, unindexed.................................    0.113s
 142 - 10 SELECTS w/ORDER BY, unindexed............................    0.189s
 145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed..................    0.095s
 150 - CREATE INDEX five times.....................................    0.132s
 160 - 20000 SELECTS, numeric BETWEEN, indexed.....................    0.182s
 161 - 20000 SELECTS, numeric BETWEEN, PK..........................    0.183s
 170 - 20000 SELECTS, text BETWEEN, indexed........................    0.541s
 180 - 100000 INSERTS with three indexes...........................    0.371s
 190 - DELETE and REFILL one table.................................    0.381s
 200 - VACUUM......................................................    0.224s
 210 - ALTER TABLE ADD COLUMN, and query...........................    0.005s
 230 - 20000 UPDATES, numeric BETWEEN, indexed.....................    0.187s
 240 - 100000 UPDATES of individual rows...........................    0.579s
 250 - One big UPDATE of the whole 100000-row table................    0.022s
 260 - Query added column after filling............................    0.005s
 270 - 20000 DELETEs, numeric BETWEEN, indexed.....................    0.625s
 280 - 100000 DELETEs of individual rows...........................    0.591s
 290 - Refill two 100000-row tables using REPLACE..................    0.676s
 300 - Refill a 100000-row table using (b&1)==(a&1)................    0.252s
 310 - 20000 four-ways joins.......................................    1.042s
 320 - subquery in result set......................................    0.782s
 400 - 140000 REPLACE ops on an IPK................................    0.457s
 410 - 140000 SELECTS on an IPK....................................    0.407s
 500 - 140000 REPLACE on TEXT PK...................................    0.453s
 510 - 140000 SELECTS on a TEXT PK.................................    0.462s
 520 - 140000 SELECT DISTINCT......................................    0.116s
 980 - PRAGMA integrity_check......................................    0.771s
 990 - ANALYZE.....................................................    0.046s
       TOTAL.......................................................   10.451s
-- Compile option: ATOMIC_INTRINSICS=1
-- Compile option: COMPILER=clang-16.0.5
-- Compile option: DEFAULT_AUTOVACUUM
-- Compile option: DEFAULT_CACHE_SIZE=-2000
-- Compile option: DEFAULT_FILE_FORMAT=4
-- Compile option: DEFAULT_JOURNAL_SIZE_LIMIT=-1
-- Compile option: DEFAULT_MMAP_SIZE=0
-- Compile option: DEFAULT_PAGE_SIZE=4096
-- Compile option: DEFAULT_PCACHE_INITSZ=20
-- Compile option: DEFAULT_RECURSIVE_TRIGGERS
-- Compile option: DEFAULT_SECTOR_SIZE=4096
-- Compile option: DEFAULT_SYNCHRONOUS=2
-- Compile option: DEFAULT_WAL_AUTOCHECKPOINT=1000
-- Compile option: DEFAULT_WAL_SYNCHRONOUS=2
-- Compile option: DEFAULT_WORKER_THREADS=0
-- Compile option: ENABLE_MEMSYS5
-- Compile option: MALLOC_SOFT_LIMIT=1024
-- Compile option: MAX_ATTACHED=10
-- Compile option: MAX_COLUMN=2000
-- Compile option: MAX_COMPOUND_SELECT=500
-- Compile option: MAX_DEFAULT_PAGE_SIZE=8192
-- Compile option: MAX_EXPR_DEPTH=1000
-- Compile option: MAX_FUNCTION_ARG=127
-- Compile option: MAX_LENGTH=1000000000
-- Compile option: MAX_LIKE_PATTERN_LENGTH=50000
-- Compile option: MAX_MMAP_SIZE=0x7fff0000
-- Compile option: MAX_PAGE_COUNT=1073741823
-- Compile option: MAX_PAGE_SIZE=65536
-- Compile option: MAX_SQL_LENGTH=1000000000
-- Compile option: MAX_TRIGGER_DEPTH=1000
-- Compile option: MAX_VARIABLE_NUMBER=32766
-- Compile option: MAX_VDBE_OP=250000000
-- Compile option: MAX_WORKER_THREADS=8
-- Compile option: MUTEX_PTHREADS
-- Compile option: SYSTEM_MALLOC
-- Compile option: TEMP_STORE=1
-- Compile option: THREADSAFE=1
-- Lookaside Slots Used:        1 (max 137)
-- Successful lookasides:       29023975
-- Lookaside size faults:       40170
-- Lookaside OOM faults:        0
-- Pager Heap Usage:            18944 bytes
-- Page cache hits:             7713089
-- Page cache misses:           5054158
-- Page cache writes:           875633
-- Schema Heap Usage:           9920 bytes
-- Statement Heap Usage:        0 bytes
-- Memory Used (bytes):         0 (max 8118144)
-- Outstanding Allocations:     0 (max 1142)
-- Pcache Overflow Bytes:       0 (max 7454720)
-- Largest Allocation:          2097130 bytes
-- Largest Pcache Allocation:   4368 bytes
-- Bytes received by read():    20855060422
-- Bytes sent to write():       3858800650
-- Read() system calls:         5091600
-- Write() system calls:        998134
-- Bytes rcvd from storage:     0
-- Bytes sent to storage:       263860224
-- Cancelled write bytes:       203350016

Release + PGO:

./speedtest_optimized --shrink-memory --reprepare --stats --heap 10000000 64 --size 200
-- Speedtest1 for SQLite 3.42.0 2023-05-16 12:36:15 831d0fb2836b71c9bc51067c49fe
 100 - 100000 INSERTs into table with no index.....................    0.076s
 110 - 100000 ordered INSERTS with one index/PK....................    0.104s
 120 - 100000 unordered INSERTS with one index/PK..................    0.128s
 130 - 25 SELECTS, numeric BETWEEN, unindexed......................    0.095s
 140 - 10 SELECTS, LIKE, unindexed.................................    0.106s
 142 - 10 SELECTS w/ORDER BY, unindexed............................    0.173s
 145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed..................    0.088s
 150 - CREATE INDEX five times.....................................    0.125s
 160 - 20000 SELECTS, numeric BETWEEN, indexed.....................    0.143s
 161 - 20000 SELECTS, numeric BETWEEN, PK..........................    0.143s
 170 - 20000 SELECTS, text BETWEEN, indexed........................    0.465s
 180 - 100000 INSERTS with three indexes...........................    0.347s
 190 - DELETE and REFILL one table.................................    0.345s
 200 - VACUUM......................................................    0.211s
 210 - ALTER TABLE ADD COLUMN, and query...........................    0.004s
 230 - 20000 UPDATES, numeric BETWEEN, indexed.....................    0.142s
 240 - 100000 UPDATES of individual rows...........................    0.470s
 250 - One big UPDATE of the whole 100000-row table................    0.021s
 260 - Query added column after filling............................    0.004s
 270 - 20000 DELETEs, numeric BETWEEN, indexed.....................    0.591s
 280 - 100000 DELETEs of individual rows...........................    0.490s
 290 - Refill two 100000-row tables using REPLACE..................    0.628s
 300 - Refill a 100000-row table using (b&1)==(a&1)................    0.237s
 310 - 20000 four-ways joins.......................................    1.027s
 320 - subquery in result set......................................    0.771s
 400 - 140000 REPLACE ops on an IPK................................    0.372s
 410 - 140000 SELECTS on an IPK....................................    0.325s
 500 - 140000 REPLACE on TEXT PK...................................    0.355s
 510 - 140000 SELECTS on a TEXT PK.................................    0.359s
 520 - 140000 SELECT DISTINCT......................................    0.104s
 980 - PRAGMA integrity_check......................................    0.717s
 990 - ANALYZE.....................................................    0.038s
       TOTAL.......................................................    9.204s
-- Compile option: ATOMIC_INTRINSICS=1
-- Compile option: COMPILER=clang-16.0.5
-- Compile option: DEFAULT_AUTOVACUUM
-- Compile option: DEFAULT_CACHE_SIZE=-2000
-- Compile option: DEFAULT_FILE_FORMAT=4
-- Compile option: DEFAULT_JOURNAL_SIZE_LIMIT=-1
-- Compile option: DEFAULT_MMAP_SIZE=0
-- Compile option: DEFAULT_PAGE_SIZE=4096
-- Compile option: DEFAULT_PCACHE_INITSZ=20
-- Compile option: DEFAULT_RECURSIVE_TRIGGERS
-- Compile option: DEFAULT_SECTOR_SIZE=4096
-- Compile option: DEFAULT_SYNCHRONOUS=2
-- Compile option: DEFAULT_WAL_AUTOCHECKPOINT=1000
-- Compile option: DEFAULT_WAL_SYNCHRONOUS=2
-- Compile option: DEFAULT_WORKER_THREADS=0
-- Compile option: ENABLE_MEMSYS5
-- Compile option: MALLOC_SOFT_LIMIT=1024
-- Compile option: MAX_ATTACHED=10
-- Compile option: MAX_COLUMN=2000
-- Compile option: MAX_COMPOUND_SELECT=500
-- Compile option: MAX_DEFAULT_PAGE_SIZE=8192
-- Compile option: MAX_EXPR_DEPTH=1000
-- Compile option: MAX_FUNCTION_ARG=127
-- Compile option: MAX_LENGTH=1000000000
-- Compile option: MAX_LIKE_PATTERN_LENGTH=50000
-- Compile option: MAX_MMAP_SIZE=0x7fff0000
-- Compile option: MAX_PAGE_COUNT=1073741823
-- Compile option: MAX_PAGE_SIZE=65536
-- Compile option: MAX_SQL_LENGTH=1000000000
-- Compile option: MAX_TRIGGER_DEPTH=1000
-- Compile option: MAX_VARIABLE_NUMBER=32766
-- Compile option: MAX_VDBE_OP=250000000
-- Compile option: MAX_WORKER_THREADS=8
-- Compile option: MUTEX_PTHREADS
-- Compile option: SYSTEM_MALLOC
-- Compile option: TEMP_STORE=1
-- Compile option: THREADSAFE=1
-- Lookaside Slots Used:        1 (max 137)
-- Successful lookasides:       29023975
-- Lookaside size faults:       40170
-- Lookaside OOM faults:        0
-- Pager Heap Usage:            18944 bytes
-- Page cache hits:             7713089
-- Page cache misses:           5054158
-- Page cache writes:           875633
-- Schema Heap Usage:           9920 bytes
-- Statement Heap Usage:        0 bytes
-- Memory Used (bytes):         0 (max 8118144)
-- Outstanding Allocations:     0 (max 1142)
-- Pcache Overflow Bytes:       0 (max 7454720)
-- Largest Allocation:          2097130 bytes
-- Largest Pcache Allocation:   4368 bytes
-- Bytes received by read():    20855060422
-- Bytes sent to write():       3858800650
-- Read() system calls:         5091600
-- Write() system calls:        998134
-- Bytes rcvd from storage:     0
-- Bytes sent to storage:       263925760
-- Cancelled write bytes:       263843840

So even in this benchmark PGO shows good results.

Command to compile SQLite speedtest was clang speedtest1.c sqlite3.c -DSQLITE_ENABLE_MEMSYS5 -O3 -lpthread -ldl -lm -o speedtest_optimized. Version with PGO - all the same flags but with an additional one -fprofile-instr-use. As a profiling load the same benchmark was used with -fprofile-instr-generate compiler flag.

(19) By Alexander Zaitsev (zamazan4ik) on 2023-07-04 17:16:35 in reply to 16 [link] [source]

Have you considered rerunning that whole page's analysis (useful in itself potentially) and then additionally providing your PGO data?

Well... It depends on what you mean under "PGO data". If you mean "what load was used to prepare a profile" - I always describe a load. In the starting post, I used ClickBench (with the corresponding methodology and dataset (I do not want to copy-paste all it here - in the original repo there is a great description with all scripts). Here I used a guide from "footprint" page (and the corresponding commands).

If you mean PGO profile data file itself (.profraw/.profdata in the case of Clang, .gcda in the case of GCC) - they are not so useful since these files could be incompatible between compiler versions. So describing a profiling load, in general, is much more useful.

(20) By Spindrift (spindrift) on 2023-07-04 17:33:52 in reply to 19 [link] [source]

Very briefly - yes the former (plus the actual results, though I presume that's not contentious!).

Do you have the GCC data to compare? "Vanilla" GCC compilation is quite a common baseline for fossil (afaik).

(21) By Alexander Zaitsev (zamazan4ik) on 2023-07-05 00:17:14 in reply to 20 [link] [source]

Yep, I have results for GCC too (GCC 13.1.1 from Fedora repositories).

Release (gcc speedtest1.c sqlite3.c -DSQLITE_ENABLE_MEMSYS5 -O3 -lpthread -ldl -lm -o speedtest_gcc_release):

./speedtest_gcc_release --shrink-memory --reprepare --stats --heap 10000000 64 --size 200
-- Speedtest1 for SQLite 3.42.0 2023-05-16 12:36:15 831d0fb2836b71c9bc51067c49fe
 100 - 100000 INSERTs into table with no index.....................    0.116s
 110 - 100000 ordered INSERTS with one index/PK....................    0.153s
 120 - 100000 unordered INSERTS with one index/PK..................    0.174s
 130 - 25 SELECTS, numeric BETWEEN, unindexed......................    0.091s
 140 - 10 SELECTS, LIKE, unindexed.................................    0.103s
 142 - 10 SELECTS w/ORDER BY, unindexed............................    0.181s
 145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed..................    0.088s
 150 - CREATE INDEX five times.....................................    0.127s
 160 - 20000 SELECTS, numeric BETWEEN, indexed.....................    0.178s
 161 - 20000 SELECTS, numeric BETWEEN, PK..........................    0.179s
 170 - 20000 SELECTS, text BETWEEN, indexed........................    0.521s
 180 - 100000 INSERTS with three indexes...........................    0.360s
 190 - DELETE and REFILL one table.................................    0.352s
 200 - VACUUM......................................................    0.207s
 210 - ALTER TABLE ADD COLUMN, and query...........................    0.004s
 230 - 20000 UPDATES, numeric BETWEEN, indexed.....................    0.175s
 240 - 100000 UPDATES of individual rows...........................    0.538s
 250 - One big UPDATE of the whole 100000-row table................    0.022s
 260 - Query added column after filling............................    0.004s
 270 - 20000 DELETEs, numeric BETWEEN, indexed.....................    0.622s
 280 - 100000 DELETEs of individual rows...........................    0.572s
 290 - Refill two 100000-row tables using REPLACE..................    0.665s
 300 - Refill a 100000-row table using (b&1)==(a&1)................    0.245s
 310 - 20000 four-ways joins.......................................    1.036s
 320 - subquery in result set......................................    0.773s
 400 - 140000 REPLACE ops on an IPK................................    0.451s
 410 - 140000 SELECTS on an IPK....................................    0.395s
 500 - 140000 REPLACE on TEXT PK...................................    0.435s
 510 - 140000 SELECTS on a TEXT PK.................................    0.459s
 520 - 140000 SELECT DISTINCT......................................    0.111s
 980 - PRAGMA integrity_check......................................    0.746s
 990 - ANALYZE.....................................................    0.040s
       TOTAL.......................................................   10.123s
-- Compile option: ATOMIC_INTRINSICS=1
-- Compile option: COMPILER=gcc-13.1.1 20230614 (Red Hat 13.1.1-4)
-- Compile option: DEFAULT_AUTOVACUUM
-- Compile option: DEFAULT_CACHE_SIZE=-2000
-- Compile option: DEFAULT_FILE_FORMAT=4
-- Compile option: DEFAULT_JOURNAL_SIZE_LIMIT=-1
-- Compile option: DEFAULT_MMAP_SIZE=0
-- Compile option: DEFAULT_PAGE_SIZE=4096
-- Compile option: DEFAULT_PCACHE_INITSZ=20
-- Compile option: DEFAULT_RECURSIVE_TRIGGERS
-- Compile option: DEFAULT_SECTOR_SIZE=4096
-- Compile option: DEFAULT_SYNCHRONOUS=2
-- Compile option: DEFAULT_WAL_AUTOCHECKPOINT=1000
-- Compile option: DEFAULT_WAL_SYNCHRONOUS=2
-- Compile option: DEFAULT_WORKER_THREADS=0
-- Compile option: ENABLE_MEMSYS5
-- Compile option: MALLOC_SOFT_LIMIT=1024
-- Compile option: MAX_ATTACHED=10
-- Compile option: MAX_COLUMN=2000
-- Compile option: MAX_COMPOUND_SELECT=500
-- Compile option: MAX_DEFAULT_PAGE_SIZE=8192
-- Compile option: MAX_EXPR_DEPTH=1000
-- Compile option: MAX_FUNCTION_ARG=127
-- Compile option: MAX_LENGTH=1000000000
-- Compile option: MAX_LIKE_PATTERN_LENGTH=50000
-- Compile option: MAX_MMAP_SIZE=0x7fff0000
-- Compile option: MAX_PAGE_COUNT=1073741823
-- Compile option: MAX_PAGE_SIZE=65536
-- Compile option: MAX_SQL_LENGTH=1000000000
-- Compile option: MAX_TRIGGER_DEPTH=1000
-- Compile option: MAX_VARIABLE_NUMBER=32766
-- Compile option: MAX_VDBE_OP=250000000
-- Compile option: MAX_WORKER_THREADS=8
-- Compile option: MUTEX_PTHREADS
-- Compile option: SYSTEM_MALLOC
-- Compile option: TEMP_STORE=1
-- Compile option: THREADSAFE=1
-- Lookaside Slots Used:        1 (max 137)
-- Successful lookasides:       29023975
-- Lookaside size faults:       40170
-- Lookaside OOM faults:        0
-- Pager Heap Usage:            18944 bytes
-- Page cache hits:             7713089
-- Page cache misses:           5054158
-- Page cache writes:           875633
-- Schema Heap Usage:           9920 bytes
-- Statement Heap Usage:        0 bytes
-- Memory Used (bytes):         0 (max 8118144)
-- Outstanding Allocations:     0 (max 1142)
-- Pcache Overflow Bytes:       0 (max 7454720)
-- Largest Allocation:          2097130 bytes
-- Largest Pcache Allocation:   4368 bytes
-- Bytes received by read():    20855060422
-- Bytes sent to write():       3858800676
-- Read() system calls:         5091600
-- Write() system calls:        998134
-- Bytes rcvd from storage:     0
-- Bytes sent to storage:       263876608
-- Cancelled write bytes:       263843840

Release + PGO (gcc speedtest1.c sqlite3.c -DSQLITE_ENABLE_MEMSYS5 -O3 -fprofile-use=gcc_profiles/ -fprofile-partial-training -fprofile-correction -lpthread -ldl -lm -o speedtest_gcc_instrumented && mv speedtest_gcc_intrumented speedtest_gcc_optimized. You need a mv command since GCC during the compilation trying to search profiles with the same name as a collected binary):

./speedtest_gcc_optimized --shrink-memory --reprepare --stats --heap 10000000 64 --size 200
-- Speedtest1 for SQLite 3.42.0 2023-05-16 12:36:15 831d0fb2836b71c9bc51067c49fe
 100 - 100000 INSERTs into table with no index.....................    0.081s
 110 - 100000 ordered INSERTS with one index/PK....................    0.109s
 120 - 100000 unordered INSERTS with one index/PK..................    0.131s
 130 - 25 SELECTS, numeric BETWEEN, unindexed......................    0.091s
 140 - 10 SELECTS, LIKE, unindexed.................................    0.114s
 142 - 10 SELECTS w/ORDER BY, unindexed............................    0.184s
 145 - 10 SELECTS w/ORDER BY and LIMIT, unindexed..................    0.096s
 150 - CREATE INDEX five times.....................................    0.133s
 160 - 20000 SELECTS, numeric BETWEEN, indexed.....................    0.144s
 161 - 20000 SELECTS, numeric BETWEEN, PK..........................    0.145s
 170 - 20000 SELECTS, text BETWEEN, indexed........................    0.467s
 180 - 100000 INSERTS with three indexes...........................    0.345s
 190 - DELETE and REFILL one table.................................    0.343s
 200 - VACUUM......................................................    0.215s
 210 - ALTER TABLE ADD COLUMN, and query...........................    0.004s
 230 - 20000 UPDATES, numeric BETWEEN, indexed.....................    0.149s
 240 - 100000 UPDATES of individual rows...........................    0.460s
 250 - One big UPDATE of the whole 100000-row table................    0.021s
 260 - Query added column after filling............................    0.005s
 270 - 20000 DELETEs, numeric BETWEEN, indexed.....................    0.574s
 280 - 100000 DELETEs of individual rows...........................    0.486s
 290 - Refill two 100000-row tables using REPLACE..................    0.627s
 300 - Refill a 100000-row table using (b&1)==(a&1)................    0.231s
 310 - 20000 four-ways joins.......................................    0.958s
 320 - subquery in result set......................................    0.723s
 400 - 140000 REPLACE ops on an IPK................................    0.387s
 410 - 140000 SELECTS on an IPK....................................    0.333s
 500 - 140000 REPLACE on TEXT PK...................................    0.367s
 510 - 140000 SELECTS on a TEXT PK.................................    0.366s
 520 - 140000 SELECT DISTINCT......................................    0.103s
 980 - PRAGMA integrity_check......................................    0.708s
 990 - ANALYZE.....................................................    0.040s
       TOTAL.......................................................    9.140s
-- Compile option: ATOMIC_INTRINSICS=1
-- Compile option: COMPILER=gcc-13.1.1 20230614 (Red Hat 13.1.1-4)
-- Compile option: DEFAULT_AUTOVACUUM
-- Compile option: DEFAULT_CACHE_SIZE=-2000
-- Compile option: DEFAULT_FILE_FORMAT=4
-- Compile option: DEFAULT_JOURNAL_SIZE_LIMIT=-1
-- Compile option: DEFAULT_MMAP_SIZE=0
-- Compile option: DEFAULT_PAGE_SIZE=4096
-- Compile option: DEFAULT_PCACHE_INITSZ=20
-- Compile option: DEFAULT_RECURSIVE_TRIGGERS
-- Compile option: DEFAULT_SECTOR_SIZE=4096
-- Compile option: DEFAULT_SYNCHRONOUS=2
-- Compile option: DEFAULT_WAL_AUTOCHECKPOINT=1000
-- Compile option: DEFAULT_WAL_SYNCHRONOUS=2
-- Compile option: DEFAULT_WORKER_THREADS=0
-- Compile option: ENABLE_MEMSYS5
-- Compile option: MALLOC_SOFT_LIMIT=1024
-- Compile option: MAX_ATTACHED=10
-- Compile option: MAX_COLUMN=2000
-- Compile option: MAX_COMPOUND_SELECT=500
-- Compile option: MAX_DEFAULT_PAGE_SIZE=8192
-- Compile option: MAX_EXPR_DEPTH=1000
-- Compile option: MAX_FUNCTION_ARG=127
-- Compile option: MAX_LENGTH=1000000000
-- Compile option: MAX_LIKE_PATTERN_LENGTH=50000
-- Compile option: MAX_MMAP_SIZE=0x7fff0000
-- Compile option: MAX_PAGE_COUNT=1073741823
-- Compile option: MAX_PAGE_SIZE=65536
-- Compile option: MAX_SQL_LENGTH=1000000000
-- Compile option: MAX_TRIGGER_DEPTH=1000
-- Compile option: MAX_VARIABLE_NUMBER=32766
-- Compile option: MAX_VDBE_OP=250000000
-- Compile option: MAX_WORKER_THREADS=8
-- Compile option: MUTEX_PTHREADS
-- Compile option: SYSTEM_MALLOC
-- Compile option: TEMP_STORE=1
-- Compile option: THREADSAFE=1
-- Lookaside Slots Used:        1 (max 137)
-- Successful lookasides:       29023975
-- Lookaside size faults:       40170
-- Lookaside OOM faults:        0
-- Pager Heap Usage:            18944 bytes
-- Page cache hits:             7713089
-- Page cache misses:           5054158
-- Page cache writes:           875633
-- Schema Heap Usage:           9920 bytes
-- Statement Heap Usage:        0 bytes
-- Memory Used (bytes):         0 (max 8118144)
-- Outstanding Allocations:     0 (max 1142)
-- Pcache Overflow Bytes:       0 (max 7454720)
-- Largest Allocation:          2097130 bytes
-- Largest Pcache Allocation:   4368 bytes
-- Bytes received by read():    20855060422
-- Bytes sent to write():       3858800676
-- Read() system calls:         5091600
-- Write() system calls:        998134
-- Bytes rcvd from storage:     0
-- Bytes sent to storage:       263860224
-- Cancelled write bytes:       263843840

Profile collection is the same as before - running the same benchmark with Instrumented mode (-fprofile-generate compiler flag).

As you see, GCC shows PGO improvements too.

(22) By Spindrift (spindrift) on 2023-07-05 05:48:00 in reply to 21 [link] [source]

Brill, thanks 👍

(23) By Alexander Zaitsev (zamazan4ik) on 2023-07-16 00:38:25 in reply to 22 [link] [source]

I see you patched the footprint page and now it does not mention PGO. Do you need more time for testing PGO with your workloads before mentioning PGO somewhere in the SQLite documentation as a possible optimization technique for improving the performance of the database?

I care about that since I think for the users would be helpful to see an "official" recommendation from the developers about an additional way how a user can optimize the database and what improvements can be expected (at least according to the official SQLite benchmarks).