Prefetching and its impact on block demands
(1) By nick.tessier on 2024-09-17 19:35:23 [link] [source]
Hello,
We have some tests which read from a 4GB database containing 10 tables of 400MB each: table1 is the first 400MB, table2 the second 400MB, and so on. In these tests we compare the time it takes to read a certain table with and without prefetching enabled.
A colleague of mine ran a test on their local Windows machine reading the last 400MB of the database (table10), and observed that it took 51 seconds with prefetching off and 155 seconds with prefetching on. Is it expected that a concurrent prefetch would slow down the demands on table10 so significantly?
My naive assumption is that this is due to the mutexes acquired when doing things like writing to the cachefile: the prefetch thread grabs the mutex frequently, so demands sometimes have to wait to write to the cachefile, causing the slowdown. Does that seem right? I could also grab a profile of this test with and without prefetching, if you think that would be helpful.
(2) By Dan Kennedy (dan) on 2024-09-17 20:50:33 in reply to 1 [link] [source]
Seems like an awful lot of slowdown for the cachefile mutex.
If you turn on logging of http requests/replies, are the on-demand requests being answered by the server more slowly when prefetching is going on?
Dan.
(3) By nick.tessier on 2024-09-19 20:25:35 in reply to 2 [link] [source]
Yes, it appears to account for almost all of the slowdown. If I add (the difference in average download times × the number of blocks demanded) to the faster non-prefetch time, the two times are only 3 seconds apart, which could just be variance.
- 236411 milliseconds to access table 10 with prefetching on, avg demand download time 2312.75 ms
- 112258 milliseconds to access table 10 with prefetching off, avg demand download time 1099.65 ms
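The reconciliation above can be sanity-checked directly. The number of demanded blocks isn't stated in the thread, so the ~100-block figure below is an assumption (a 400 MB table at the 4 MiB block size mentioned later in the thread):

```python
# Reconciling the prefetch slowdown with the per-block download-time
# difference. blocks_demanded is an ASSUMPTION (~400 MB / 4 MiB blocks),
# not a number reported in the thread.
blocks_demanded = 100

t_prefetch_on = 236_411    # ms, total time to access table 10, prefetch on
t_prefetch_off = 112_258   # ms, total time, prefetch off
avg_on = 2312.75           # ms, avg per-block demand download, prefetch on
avg_off = 1099.65          # ms, avg per-block demand download, prefetch off

# Add the extra per-block download cost to the faster run:
predicted = t_prefetch_off + (avg_on - avg_off) * blocks_demanded
gap = t_prefetch_on - predicted
print(f"predicted {predicted:.0f} ms, observed {t_prefetch_on} ms, gap {gap:.0f} ms")
```

With that block count, the gap is roughly 2.8 seconds, matching the "only 3 seconds apart" observation.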
At first, seeing those numbers, I thought maybe the server was just throttling me, but I also tried increasing curl's buffer size to 10MB and rerunning the test:
- 181516 ms to access table 10 with prefetching on, avg demand download time 1772 ms
- 95816 ms to access table 10 with prefetching off, avg demand download time 933.57 ms
A decent improvement in the prefetching case... do we know for sure that the reported download times cover just the downloading-from-the-server part? I'm worried about what else may be captured in those times.
(4) By Dan Kennedy (dan) on 2024-09-20 11:43:32 in reply to 3 [link] [source]
Quite an interesting result. Seems like it would be quite worthwhile tinkering with curl parameters.
Are you doing this with the daemon or the single-process client?
As far as what may be captured in the download times, it's a little tricky to answer because the code is event driven. The "request" log message is issued right before the call to curl_multi_add_handle(). The "reply" log message is issued right after curl_multi_info_read() indicates that the reply has been received. But if the thread was busy doing something else, it may not have called curl_multi_wait(), and so might not be waiting for the reply the instant it is actually available.
With the single-process client, each thread has its own event loop. So a thread might be late to process a reply because it was processing another reply related to its own work - but the on-demand and prefetch replies are handled by different threads using different (CURLM*) handles. We do not hold any mutex during IO operations - only when manipulating in-memory data structures.
With the daemon, there is a single thread shared by everybody. So a reply to an on-demand request might be processed slightly later than it would be otherwise because the thread was busy processing a prefetch reply.
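The effect described above can be sketched with a toy single-threaded event loop (plain Python, not cloudsqlite's actual code): the "reply" timestamp is taken only when the loop gets around to noticing the reply, so time spent processing other work (e.g. a prefetch reply) is folded into the measured download time.

```python
# Toy model of a single-threaded event loop. Each event is a pair
# (arrival_time, processing_time). The loop can only "notice" a reply
# once it has finished processing the previous one, so the measured
# latency includes time the thread spent busy elsewhere.
def run_event_loop(events):
    """Return the extra latency added to each reply by the busy loop."""
    extra = []
    now = 0.0
    for arrival, processing in events:
        noticed = max(now, arrival)      # loop is free no earlier than `now`
        extra.append(noticed - arrival)  # delay between arrival and logging
        now = noticed + processing       # busy handling this reply
    return extra

# A prefetch reply arrives at t=0 and takes 0.5s to process (e.g. writing
# a 4 MiB block to the cachefile); an on-demand reply arrives at t=0.1
# but is not noticed until the loop is free again at t=0.5.
print(run_event_loop([(0.0, 0.5), (0.1, 0.2)]))  # [0.0, 0.4]
```

In the daemon case, where one thread serves everybody, the on-demand reply here would be logged 0.4s "slower" even though the bytes arrived almost immediately.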
My guess is that these numbers are too big to be contention in the client. These are network magnitude things. That's just a guess of course.
What happens if you run the experiment under unix command [time]? Does it show that the system uses significant CPU (user or system) time compared to wall-clock time? Or is most of the time spent idle waiting for network replies?
Dan.
(5) By nick.tessier on 2024-09-20 15:15:27 in reply to 4 [source]
So far all the numbers I've shared have been from the single-process client, including this one.
The numbers are going to change now, as I have spun up a Linux VM close to the datacenter where the blocks are stored so that I can run the tests under the time command.
New Test Parameters
- 20 GB database
- read the last 4 GB of the database as demands
- single-process client
- I also checked how far the prefetch was able to get in one run: it finished blocks 0 - 3140.
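Assuming 4 MiB blocks (and GB meaning GiB here), these parameters imply the prefetched range and the demanded range never overlap, which a quick check confirms:

```python
# Sanity check on the test parameters. Block size of 4 MiB is an
# assumption here (it is confirmed later in the thread).
MiB = 1024 * 1024
GiB = 1024 * MiB

db_size = 20 * GiB
block_size = 4 * MiB
demand_size = 4 * GiB

total_blocks = db_size // block_size                   # 5120 blocks total
demand_start = (db_size - demand_size) // block_size   # first demanded block
last_prefetched = 3140                                 # prefetch reached 0-3140

overlap = last_prefetched >= demand_start
print(total_blocks, demand_start, overlap)  # 5120 4096 False
```

So the demand reads blocks 4096-5119, well past where the prefetch stopped, meaning no block is downloaded by both paths.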
With Prefetching on (time taken to complete demands)
200842 ms
- real 3m32.62s
- user 35.65s
- sys 24.63s
171733 ms
- real 3m2.3s
- user 36.79s
- sys 24.92s
183965 ms
- real 3m14.46s
- user 37.41s
- sys 26.3s
With Prefetching off
120711 ms
- real 2m12.53s
- user 9.45s
- sys 9.49s
118282 ms
- real 2m9.89s
- user 9.16s
- sys 9.44s
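These figures also answer the question from post 4 about CPU versus idle time. A quick calculation (plain Python, using the first run of each case above) shows most of the wall-clock time is spent idle, presumably waiting on the network:

```python
# Fraction of wall-clock time spent neither in user nor system CPU,
# i.e. idle (mostly waiting for network replies).
def idle_fraction(real_s, user_s, sys_s):
    return (real_s - user_s - sys_s) / real_s

prefetch_on = idle_fraction(3 * 60 + 32.62, 35.65, 24.63)   # first "on" run
prefetch_off = idle_fraction(2 * 60 + 12.53, 9.45, 9.49)    # first "off" run
print(f"prefetch on:  {prefetch_on:.0%} idle")   # ~72%
print(f"prefetch off: {prefetch_off:.0%} idle")  # ~86%
```

Both cases are dominated by waiting, and the prefetching case burns roughly 3x the CPU, consistent with it downloading several times as much data.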
(6) By Dan Kennedy (dan) on 2024-09-20 15:39:16 in reply to 5 [link] [source]
Nice. Is this with 4MiB blocks?
Assuming so, those numbers look realistic. Downloading (4GiB + 3140*4MiB) uses a bit under 4x the user and sys CPU of just downloading the 4GiB. Reasonable, as it's doing roughly 4 times as much work.
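The estimate above can be checked numerically against the first timing run of each case from the previous post (block size of 4 MiB assumed):

```python
# Data downloaded, in MiB: with prefetching the client fetches the 4 GiB
# demand plus blocks 0-3140 (3141 blocks of 4 MiB); without, just the 4 GiB.
MiB = 1
GiB = 1024 * MiB

data_with_prefetch = 4 * GiB + 3141 * 4 * MiB
data_without = 4 * GiB
data_ratio = data_with_prefetch / data_without
print(f"data ratio: {data_ratio:.2f}x")   # ~4.07x as many bytes

# CPU time (user + sys) from the first run of each case:
cpu_with = 35.65 + 24.63
cpu_without = 9.45 + 9.49
cpu_ratio = cpu_with / cpu_without
print(f"CPU ratio: {cpu_ratio:.2f}x")     # ~3.18x, a bit under the data ratio
```

So CPU consumption scales roughly (slightly sub-linearly) with the amount of data downloaded.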
Moving closer to the server seems to have improved relative performance. We're now seeing the "with prefetching" case taking 1.5 times as long as "without prefetching", whereas when we were further from the server, "with prefetching" was taking 3 times as long. I think this shows that in the first case (far from server) the slowdown is due to the network. If it were the mutex or something else in the cloudsqlite client, I think we would see the relative performance decrease as we moved closer to the server.
Does that sound right to you?
It doesn't prove anything about why the new test is 1.5 times slower with prefetching turned on, though. It wouldn't be surprising if that is partly due to contention in the cloudsqlite client. If you were able to set up an experiment running the two operations (i.e. the on-demand query and the prefetch) in separate clients using separate cachefiles to see how long it took that way, that might give us an upper limit on any possible performance improvement.
I think there is a good chance that even in this case, most of the slowdown is in the network ops.
Dan.
(7) By nick.tessier on 2024-09-27 14:12:32 in reply to 6 [link] [source]
Yep, this is with 4MiB blocks.
Agree with your assessment about the 'time' results and the relative performance increase.
I attempted your suggested experiment, and unfortunately it seems that having two processes hurts the demands more than having the prefetch in the same process. With those numbers in mind, I don't think we can draw any conclusions about how much the network is holding us back. I am also fairly confident that, in the single-process case, the prefetch never downloads the same blocks that the demand ends up hitting.
DUAL PROCESS VS SINGLE PROCESS
Dual Process
- 161692 ms to demand testdata9, 10 with a prefetch running in a different process
- 187802 ms
- 220536 ms
- 191995 ms in demand time (1005 blocks downloaded, average 191 ms download time)
Single Process
- No prefetch
- 109509 ms
- prefetch
- 129560 ms
- 138188 ms
- 141954 ms
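As a side observation on the dual-process numbers: the fourth run reports both a total demand time and per-block stats, and the two agree almost exactly, which suggests the demanded blocks are being downloaded one after another rather than in parallel:

```python
# If demands were serial, total demand time should be roughly
# (blocks downloaded) x (average per-block download time).
blocks = 1005
avg_download_ms = 191
total_ms = 191_995          # reported demand time for the fourth run

predicted_ms = blocks * avg_download_ms
print(predicted_ms, total_ms - predicted_ms)  # 191955, off by only 40 ms
```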
Based off of what you said about the daemon vs in-process, could we expect the numbers to be worse in the daemon? I assumed yes, but I gave it a test and the numbers were roughly the same. Maybe the test is just not overwhelming enough to expose the differences in the daemon vs the single process?
(8) By Dan Kennedy (dan) on 2024-09-27 18:00:18 in reply to 7 [link] [source]
> Based off of what you said about the daemon vs in-process, could we expect the numbers to be worse in the daemon? I assumed yes, but I gave it a test and the numbers were roughly the same. Maybe the test is just not overwhelming enough to expose the differences in the daemon vs the single process?
I'd say that's correct. The network is a much bigger bottleneck than the single-threadedness of the daemon process.
Dan.