SQLite Forum

CREATE INDEX ignoring pragma cache_size - is there a way to fix?

(1) By anonymous on 2020-12-21 10:08:55

It seems that SQLite ignores pragma cache_size when using CREATE INDEX.

I have profiled my application's memory use, and creating an index always takes the same amount of memory, which depends only on the size of the data being indexed. There is no correlation between cache_size and the memory used.

cache_size is set correctly, because SELECT and INSERT statements respect it.

I also tried temp_store = FILE, with no effect.

Is there any way to limit the amount of memory used when creating indices?

(2) By Dan Kennedy (dan) on 2020-12-21 13:17:35 in reply to 1

How are you measuring memory usage?

With temp_store=FILE, a CREATE INDEX statement for a large index should use approximately 2x (or (n+2)x if you have SQLite configured to use n background threads when sorting) the configured cache-size in heap memory. It also uses temporary files, usually in /tmp, but sometimes in some other place:

https://sqlite.org/tempfiles.html#temporary_file_storage_locations

Depending on how you're measuring memory usage and where the temp files are being created, they may be included in the measurement.
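
If the number being watched is process-level (Task Manager, mprof and the like), one quick cross-check is to sample RSS immediately before and after the statement. A minimal sketch, assuming the third-party psutil package and placeholder database/table names; note that temp files created on a RAM-backed filesystem would show up in this figure too:

import sqlite3
import psutil  # third-party: pip install psutil

proc = psutil.Process()
con = sqlite3.connect("test.db")             # placeholder database
con.execute("PRAGMA temp_store = FILE")
con.execute("PRAGMA cache_size = -512000")   # negative value is KiB, so ~500 MB

rss_before = proc.memory_info().rss
con.execute("CREATE INDEX idx_t_x ON t(x)")  # placeholder table/column
rss_after = proc.memory_info().rss
print(f"RSS grew by {(rss_after - rss_before) / 2**20:.1f} MiB")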

Dan.

(3) By anonymous on 2020-12-21 14:38:43 in reply to 2

I am measuring memory use with the Windows Task Manager and Python's mprof (the program uses the sqlite3 module from Python 3.8.6).

My connection pragmas are:

PRAGMA synchronous = OFF;
PRAGMA journal_mode = OFF;
PRAGMA temp_store = FILE;
PRAGMA cache_size = -{500*1024};  # -512000, i.e. 500 MB

During the program run, the cache fills up to 500 MB (due to the inserts).

Then, when I call CREATE INDEX, memory use rises by around a further 1400 MB (depending on the data to be indexed).

I tried to counter this by temporarily reducing cache_size to 50 MB:

PRAGMA cache_size=50000

When I do so, memory use drops to 150 MB (50 MB cache + 100 MB Python overhead), then rises again (by around 1000 MB this time).

So it seems as if cache_size has some effect on the amount of memory used, but the increase is nowhere near twice the cache size.
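
In outline, the sequence in my program is the following (a sketch; database, table, and column names are placeholders):

import sqlite3

con = sqlite3.connect("test.db")                 # placeholder database
con.execute("PRAGMA synchronous = OFF")
con.execute("PRAGMA journal_mode = OFF")
con.execute("PRAGMA temp_store = FILE")
con.execute(f"PRAGMA cache_size = -{500*1024}")  # -512000 => 500 MB cache

# ... bulk INSERTs here; memory grows to ~500 MB as the cache fills ...

con.execute(f"PRAGMA cache_size = -{50*1024}")   # shrink the cache before indexing
con.execute("CREATE INDEX idx_t_x ON t(x)")      # placeholder table/column
con.execute(f"PRAGMA cache_size = -{500*1024}")  # restore the larger cache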

(4.1) By curmudgeon on 2020-12-22 07:46:19 edited from 4.0 in reply to 3

I've no idea if this is related, but I recall finding on Windows 10 that copying the rowids from a very large index into a vector was much, much slower if I accessed the index in descending order. I discovered that during the descending-order copy, Windows' available memory gradually decreased, as if Windows were caching the entire index. The memory-usage and time differences between the ascending and descending copies were huge. Keith Medcalf was able to reproduce it and contacted Microsoft, but nothing was ever done about it.

The thread is on the old Nabble forum, here:

http://sqlite.1065341.n5.nabble.com/timer-td102512.html

If you think it might be related and look it up, I'd start at the post where Keith got involved, as many of my earlier posts/experiments were flawed.

(5) By Keith Medcalf (kmedcalf) on 2020-12-21 17:59:36 in reply to 3

PRAGMA cache_size=50000

This sets the cache size to 50,000 pages -- not 50 MB. Assuming the default page size of 4 KiB, this is just a tad larger than 195 MiB.
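
The units are easy to verify from the pragmas themselves; a quick sketch using the Python sqlite3 module:

import sqlite3
con = sqlite3.connect(":memory:")

con.execute("PRAGMA cache_size = 50000")   # positive value counts pages
pages = con.execute("PRAGMA cache_size").fetchone()[0]
page_size = con.execute("PRAGMA page_size").fetchone()[0]  # typically 4096
print(pages * page_size / 2**20)           # ~195.3 (MiB)

con.execute("PRAGMA cache_size = -50000")  # negative value counts KiB
print(50000 / 2**10)                       # ~48.8 (MiB)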

So it seems as if cache_size has some effect on the amount of memory used, but the increase is nowhere near twice the cache size.

Well, it both does and does not: cache_size is the size of the buffer (cache) used for reading and writing the database file. It has no direct impact on sorting, which is a completely separate mechanism.

If you have enabled worker threads, then there is a runtime PMASZ configuration option (and a corresponding compile-time default) that determines when to spin up additional workers for sorting operations.

I do not know what the memory-usage limits for the sorter are; perhaps they are bounded to some percentage of the pager cache size. In any case, the memory would come from the heap outside the pager cache, and so would have a hard upper bound set by the heap size limit. Without examining the code, I have no idea how the soft limit (if there is one) is set.
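
For reference, the worker-thread count is settable per connection at the SQL level (PMASZ itself is only reachable through sqlite3_config() in C). A sketch, with a placeholder database name; the pragma reports 0 on builds compiled without worker-thread support:

import sqlite3
con = sqlite3.connect("test.db")   # placeholder database
con.execute("PRAGMA threads = 4")  # permit up to 4 auxiliary threads for sorting
print(con.execute("PRAGMA threads").fetchone()[0])  # stays 0 if unsupported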

(9) By anonymous on 2020-12-21 18:34:02 in reply to 5

That was a typo -- of course it was -50000.

Thank you for confirming that sorting memory limits are not related to read/write cache.

My journey continues into the source code then :)

(10) By Keith Medcalf (kmedcalf) on 2020-12-21 20:12:16 in reply to 9

My understanding (which may be in error -- I have not examined the sorter code) is that the sorting is basically a partitioned merge sort.

This means that the memory used by the sorting operation would be limited by the temp-file cache size, which is, I believe, fixed. The documentation indicates that this is a compile-time define, SQLITE_DEFAULT_TEMP_CACHE_SIZE, and that the default temp cache size is fixed, although a grep of the current source code does not reveal that string anywhere in the code.

This would imply that if TEMP_STORE is "file", then the amount of RAM consumed by a sorter operation would be constrained by this value (the TEMP_CACHE_SIZE) multiplied by the number of worker threads configured or actually spun up, whichever is less.

Similarly, if TEMP_STORE is "memory", then there would be no limit -- i.e., all the temporary files would exist solely in memory, and there would be no bound other than an OOM condition (or the hard_heap_limit, if one was set).

Of course, I stand to be corrected, but this is what my observations and the documentation would lead me to conclude.

(6) By Keith Medcalf (kmedcalf) on 2020-12-21 18:22:50 in reply to 3

(7) By Keith Medcalf (kmedcalf) on 2020-12-21 18:27:31 in reply to 3

See also pragma temp_store and the corresponding compile-time option.

(8) By Keith Medcalf (kmedcalf) on 2020-12-21 18:32:03 in reply to 3

Note that if you set pragma temp_store to memory (or the compile-time option to memory-only), then all the sorter data is stored only in memory. I can attest that this is true, because creating an index or running a sort that exceeds the amount of available virtual memory causes the process to terminate with an OOM error.
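
A minimal sketch of bounding that, assuming an SQLite build of 3.31.0 or later (for the hard_heap_limit pragma) and placeholder names; with the cap in place, the CREATE INDEX fails with an out-of-memory error instead of taking the process down:

import sqlite3
con = sqlite3.connect("test.db")                    # placeholder database
con.execute("PRAGMA temp_store = MEMORY")
con.execute("PRAGMA hard_heap_limit = 1073741824")  # cap SQLite's heap at 1 GiB

try:
    con.execute("CREATE INDEX idx_t_x ON t(x)")     # placeholder table/column
except (MemoryError, sqlite3.OperationalError) as e:
    print("index build exceeded the heap limit:", e)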