Load database file from disk to memory
(1) By Swastik (961679) on 2022-07-08 12:17:33
I'm working with SQLite in Java using the JDBC driver implementation group: 'org.xerial', name: 'sqlite-jdbc', version: '3.36.0.3'. My use case is to run queries faster by loading the SQLite database into memory. I'm using the restore command to load my database file (.db) from disk into main memory with the code below. But it looks like the "restore" command is not fully loading the .db file into memory, because after running this code, when I check heap memory using JConsole, there is not much of a spike: heap memory increases by only 10-20 MB, while my .db file on disk occupies some 330 MB. Is this the right way to load a disk file into memory? If yes, why can't I see a spike in heap memory?

@Service
@Slf4j
public class RestoreService {
    private static Connection connection = null;

    public static Connection getConnection() {
        return connection;
    }

    public String restore() {
        try {
            SQLiteConfig config = new SQLiteConfig();
            config.setPragma(SQLiteConfig.Pragma.JOURNAL_MODE, "wal");
            config.setPragma(SQLiteConfig.Pragma.SYNCHRONOUS, "normal");
            config.setPragma(SQLiteConfig.Pragma.TEMP_STORE, "memory");
            connection = DriverManager.getConnection("jdbc:sqlite:file:prod?mode=memory&cache=shared", config.toProperties());
            try (Statement statement = connection.createStatement()) {
                statement.executeUpdate("restore from products.db");
            }
            return "SUCCESS";
        } catch (Exception e) {
            log.error("Error while restoring: {}", e.getMessage());
        }
        return "ERROR";
    }
}
(2) By Simon Slavin (slavin) on 2022-07-08 13:36:43 in reply to 1
restore is not a SQL command, and is not implemented in SQLite. The restore command is part of the CLI application, which can be used to inspect/change SQLite databases.
There are a few ways to do the equivalent thing in SQLite. One is to open an in-memory database
https://www.sqlite.org/inmemorydb.html
then ATTACH the database on disk and copy everything across. Another is to use the backup API:
https://sqlite.org/backup.html
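A minimal sketch of the first approach over JDBC, assuming a single table named products (the table name, schema, and filename here are hypothetical; adapt them to your own database):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadIntoMemory {
    public static void main(String[] args) throws Exception {
        // Open a pure in-memory database.
        Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:");
        try (Statement st = conn.createStatement()) {
            // Attach the on-disk database under the schema name "disk".
            st.executeUpdate("ATTACH DATABASE 'products.db' AS disk");
            // Copy the table's contents into the in-memory main database.
            // (CREATE TABLE ... AS SELECT does not copy indexes or constraints;
            // recreate those separately if you need them.)
            st.executeUpdate("CREATE TABLE main.products AS SELECT * FROM disk.products");
            st.executeUpdate("DETACH DATABASE disk");
        }
        // conn now serves queries entirely from memory.
    }
}

Note that with plain :memory: (no shared cache) the in-memory database lives and dies with this single connection.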
However, your plan to do all your processing in memory is unusual. SQLite is ridiculously fast. Unless you are handling a database stored in Flash memory (i.e. on a removable Flash drive), it's probably not necessary. Have you done actual testing, or are you just assuming it'll be too slow for your requirements?
(3) By Swastik (961679) on 2022-07-08 14:12:08 in reply to 2
Did some testing with mmap mode and in-memory mode (by exporting the SQLite DB into CSV and inserting the records into an in-memory DB) and found a slight difference in speed, so I decided to go with in-memory.
(14) By Decker (d3x0r) on 2022-07-09 17:15:52 in reply to 3
TL;DR: just use the file; adding extra hoops is just going to make future-you regret the increased maintenance cost.
I don't know what method you used to get mmap mode; maybe that's something the Java connector gives you. Working that way is probably still like working with a file. There are a few things that can make file access much faster:
compile options
- SQLITE_ENABLE_LOCKING_STYLE=0 disable locks
- SQLITE_THREADSAFE=0 disable memory locks (if you only use single threads)
https://www.sqlite.org/pragma.html#pragma_synchronous
- PRAGMA synchronous = 0
https://www.sqlite.org/pragma.html#pragma_locking_mode
- PRAGMA locking_mode = EXCLUSIVE
But mostly the file system is cached anyway: between the OS caching the file and SQLite caching pages that are already loaded, a page SQLite finds in its cache is approaching memory speed. If you test with only a couple of operations, the initial fetch from disk into memory will be a larger cost than the accumulated time for thousands of subsequent accesses of that data.
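For what it's worth, a sketch of setting the two pragmas above through the same SQLiteConfig mechanism the original post uses; the database filename is a placeholder, and the compile options would have to be baked into the native library rather than set here:

import java.sql.Connection;
import java.sql.DriverManager;
import org.sqlite.SQLiteConfig;

public class ExclusiveReadConfig {
    public static Connection open() throws Exception {
        SQLiteConfig config = new SQLiteConfig();
        config.setPragma(SQLiteConfig.Pragma.SYNCHRONOUS, "0");        // no fsync; risks loss on power failure
        config.setPragma(SQLiteConfig.Pragma.LOCKING_MODE, "EXCLUSIVE"); // hold the file lock across transactions
        return DriverManager.getConnection("jdbc:sqlite:products.db", config.toProperties());
    }
}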
(4) By Swastik (961679) on 2022-07-08 14:44:51 in reply to 2
If I go with your 1st solution, i.e. creating an in-memory connection, ATTACH-ing the database on disk, and copying everything across:
Do I actually need to copy everything across? Just ATTACH-ing the on-disk database to the connection (created in-memory) will bring all the tables into memory, right?
(5) By Gunter Hick (gunter_hick) on 2022-07-08 16:45:14 in reply to 4
Wrong. Attaching an SQLite db file on a disk will parse the table definitions into working memory. The data still stays on the disk. Only if you create an in-memory database and copy everything across will the data reside in memory. If you are looking to modify any data, it still needs to go to the on-disk database to be persisted.

All that setting temp storage to memory does is make the temp database reside in memory. So you could also just open the database normally and then

CREATE TEMP TABLE whatever AS SELECT * FROM whatever;

which will make a copy of the data in the whatever table in the (in-memory) temp database and shadow the original table. Or, for more fidelity, duplicate the table and indices in the temp database:

CREATE TEMP TABLE whatever (<table declaration>);
INSERT INTO whatever SELECT * FROM main.whatever;
CREATE INDEX whatever_index ON whatever (...);
(6) By Swastik (961679) on 2022-07-08 16:58:02 in reply to 5
Got it!
Now, is there any difference between creating an in-memory database, attaching the database from disk, and then copying tables from the attached database into the in-memory database, vs. creating TEMP tables (which reside in memory) and copying the tables from the on-disk database into them?
(7) By Ryan Smith (cuz) on 2022-07-08 18:20:32 in reply to 6
Yes, there is "any" difference, but it is slight and might not matter.
The one has the full DB, as it was on disk, now in memory; the other has an already in-memory DB with the same tables from the on-disk database replicated into it.
Both require establishing a memory DB and copying data across. The DB name might be different and the speed may be slightly different, but for practical purposes it would be much the same.
May I ask why you so desperately want the attached DB to be forced into memory? If the request is for speed, then as long as you give it enough cache, the effective speed should be more or less the same. The only valid reason to do this is if you expect to lose the on-disk DB at some point, such as a flash drive that will be removed, but in such a case copying the DB to an actual drive will be much, much safer, with the speed difference negligible for anything that can fit in working memory.
What exact problem are you trying to solve? There may already be known best practices or even example code.
(8) By Swastik (961679) on 2022-07-08 18:51:25 in reply to 7
Ok, let me be more clear: I'm trying to use SQLite as a cache where I'll be storing data for some 500k - 5M products. My goal is to serve requests within some 5 ms (let's say at 10 RPS). Each request might also have to perform some complex SQL queries (joins, LIKE, etc.). So for this use case I thought it better to keep the data in memory.
Please let me know if this approach is not good or if there is some better way to do this.
(9) By Ryan Smith (cuz) on 2022-07-08 22:58:14 in reply to 8
Ok, that's a perfectly logical reasoning and it might work well.
There are a few things I'd like to point out:
SQLite working from a drive (SSD) with a sufficiently large cache memory allocated will be extremely fast, faster than you may think, to the point where it might become hard to tell the difference between that and an in-memory DB.
If the data you want to "host" in memory is quite large, so that it occupies more than 50% of the computer's memory, SQLite still needs memory for doing the actual queries, which can easily amount to as much as the DB itself (ok, that's an extreme case, but possible); between the query needs, working memory, and journaling needs, it might choke and actually be slower than using the DB from disk and having all of that memory available for working memory and the page cache.
You can switch journaling modes if you are anyway only going to read from it, plus remove the need to sync data all the time. There are lots of ways to make a read-only (or mostly-read) DB very fast.
If your memory is significantly larger than the DB (4 times or more) and you do only read-only access, the all-in-memory option will certainly be very fast; I'm just not sure how much faster than an on-disk DB (on SSD) with all that memory serving as cache. This should be tested before making a design decision.
Lastly, regardless of all the above, if your memory is significantly larger and you do only read-only access, the all-in-memory option will work very well, but it is more effort to set up and more code to maintain.
I'd say some testing with your actual data is the next step, to see query performance for both scenarios.
Let us know more about your setup (OS, platform, DB size, memory size, drive type, speed, and size, etc.) and we can suggest some good tweaks for setting up both.
(10) By Swastik (961679) on 2022-07-09 04:15:43 in reply to 9
Thank you so much; these points will really help us make our design much better.
Regarding the setup, I will update here again in some time.
(11) By Swastik (961679) on 2022-07-09 07:26:20 in reply to 9
Here are some of the details of our setup
- OS - CentOS
- Platform - AWS EC2 t3
- CPU & Memory - 2 CPU, 8 GB RAM (will upgrade to 4 CPU, 16 GB RAM if the database grows to 5 GB on disk)
- Drive Type - 50 GB SSD
- DB size - 2 GB expected, but might grow up to 5 GB if our database grows to some 5M records
And another important point: this cache will be used most of the time for reads only, but we might also have to write to the cache whenever product details get updated, which is not very often.
(13) By Ryan Smith (cuz) on 2022-07-09 14:02:16 in reply to 11
That seems more or less as expected and I see nothing there to suggest difficulty with any of the solutions.
My advice: figure out which queries you'd want to run a lot (the ones that will need to happen fast), then make a test DB of at least half the max expected size, say 2.5 GB (I'd advise the full max, but you currently have half the memory and half the CPU power, so testing with half the size should give us a good idea).
Add fake data rows to your DB, taking care to keep at least the typical cardinality of the keys, so that the queries still produce as many rows as they are expected to return in production.
Start with a stock-standard SQLite set-up, design a bunch of similar typical queries (maybe 20 or so), and write some code to run these queries in sequence (don't do anything with the data, just keep running the queries), repeating the entire process 1000 times or so, and record the time it takes. It's important not to just run the same query every time, as caching will skew the results; use different queries, and preferably even different types of queries - the closer to real-world we get, the better. This will be a good base value.
Then apply the SQLite tweaks for on-disk speed (see below) and run it again; then push the entire DB into memory (the easiest programmatic way, I think, is to restore/serialize the DB into memory, as suggested elsewhere in this thread) and run it again. The difference in times will show which is faster, whether it is enough faster to make the extra effort pay off, and what actual real-world speeds you can expect in production.
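A rough sketch of such a timing harness over JDBC; the two query strings, the DB filename, and the iteration count are placeholders to be replaced with your real workload:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;

public class QueryBenchmark {
    public static void main(String[] args) throws Exception {
        // Hypothetical queries; substitute ~20 realistic ones with varying shapes.
        List<String> queries = List.of(
            "SELECT * FROM products WHERE id = 42",
            "SELECT p.* FROM products p JOIN categories c ON c.id = p.category_id WHERE c.name LIKE 'A%'");

        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:products.db")) {
            long start = System.nanoTime();
            for (int run = 0; run < 1000; run++) {
                for (String sql : queries) {
                    try (Statement st = conn.createStatement();
                         ResultSet rs = st.executeQuery(sql)) {
                        while (rs.next()) { /* walk the rows; discard results */ }
                    }
                }
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Total: " + elapsedMs + " ms");
        }
    }
}

Run the same harness against each configuration (stock, tweaked on-disk, in-memory) and compare the totals.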
Some typical tweaks for speed increase (where we care more about speed than data-loss on power-failures) are:
-- Assuming a page-size of 4K (4096 bytes).
-- Assuming ANALYZE run + Queries already optimized.
-- These must be run before any queries.
PRAGMA journal_mode = TRUNCATE; -- Speed
PRAGMA synchronous = 0; -- Speed Mode
PRAGMA soft_heap_limit = 0; -- None
PRAGMA temp_store = 2; -- Memory
PRAGMA cache_size = 610400; -- ~2.5GB @ 4K/Pg
PRAGMA cell_size_check = 0; -- No checking
PRAGMA ignore_check_constraints = 1; -- No checking
PRAGMA mmap_size = 536870912; -- 0.5GB (See Compile-Opts)
PRAGMA threads = 8; -- Worker Thread limit
Notes:
- Some of the above are the defaults, but still added since future defaults may change.
- Some of these may clash with the specific DB's design or operating conditions, use with discretion.
- Some of the above are affected/limited by compile-time options and may need rolling your own compile options.
- If rows in your tables are particularly wide or data-heavy, upping the page-size may be useful too.
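For the xerial JDBC driver used earlier in this thread, a minimal sketch of issuing a subset of these tweaks when a connection is opened; the database path is a placeholder, and any pragma limited by compile-time options may be silently capped:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TunedConnection {
    public static Connection open() throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:sqlite:products.db");
        try (Statement st = conn.createStatement()) {
            // Apply the tweaks before any queries run on this connection.
            st.execute("PRAGMA journal_mode = TRUNCATE");
            st.execute("PRAGMA synchronous = 0");
            st.execute("PRAGMA temp_store = 2");
            st.execute("PRAGMA cache_size = 610400");   // ~2.5GB at 4K pages
            st.execute("PRAGMA mmap_size = 536870912"); // 0.5GB
        }
        return conn;
    }
}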
(23.1) By Swastik (961679) on 2022-07-29 11:32:51 edited from 23.0 in reply to 13
I have tested the above configurations with an SQLite db (~130 MB) on SSD and found that a large cache_size (larger than the db file) is not helping much.
In addition to the above configuration I've enabled shared_cache, because on app startup I'm creating a first connection and executing SELECT * on all tables (to bring all pages into memory), so that any connection created later to execute a query can use the cache created by the first connection.
Am I using cache_size & shared_cache the right way?
Will be happy to share the table structure and queries if needed!
(12) By Simon Slavin (slavin) on 2022-07-09 13:19:35 in reply to 8
Please excuse some questions which might be considered insulting to an experienced SQLite user. These things can make a big difference to the speed of the sort of operation you outline, and may make your times acceptably fast without you needing special code to copy the entire database into memory.
A) SQLite itself can copy your data into memory, just by you giving one command. Have you tried setting the cache size big enough that SQLite itself will copy the database into memory? This lets SQLite do the heavy lifting without you having to do custom programming to write changes back to storage.
https://www.sqlite.org/pragma.html#pragma_cache_size
B) Have you created the right indexes for the SELECTs and UPDATEs you need to do? This includes setting FOREIGN KEYs and thinking about how each JOIN will work.
C) Do you enclose the queries for a single user operation (producing a whole web page, or finding everything you need to know about an item) in a transaction? A lot of the most time-consuming work SQLite does is about starting and ending transactions. The thing you want - looking up the data - is fast. If you don't define transactions, SQLite makes one per command, which may be far more transactions than you need.
D) I would ask whether your database was stored on SSD, but you already posted that it is. Good.
I'm asking these only because it saves you a lot of extra coding, which can lead to some nasty bugs if changes are made to the database. We talk about 'premature optimization' a lot on this forum. It happens when someone starts writing complicated code before checking whether simpler operations produce a 'fast enough' result. You seem experienced, but I thought I'd check.
(15) By Swastik (961679) on 2022-07-09 17:42:40 in reply to 12
I don't have any previous experience with SQLite :)
Also, I'm not clear on point 'A' from your post. Suppose we set PRAGMA cache_size to some large number; will SQLite bring all records into memory as soon as the connection is created?
(16) By Keith Medcalf (kmedcalf) on 2022-07-10 00:38:24 in reply to 15
No. pragma cache_size=n sets the maximum cache size for the connection. When a page is read, it is stored in the cache and will not be read again from underlying storage (which may be the system page cache) unless the cache is invalidated.
If you only ever read page 427, then only page 427 will be in the cache.
However, if you set the cache_size to 2 and then read 3 pages, the last two pages read will be in the cache.
(17) By Simon Slavin (slavin) on 2022-07-10 13:11:57 in reply to 15
Ah. If you don't have SQLite experience, my questions are more appropriate. The following is simplified for clarity, and full documentation is available on the SQLite site.
The answer to your question is 'no'. Each page is read into cache memory the first time SQLite accesses it from disk. From then on, SQLite will access that data from the cache rather than having to go back to storage. A page is cleared from the cache only if SQLite runs out of cache space and decides to free up room by discarding a page it hasn't looked at for a long time. Obviously, if you make the cache bigger than the database, this never happens, and as your program gradually accesses all the data, the whole database will eventually be read into memory.
Using caching like this prevents a long startup time while your program reads the entire database into memory. On the other hand, if you want to force SQLite to read all the data (though not all the indexes) into cache at the start of the program, just execute something like SELECT * FROM MyTable for every table at the start of the program.
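A sketch of that warm-up over JDBC, discovering the user tables from sqlite_master rather than hard-coding them (plain JDBC calls; nothing here is specific to this thread's schema):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class CacheWarmup {
    public static void warm(Connection conn) throws Exception {
        List<String> tables = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT name FROM sqlite_master WHERE type = 'table'")) {
            while (rs.next()) {
                tables.add(rs.getString(1));
            }
        }
        // Touch every row of every table so its pages land in the page cache.
        for (String table : tables) {
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT * FROM \"" + table + "\"")) {
                while (rs.next()) { /* discard; reading the pages is the point */ }
            }
        }
    }
}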
The main advantage is that you don't have to write special complicated write-back code into your program, because it's all done with one PRAGMA statement right at the start. SQLite takes care of the rest. And if you decide that SSD is fast enough and you don't need it any more, you can just comment out one line of your program (or change a configuration parameter in a config file somewhere).
The point I wrote about transactions is also important. Much of the time taken by SQLite is involved in setting up a transaction with BEGIN and ending a transaction with COMMIT or END (which are synonyms). You can't execute any SQL command, including SELECT, outside a transaction, so if you forget a BEGIN, SQLite will helpfully surround your SQL command with BEGIN and COMMIT.
Because of this, if you're writing code which needs to assemble the results of multiple queries together, it's far faster to put BEGIN before the first one and COMMIT/END after the last one.
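In JDBC terms, one way to get a single BEGIN ... COMMIT around a group of queries is to toggle auto-commit; a minimal sketch, with the two query strings as hypothetical stand-ins:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class BatchedReads {
    public static void runPageQueries(Connection conn) throws Exception {
        conn.setAutoCommit(false); // a transaction begins before the next statement
        try (Statement st = conn.createStatement()) {
            try (ResultSet rs = st.executeQuery("SELECT * FROM products WHERE id = 42")) {
                while (rs.next()) { /* assemble part of the result */ }
            }
            try (ResultSet rs = st.executeQuery("SELECT * FROM reviews WHERE product_id = 42")) {
                while (rs.next()) { /* assemble the rest */ }
            }
            conn.commit(); // one COMMIT for the whole group
        } finally {
            conn.setAutoCommit(true);
        }
    }
}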
(18) By Swastik (961679) on 2022-07-10 14:10:22 in reply to 17
I understood your point about using a larger cache and doing SELECT *.
I have a doubt here: suppose I used this approach, and when the app starts I do SELECT *, which brings all pages into the cache. Later, at some point, a record is updated/deleted. In this situation, what will happen to the cache? Will this be reflected in the cache automatically (I mean, will SQLite take care of updating the cache)?
(19) By Simon Slavin (slavin) on 2022-07-10 14:28:38 in reply to 18
Whenever SQLite needs access to a database it does it via the cache. Data is read from the cache (with another mechanism reading from storage into the cache first, if necessary) and changes are made to the cache (with another mechanism writing changed parts of cache back to storage).
The question is not whether SQLite uses the cache, the question is how big the cache is.
The same is true of how your operating system works. And how real-time fast 3D games work. Pretty much everything big works using caches. What you were doing when you asked your first question here was actually inventing a big cache.
[The above explanation is simplified for clarity and brevity. You are dipping into the shallow parts of the complexity which is Computer Science.]
(20) By anonymous on 2022-07-10 15:57:19 in reply to 19
I suppose the desire here is to prefill the cache to deliberately cache the whole table, not just cache on first demand.
(24) By skywalk on 2022-07-29 14:30:04 in reply to 19
Great topic, and I may be a dufus? My app has a growing SSD-based database (dbf) with conservative settings favoring integrity over speed. During development I noticed slower query & insert speeds at larger row counts. I decided to try a feeder memory database (dbm) for queries & inserts; only if the data was acceptable would I then insert from dbm -> dbf. This sped up my overall operation by the very technical term: A LOT! Are you saying I only had to increase the (dbf) cache size and keep incrementally adding to it, without doing operations within the tinier (dbm)? My app never changed the default cache size, which I now know is 1.95 MB.
(21.2) By Swastik (961679) on 2022-07-11 06:21:44 edited from 21.1 in reply to 1
So, based on the complete discussion, I would like to conclude a few things here. Please correct me if I'm wrong on any of the points below.
Three ways to bring an SQLite DB from disk into memory:
a. Using the ATTACH command to attach the on-disk DB to an in-memory DB and migrating the tables - From thread 2
b. Creating TEMP tables and migrating the tables into them, with PRAGMA TEMP_STORE=MEMORY configured - From thread 2
c. Configuring PRAGMA CACHE_SIZE to a large value (greater than the DB size on disk) and then performing SELECT * when the app starts (I thought this was the better approach) - From thread 17
And: compare the difference in speed between SQLite on disk (SSD) with a few parameters configured vs. SQLite in memory - From thread 13
(22) By Ryan Smith (cuz) on 2022-07-11 07:06:10 in reply to 21.2
Three ways to bring SQLite DB from disk to memory
No, there are only 2 ways to "bring from disk to memory": serialize it into memory, or create an in-memory DB and copy the tables over.
The other options are all geared to making SQLite faster by leveraging the available memory.
Further, the suggestion was that serializing or recreating the DB in-memory would require somewhat more programming, especially if there will be DB updates. To confirm whether the extra effort offers a speed increase worthy of the effort, you need to TEST the approaches by comparing the speed differences.
(25) By anonymous on 2022-09-16 09:43:03 in reply to 21.2
Since you are using xerial/sqlite-jdbc, there is one more efficient way to bring the entire db from disk into memory:

import java.sql.Connection;
import java.sql.DriverManager;
import org.sqlite.SQLiteConnection;
import org.sqlite.core.DB;

Connection connection = DriverManager.getConnection("jdbc:sqlite::memory:");
SQLiteConnection sqliteConnection = (SQLiteConnection) connection;
DB db = sqliteConnection.getDatabase();
db.restore("main", fileName, null); // copy the on-disk file into the in-memory db

I have found this to be really fast, even on a non-SSD. Please have a look at this once.