Multithread processing, storing in memory and writing to database

(1) By Nagesh Kotyal (nageshka) on 2021-09-29 15:11:36 [link] [source]

Background: I am working on creating a daemon process to collect metrics like CPU, memory, UFS usages and so on every x seconds. For each of the metric, I create one worker thread. It's job is to collect the data, process it, write to database in memory. (Each thread will have its own instance of database in memory) and wait for x seconds. Each worker thread and a table in the database are tightly coupled.

Main thread responsibility is on a event, it will request all worker thread to write/flush data from memory to actual database in the filesystem.

Environment: Linux

Approach 1:

As explained above, each worker thread will have its own instance of database in memory.
Main thread will notify worker thread to flush data from memory to DB in filesystem

Approach 2:

Share one connection instance (memory) across all worker threads
Main thread will flush data from memory to DB since it has access to memory instance.

My questions:

Which approach should I take?
Do I need to handle any concurrent accessing issues? Reason for asking this is each thread will update one specific table.
Or is there any other approach?

Thank you

(2) By Warren Young (wyoung) on 2021-09-29 16:17:57 in reply to 1 [link] [source]

I think you should stop trying to pound that nail in with the butt of your screwdriver. Use a time-series database for this. It's what they're for.

(3) By Nagesh Kotyal (nageshka) on 2021-09-29 16:23:53 in reply to 2 [link] [source]

Thank you Warren for your suggestion. Let me check.

(4) By Nagesh Kotyal (nageshka) on 2021-09-29 16:36:48 in reply to 3 [link] [source]

@Warren, I do not have liberty to use time series database.

Is there any optimal solution for sqlite database?

(5) By Warren Young (wyoung) on 2021-09-29 17:22:07 in reply to 4 [link] [source]

"Optimal?" Inherently not, else we would all be using relational databases for everything.

Aside from your concurrency problems, one of the characteristics of B-tree type storage is that when a bucket spills over, the tree has to be rebalanced, which takes time. What this means in terms of your problem is that insert time varies depending on the state of the B-tree, which means you can't predict the overhead of the insert while other data continues arriving in real time. The only way to avoid dropping data or queueing it up for batch inserts (and then hoping you don't spill again) is to overprovision the hardware so much that even the worst case spill occurs in the time slices you have available.

So yeah, your life sucks because you keep aiming the foot-gun, pulling the trigger, and then wondering why it hurts so much each time. Stop it!

(7) By ddevienne on 2021-09-29 17:35:36 in reply to 5 [link] [source]

Indeed. Appending to a table is not as cheap as appending to a file,
because of B-tree relancing. But if your end result must be an SQLite
DB, then unless you create it as a post-processing from some other
append-only file(s), in which case those rebalances matter little,
you have to accept the cost of appending to tables.

Time-series databases certainly do exists, but are there any as small and
beautiful as SQLite, with the same world-class quality as SQLite, and free too?

(9) By Warren Young (wyoung) on 2021-09-29 18:03:08 in reply to 7 [link] [source]

RRDTool matches pretty well with those criteria, and was designed pretty much exactly for the OP's sort of problem.

(12) By anonymous on 2021-09-30 17:09:28 in reply to 7 [link] [source]

Time-series databases certainly do exists, but are there any as small and beautiful as SQLite

Indeed, and its increasingly common to see SQLite at the endpoints in IOT etc for storing configurations and such, for ex, but then the already present SQLite cannot be used for time-series data and that data has to be stored in some other way for ex, in CSVs or specialized tools or data structures in application language and ETL-ed to a db with time-series features for aggregation etc.

While this does not address OP's immediate problem, I would deviate a little from the thread trying to find workarounds or alternative solutions for OP and suggest a new feature in SQLite instead. PostgreSQL, which SQLite derives a lot of inspiration from, has a time series DB called TimeScale which uses two core concepts of hypertable and chunks - it uses BTrees underneath and keeps all relational features intact. It also claims very fast single-machine ingestion rates so it certainly seems possible with BTrees (I am no expert, though).

Unless something in SQLite design rules it out completely, I think, the chunking approach might be a great feature to add to SQLite to improve SQLite ingestion rate and add basic timeseries capabilities.

https://blog.timescale.com/blog/what-is-high-cardinality-how-do-time-series-databases-influxdb-timescaledb-compare/

https://blog.timescale.com/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/

https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#hypertable-benefits

(6) By ddevienne on 2021-09-29 17:26:24 in reply to 1 [link] [source]

Advantages of Approach#1 is that writes are independent and concurrent.
But you'll block the threads, to copy the data over eventually.

Approach#2 is bad, since the one DB must serialize everything,
which makes all threads wait on a global mutex basically, when they need to write anything.

A third approach is to use a concurrent queue, enqueue from the threads,
dequeue from the main thread to write into the DB. If you are in C++,
I'd recommend https://github.com/cameron314/concurrentqueue which is
lock-free and performant. There are others from Boost, Facebook Folly, etc...

The approach#3 is similar to #2, but with the crucial difference that you are
trading more memory use (to store in the queue) for better concurrency.
And since you guarantee serialized use of SQLite, you can build it w/o mutex
support too, gaining a little more speed.

My $0.02. --DD

(8) By anonymous on 2021-09-29 17:51:31 in reply to 1 [link] [source]

A fourth approach,

Have each thread write to its own database, but instead of an in-memory database, just use SQLite in WAL mode and set synchronous to zero. The main thread could then perform checkpointing on these databases synchronously without blocking the writer threads.

And no, you don't need to aggregate the data later in a single database, you can attach all the databases from a single connection and query it as you like.

(10) By Nagesh Kotyal (nageshka) on 2021-09-29 18:29:14 in reply to 8 [link] [source]

Thank you for the support. Having own database for each metric(each thread) like CPU, memory and so on is not feasible solution.

(11) By Gunter Hick (gunter_hick) on 2021-09-30 07:33:32 in reply to 1 [source]

The heretic's approach:

- have each thread write INSERT INTO <table> VALUES (<values>); statements into a text file 
- when triggered (and/or periodically), each thread closes and renames the text file, then opens a new file
- main thread processes the renamed text files into the DB and renames them again (or deletes them if you don't require rollforward capability)

No pesky synchronization of databases, predictable effort storing the metrics, just one connection doing all the writes in a batch transaction.

Alternatively, instead of full statements, have the metric threads write csv files containing just the data and import those either directly in the shell or via a CSV table valued function.

INSERT INTO <table> SELECT * FROM CSV('<filename>');