SQLite User Forum

A newbie’s question about database structure design
Login
> I have a 3yeaars large number of stock historical data, they are stored in directories according to different dates, and the name of each directory is like 20180102, 20180103...There are about 750 directories so far, and there are more than 4000 CSV files stored in each directory, and file name corresponds to a stock code. Each CSV file has 50000-100000 rows and 7 columns.

I'm guessing the following:

1. Each CSV contains _tick data_, i.e. all individual trades and/or bid/ask quotes for a single counter on a single trading day
2. You're re-summarizing each counter's tick data every day _before_ doing cross-counter or time-based analysis.

To be frank, [2] is stupid. There's absolutely no reason to re-process the raw historical data for, say, 20180102 _every single day_, when you're not changing the operations you do each time. Far better to run your summary calculations _once_ for each day/counter and cache that summary, so instead of a 50000-100000 row CSV, you have a CSV with as few as _one_ row (EOD summary).

Once you have that, then the most flexible database design is actually ONE database with ONE table contains ONE row for each day/counter combo. Having done some work in this area, I can assure you that 4 million records encompassing an exchange's entire historical EOD data is far more tractable and useful, than _400 billion_ records at a data resolution (tick) that isn't even practical for analysis.