SQLite Forum

A newbie's question about database structure design

(1.1) By sunjar on 2021-11-13 22:18:18 edited from 1.0

I have three years of historical stock data. The files are stored in directories by date, with directory names like 20180102, 20180103, and so on. There are about 750 directories so far, each containing more than 4000 CSV files, and each file name corresponds to a stock code. Each CSV file has 50,000-100,000 rows and 7 columns.

My current way of processing these files uses Python's os.walk to walk the directory tree. This is very inefficient and slow, so I want to convert these files into an SQLite database, but I am new to databases and I don't know how to design a reasonable database structure. Here are the three solutions I have thought of:

  1. Build one database with 4000 tables, one table per stock holding all of that stock's historical data. I would use Python to merge the three years of CSV data for each stock and import it into that stock's table.

  2. Build 4000 databases (one per stock), each containing 750 tables, where each table corresponds directly to the contents of one CSV file.

  3. Build 750 databases (one per directory), each containing 4000 tables, where each table corresponds directly to the contents of one CSV file (similar to solution 2).

My question is: which solution is best, method 1 or the latter two (2 and 3), considering factors such as execution efficiency, speed, and stability?

Any information will be appreciated.

(2) By anonymous on 2021-11-14 05:51:08 in reply to 1.1

Just build one database with one table. Each directory level simply adds one column to your seven CSV columns.

(3) By Adrian Ho (lexfiend) on 2021-11-14 09:06:31 in reply to 1.1

I have three years of historical stock data. The files are stored in directories by date, with directory names like 20180102, 20180103, and so on. There are about 750 directories so far, each containing more than 4000 CSV files, and each file name corresponds to a stock code. Each CSV file has 50,000-100,000 rows and 7 columns.

I'm guessing the following:

  1. Each CSV contains tick data, i.e. all individual trades and/or bid/ask quotes for a single counter on a single trading day
  2. You're re-summarizing each counter's tick data every day before doing cross-counter or time-based analysis.

To be frank, [2] is stupid. There's absolutely no reason to re-process the raw historical data for, say, 20180102 every single day, when you're not changing the operations you do each time. Far better to run your summary calculations once for each day/counter and cache that summary, so instead of a 50000-100000 row CSV, you have a CSV with as few as one row (EOD summary).

Once you have that, the most flexible database design is actually ONE database with ONE table containing ONE row for each day/counter combo. Having done some work in this area, I can assure you that 4 million records encompassing an exchange's entire historical EOD data are far more tractable and useful than 400 billion records at a data resolution (tick) that isn't even practical for analysis.
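For illustration, a minimal sketch of such a one-row-per-day/counter table; the file name and the OHLCV column names are assumptions, not something specified in this thread:

    # Minimal sketch (assumed names): one summary row per trading day and counter.
    import sqlite3

    con = sqlite3.connect("stocks.db")       # hypothetical file name
    con.execute("""
        CREATE TABLE IF NOT EXISTS eod_summary (
            trade_date TEXT NOT NULL,        -- e.g. '20180102'
            stock_code TEXT NOT NULL,        -- taken from the CSV file name
            open   REAL,
            high   REAL,
            low    REAL,
            close  REAL,
            volume INTEGER,
            PRIMARY KEY (trade_date, stock_code)
        )
    """)
    con.close()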

(4) By sunjar on 2021-11-14 09:43:46 in reply to 2

Wow, what a wake-up call. I now understand that what you are describing is how a database is supposed to work; I'm ashamed that I even said "build 4000 databases". I'm such a rookie.

To restate what you mean: because all the files have the same field (column) structure, I just need to create one table, load all the data into it, and then use the database's capabilities to query and sort, right? But those CSV files do not contain a 'Date' or 'Stock code' column, so I need to add two columns, 'Date' and 'StockCode', and then I can rely on those two columns to query the data. Am I right?
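For illustration, such a single table might look like the sketch below; the seven data columns are placeholders, since the actual CSV columns are not named in this thread:

    # Minimal sketch: one table for all rows, with the date and stock code
    # added as columns (they come from the directory and file names).
    import sqlite3

    con = sqlite3.connect("stocks.db")       # hypothetical file name
    con.execute("""
        CREATE TABLE IF NOT EXISTS ticks (
            trade_date TEXT NOT NULL,        -- directory name, e.g. '20180102'
            stock_code TEXT NOT NULL,        -- CSV file name without extension
            col1 REAL, col2 REAL, col3 REAL, col4 REAL,
            col5 REAL, col6 REAL, col7 REAL  -- placeholders for the 7 CSV columns
        )
    """)
    con.close()

Queries can then filter on those two columns, e.g. SELECT ... FROM ticks WHERE stock_code = ? AND trade_date BETWEEN ? AND ?.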

But in that case the table will be very large: assuming each CSV file averages 70,000 rows, it will have 70,000 × 750 × 4,000 = 210,000,000,000 rows (210 billion). Can SQLite withstand such a large amount of data? Or should I use a database with several tables instead? But wouldn't searching be inconvenient in that case?

Thanks for your advice.

(5) By sunjar on 2021-11-14 10:00:33 in reply to 3

Hi Adrian, your guess is correct; those are bid/ask tick quotes. What you said is very reasonable. In most cases I don't need to re-process the data every day. I have already produced daily summary data and saved it (using Python DataFrames), but sometimes, when I have a new strategy, I still need to traverse all the data for a stock again to compute the historical curve of the new strategy.

It seems that in this case I only need to save the summary information in the database, and I can use Python to process those CSVs on the occasions when I do need to traverse them. This will save a lot of work.

I am a complete novice with databases and am just starting to learn; that's why I asked such stupid questions. :)

Thank you very much for your very helpful and constructive comments.

(6) By Ryan Smith (cuz) on 2021-11-14 10:18:06 in reply to 4

You are exactly correct: one table, with some extra columns for distinguishing dates and stocks.

With appropriate indexing, this one table will probably be faster than any other combination of multiple tables you could consider.
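As a sketch of what that indexing might look like, assuming the two added columns from post 4 are named trade_date and stock_code as in the earlier sketch:

    # Hypothetical composite index covering the two lookup columns.
    import sqlite3

    con = sqlite3.connect("stocks.db")
    con.execute(
        "CREATE INDEX IF NOT EXISTS idx_ticks_code_date "
        "ON ticks (stock_code, trade_date)"
    )
    con.close()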

Also, SQLite's size limit depends on your choice of page size; as for the row-count limit, well, let's just say that with 64-bit rowids it can keep accepting the kind of rows you described for longer than the age of the known Universe.

The size limit will depend on the amount of data per row, which in your case doesn't sound like much. Rows are stored on pages, and in the default SQLite setup the page size is 4K, which typically matches most operating systems' I/O page sizes and is efficient. There is a limit on the number of pages a DB can contain, so the real technical limiting factor is how big a page is, and you can easily increase the page size to a maximum of 64K, which allows databases of roughly 280 terabytes. See the PRAGMA page_size documentation for how to change it.
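One way to change it, as a sketch (PRAGMA page_size only takes effect on a new, empty database or when the file is subsequently VACUUMed; the file name is carried over from the earlier sketches):

    # Set a 64 KiB page size before creating any tables, or VACUUM an
    # existing database afterwards so the new size takes effect.
    import sqlite3

    con = sqlite3.connect("stocks.db")
    con.execute("PRAGMA page_size = 65536")
    con.execute("VACUUM")                    # rebuilds the file with the new page size
    print(con.execute("PRAGMA page_size").fetchone()[0])
    con.close()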

There are other considerations. Adding another index over billions of rows will add a large space requirement. If you want to VACUUM a DB (reclaiming the free space inside it after some use), you will need free space on your hard drive of more than twice the size of the DB.

The best advice I can give is this: start with the default page size and try to import it all. See whether it runs into size errors and, if so, at what percentage of the job it failed; that should give you an indication of how much more room you will need, which will help you decide on an appropriate page size.

Also, make sure you import each section (folder) in a single transaction, to make it as fast as possible without postponing failure conditions by much; that way it is also easy to simply add new folders as they come.
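A sketch of that per-folder import; the directory layout, the column count, the absence of CSV header rows, and the 'ticks' table are all assumptions carried over from the earlier posts:

    # Import one date directory per transaction ('with con:' commits on success,
    # rolls back on error).
    import csv
    import os
    import sqlite3

    con = sqlite3.connect("stocks.db")

    for day_dir in sorted(os.listdir("data")):            # e.g. '20180102'
        day_path = os.path.join("data", day_dir)
        if not os.path.isdir(day_path):
            continue
        with con:                                          # one transaction per folder
            for fname in os.listdir(day_path):
                if not fname.lower().endswith(".csv"):
                    continue
                stock_code = os.path.splitext(fname)[0]    # file name -> stock code
                with open(os.path.join(day_path, fname), newline="") as f:
                    rows = ((day_dir, stock_code, *row) for row in csv.reader(f))
                    con.executemany(
                        "INSERT INTO ticks VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
                        rows,
                    )

    con.close()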

(7) By sunjar on 2021-11-14 10:36:11 in reply to 6

The best advice I can give is this: start with the default page size and try to import it all. See whether it runs into size errors and, if so, at what percentage of the job it failed; that should give you an indication of how much more room you will need, which will help you decide on an appropriate page size.

Hi Ryan:

Thanks for your reply and useful advice. I will follow your suggestion, starting from the default page size with part of the data and then gradually increasing to all the data, to see how it goes. I am now learning the basic statements of SQLite, and I believe it will not take too long before I can start this test.

Regarding what you said about "longer than the age of the known Universe": that is very funny. Maybe I shouldn't spend so much energy on this thankless task; I will treat it as a study of SQLite.

All the best.

(8) By Holger J (holgerj) on 2021-11-14 12:14:47 in reply to 3

The suggestion is exactly what is usually done with "materialized views", that is, physical tables that have a statement to refresh them.

But of course you can use ordinary tables and store the end-of-day stock quotes there, resulting in exactly one row per stock and day.
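A sketch of such a refresh, assuming the 'ticks' and 'eod_summary' tables from the earlier sketches (open and close are omitted because computing them would need a time column to order by):

    # Recompute one summary row per stock and day from the raw rows,
    # assuming col1 is a price and col2 a traded quantity.
    import sqlite3

    con = sqlite3.connect("stocks.db")
    with con:                                # refresh inside one transaction
        con.execute("DELETE FROM eod_summary")
        con.execute("""
            INSERT INTO eod_summary (trade_date, stock_code, high, low, volume)
            SELECT trade_date, stock_code, MAX(col1), MIN(col1), SUM(col2)
            FROM ticks
            GROUP BY trade_date, stock_code
        """)
    con.close()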

After a while the intra-day quotes are of no further interest, so you may as well keep them elsewhere or delete them entirely.