Should tables be populated based upon a known/expected pattern of retrieval?

(1) By Gary (1codedebugger) on 2021-04-21 04:16:19 [link] [source]

I'm trying to compare two sources of data that are supposed to each contain at least one column of identical data (the same original source from which each claims to have been constructed) and be very close otherwise.

I unpacked the ugly data blocks from each source into two separate tables, in which there are about 500,000 rows of 10 columns, where no column is over 50 bytes.

If everything were perfect, there'd be an equal number of rows and one identical column, such that the two tables could be joined on that column to add the other columns from table 2 into table 1.

Since that isn't the case and I cannot make it so without adding my "opinion" to one or both of the data sources, I have a question of whether the two tables should remain separate or combined vertically since the layouts are identical.

I read here before that, if the table structures are the same, there's no good reason to have two tables instead of one with an added identifier.

If that is true, I've another question. Even though the tables don't match exactly in a one-to-one correspondence, each has the exact same number of divisions in it, where a division is generally between 2 and 50 rows of data. Since almost every case of examining this data will be to compare the same division from each table, should the two tables be combined in that manner?

If table 2 is simply appended to table 1, when two divisions are compared, n number of rows will be retrieved from the top of the table and then n or close to n rows will be retrieved about 500,000 rows lower down in the table.

I'm writing way beyond my depth of understanding here but should the final combined table be built ordered by table 1 block 1, table 2 block 1, table 1 block 2, table 2 block 2, ..., such that both blocks may more likely be on the same page making the retrieval query more efficient?

Also, if this data is to be made accessible through a GUI that permits the user to edit the data, which is done by adding a new row to the table with an incremented version number, such that the original is always preserved, does that remove any efficiency gained by ordering the table by division?

Thank you for considering my very novice question.

(2) By Ryan Smith (cuz) on 2021-04-21 14:18:26 in reply to 1 [link] [source]

To answer the titular question: No - It really doesn't matter.

Fill the table however you like. Choose a good column/columns to index by, but the indexer will build its own model-list of the values that do make sense in terms of ordering, but is completely unrelated to the order/granularity with which you inserted it.

Should the data of equal structure be in one table rather than two? - yes definitely. This is very important.

You can imagine this problem yourself by arm-chair experiment: If I told you to look up someone's phone number in an old-school phonebook, the book is ordered by Surname, Name. It matters very little that the phone numbers themselves, or their addresses or such are unordered or not clumped together, you are only going to be looking up a Surname+name.

Similarly - would you like to do one lookup in a nice fat phonebook, or two different look-ups in two thinner phonebooks?

The former is always preferred, even by humans, but especially so by computers. Mathematically, a binary search (SQL engines typically use B-Trees and not straight binary look-ups per sé, but they still allow look-ups in logarithmic time) requires a maximum 24 steps to search through ~1 million records, and 25 steps to search ~2 million records. 25 steps is a lot better than doing 24 steps through a million and again 24 through the million in the next table.

To be more direct: You can extrapolate it out to say, 10 million records in 10 tables, which takes 10 x 24 steps (240 steps[1]) each to search, and 10 million records in one single table which can be searched in a grand total of 28 steps. It is clear that the optimal always converges on the single table.

There are other more human advantages too, not having to join tables to get a single result set, Understanding your own data and schemata better, etc.

[1] Note that it doesn't always take the full 24 steps, sometimes luck lets us hit the desired item in less steps - BUT, this average converges near the final steps and not the initials, so that in a 24-step look-up (~million rows) if we do a few million look-ups and count the steps each time, the average goes to somewhere around the 23 steps mark and not 12 steps. Cardinality will influence this average greatly, but can be ignored for Unique/Primary keys. B-Trees are slightly less efficient at the steps, but it is paid in service of much more efficient storage.

(3) By Ryan Smith (cuz) on 2021-04-21 14:45:09 in reply to 2 [source]

I feel I should add some notes, before someone thinks that there is no good reason to have multiple tables with similar structures EVER.

Reasons to split tables in the face of those disadvantages mentioned in the parent post:

Physically the data won't fit in a single file/disk/partition. In SQLite's case, this will require a separate DB, not just table.
The data is split upstream by some factor unrelated to the content. The data for different regions/states/counties may reside on different web servers/services or be accessed by different front-ends so that they will never mix and look-ups will never ever happen outside a specific set.
For SQLite specifically, or ISAM format tables supporting only table-level locking and not row-locking, very high write-concurrency can be improved/solved by multiple similar tables, at the cost of having to Join/Attach them when doing wide queries.

In near every other case it is better to have a single table. (I might be forgetting a use case or two - perhaps someone else might add more).