Newbie question: Decision for database model

(1.1) By nji9 (nji9nji9) on 2020-06-20 10:01:58 edited from 1.0 [link]

Hello ... for the very first time :-)

Finally I'm at the point to need a database for an application.
I decided for SQLite, read the tutorials and documentation,
but would like to ask the experts if my first steps would be right.
As I would like to avoid to choose a wrong design at first step.

It's about a labeling/ tagging application.

For any file (of up to several hundred thousands) at a filesystem
there shall be made an assignment/ labeling to some of about 50 labels.

Also this labeling shall be queried like:

Which items are assigned to a (subset of labels)?
Which labels are assigned to a (subset of items)?
But also more complex like:
Sort all subsets of labels by the count of items they are assigned to.
... etc.

My "idea" is to model it like this:
One table for the items (consisting only of its name = location).
One table for each label (consisting of only the ids of the item table).
Simple enough.

But will the queries from items to labels be fast enough?
As there is no "backward link"?
I.e. I (SQLite) will have to search every label table on the existence
of the item...?

As this seems a typical newbie question ... I hope you'll be kind to me ;-)

(2) By Larry Brasfield (LarryBrasfield) on 2020-06-20 09:48:31 in reply to 1.0

I'm rephrasing your intention to clarify it.

You have a largish number of file objects, each with some identifier (such as a full pathname.)

You have a small, generally fixed, set of labels (sometimes called "tags") to be associated with the file objects.

The relationship between file objects and labels is many-to-many. (In other words, a file can be associated with 0, 1, or more labels, and a label can be applied to an arbitrary number of files.)

(Unstated before, but likely:) You want to avoid duplication of labels associated with any file object.

The traditional way to express this in a relational database is to have 3 tables and a constraint:

1. A table with one row per file object, containing at least their identifiers.

2. A table with one row per label that _might_ be associated with files.

3. A linking table, each row of which signifies the association of of one label with one file object. In SQL-speak (or RDBMS-speak), there will be two foreign key columns, one referencing file table rows and the other referencing label table rows.

4. Usually, a linking table has a UNIQUE constraint applied to the combination of the two foreign keys. This precludes the above-mentioned duplication.

Your concerns about speed will be addressed by creating suitable indexes upon the tables. The precise indexes needed will depend on the queries you want to support with speedy response. That response will have to be balanced against the impact of index updates during modification of the indexed tables.

(3) By nji9 (nji9nji9) on 2020-06-20 10:00:58 in reply to 2 [link]

Thankyou very much for your help.

Yes, you rephrased it correctly, with one exeption:

The number of labels/ tags is not generally fixed.

...
So that actually was not only a newbie question but also my newbie solution ;-)

I was missing the linking table idea!

I'll go that way then.

Thank you again.

(Probably) see you ... ;-)

(4) By Simon Slavin (slavin) on 2020-06-21 22:35:39 in reply to 1.1 [link]

Larry's response, and your response to it, are great.

I wanted to add a more general point.  No database design should depend on routine creation and deletion of tables. Creating and deleting tables are for programmers and maintenance routines.  If you're creating new tables while your apps are running, you're probably doing something wrong.

So you should definitely look for a solution that doesn't require creating a table every time someone thinks up a new label.

(5) By Larry Brasfield (LarryBrasfield) on 2020-06-21 23:13:46 in reply to 4 [link]

I agree with Simon on his point,

> ... look for a solution that doesn't require creating a table every time someone thinks up a new label.

, but I thought that was evident in the form of schema I suggested. However, I can see how that is not as obvious as I thought, leading to a need for that point.

To be clear: I thought the requirement was to allow a slowly growing or changing (or "generally fixed" as I put it) set of labels to be associated with files. The labels table would contain that set.

(6) By nji9 (nji9nji9) on 2020-06-22 07:50:11 in reply to 5 [link]

If I get it right, Simon's response (thank you :) was meant for me,
as the idea of a new table for every label comes from my side.

And his general point is valuable for me.

Actually I just started with SQL(ite) and beside db design
I will have to get a feeling about algorithms:
When to let do by SQLite (and how) and when better do as before
with structures/ algs of my own.

(Horrible solutions are about to arise! :-)))

Thankyou again!

(7) By Ryan Smith (cuz) on 2020-06-22 10:47:35 in reply to 6 [link]

> (Horrible solutions are about to arise! :-)))

There are a great many folks here whom absolutely delight in discussing, exploring, rehashing, and in general, faff about optimal and best ways to do things - and rightly so, it is fascinating how many seemingly insignificant things really do matter.

Please do some reading on the matters, but keep posting your questions for any and all things not fully comprehended, and we'll happily answer/explore them with you.

Moreover, old salts here don't really know what it is like to be a "newbie" to DB design these days, and we don't get to make up topics here, only answer them (though we stretch that grant significantly most days). This means your questions will probably be in the hearts of many other people who do not post, but do read this forum, so your questions about those "horrible solutions" (and anyone else's that might read this) are very welcome.