Novice database design and query question.

(1.1) By Gary (1codedebugger) on 2020-12-16 06:22:33 edited from 1.0 [link] [source]

I need to store media files in the file system rather than a database because they are well over 1 MB. If I understand the SQLite documentation, it is expected to be more efficient to use the file system than a database when BLOBs exceed 1 MB.

In a media-resource table, the file name and/or path and a unique key will be the minimum columns.

I don't want the user to select the media resources by file path, but by creating their own tags and assigning some number of them to each resource. For a very poor example, the user may apply to an image file the tags of MAPS, TEXAS, OIL WELLS. Thus, a second table is needed to hold the unique user-defined tags.

Being a novice, my first thought was that I'd like to add a column equivalent to an array of tag keys to the media-resource table; but I read that is not the correct approach. Instead a third table should be maintained that has a row for each media resource and tag. By that, I mean, in the poor example given above, that single media resource would have three rows in the third table, one for each tag key.

Would you please confirm that this is the correct approach?

If this is the correct approach, and a user wants a listing of all media resources that have tags of both MAPS and TEXAS, in English, a listing of unique keys from the third table would be selected that have at least two rows, one that is MAPS and one that is TEXAS, and joined with the media-resource table to get the file name and location for display.

Assuming table "tags" is:

key, name
1 | MAPS
2 | OIL WELLS
3 | TEXAS

and table "array" is:

media_key,	tag
1	2
1	3
2	1
2	2
2	3
3	1
3	2

Is it as simple as below?

SELECT * FROM media_resources WHERE key in ( SELECT media_key FROM array WHERE tag = 1 intersect SELECT media_key FROM array WHERE tag = 3 );

Of course, I tried this as small examples and it worked. But my questions are is this the correct approach in design and is there a better method of querying. I know very little but it seems like a bit of work to build two tables inside an "in" and find their intersection, but I don't know how to perform an AND across table rows, if possible. If this is not specific enough to SQLite or is too novice a question, please delete it.

Thank you.

(2) By Gunter Hick (gunter_hick) on 2020-12-16 13:43:35 in reply to 1.1 [link] [source]

NOTE: Foreign key restraints are not shown but may be required

CREATE TABLE media (id integer primary key, file text); CREATE TABLE tags (id integer primary key, name text unique); CREATE TABLE media_tags (media_id integer, tag_id integer); CREATE UNIQUE INDEX tags_by_media on media_tags(media_id, tag_id); CREATE UNIQUE INDEX media_by_tags om media_tags(tag_id, media_id);

SELECT m.file FROM tags t1 JOIN media_tags mt1 ON (t1.id = mt1.tag_id) JOIN media_tags mt2 ON (mt2.media_id = mt1.media_id) JOIN tags t2 ON (t2.id = mt2.tag_id) JOIN media m ON (m.id = mt1.media_id) WHERE t1.name like 'Maps' AND t2.name like 'Texas');

NOTE: Using case insensitive like. Should also be possible by specifying COLLATE NOCASE for tags.name

CREATE VIEW media_with_tag AS SELECT t.name,mt.media_id FROM tags t JOIN media_tags mt ON (t.id = mt.tag_id);

SELECT m.file FROM media WHERE m.id IN ( SELECT media_id from media_with_tag where name like 'Maps' INTERSECT SELECT media_id from media_with_tag where name like 'Texas' );

(3) By Gary (1codedebugger) on 2020-12-16 23:20:38 in reply to 2 [source]

Thank you very much. This was very helpful and interesting to decipher; and it helped me to see how the media_tags table is used to connect the other two.

I'm a little confused still though. If the WHERE clause is left out of the first query with multiple joins, the table has ten columns and seventeen rows (for the example provided in my question of three media files, two of which had two tags and one of which had three tags). Each media file has as many rows as the number of its tags squared.

It appears that the method is to place the tags on the same row in order to use an AND in the WHERE clause, such that there is included every possible pairing of tags for each media file. If there were three tags, does this mean that three tag columns would be required in the table and, therefore, more joins?

I don't know much and I'm only trying to understand, but isn't it a lot to build in the middle so to speak, and would grow quite a bit with the number of media files and associated tags per file?

Simple things seem to blow up in complexity, unless I'm making it more difficult than it is.

Would it be okay to try something like this, considering that the UI will pass the script a list of tag ids?

SELECT *
FROM media m
WHERE m.id in (
SELECT mt.media_id
FROM tags t, media_tags mt
WHERE t.id = mt.tag_id
AND t.id in ( 1, 3, 8, 14 )
GROUP BY mt.media_id
HAVING count( mt.media_id ) = 4 );

It appears that it would handle any number of tags and require two variables, a list of tags and the number of tags; and may build less/smaller intermediate tables.

However, I don't understand the impact of sub queries, GROUP BY, and HAVING on the efficiency of the full process.

Thank you.

(4) By Gunter Hick (gunter_hick) on 2020-12-17 07:54:54 in reply to 3 [link] [source]

There is more than one way to formulate what you are looking for.

Your proposed statements are using tag.id values directly; this means you need to run other SELECT statements first to translate words from input into ids. This may obscure the complexity of what you are doing.

My first example follows "procedural thinking". optimized for two tags.

start with the name of the first tag
translate it into the tag_id
for each associated media id -- find the set of tag_ids -- check if that tag_id translates to the name of the second tag

My second example follows set operations

create a set of ids that have the first tag
intersect with the set of ids that have the second tag
return the files that remin

this can easily be extended for an arbitrary number of conditions, each additional condition requiring building another set and performing an intersect with a progressively small result set.

Your proposed statement produces the cartesian product of the media_tags relation (which requires exponential time) and then selects only a very small number of rows from that.

(5.1) By Gary (1codedebugger) on 2020-12-18 03:01:55 edited from 5.0 in reply to 4 [link] [source]

Thank you for the explanation. I've got quite a bit to learn to genuinely understand it and need to experiment also. I was reading about Cartesian products last night and the trade off between memory usage and time, which I didn't find intuitive but I've much more to read. Thanks again.

It's kindergarten to you all but this is the quickest thus far. When you wrote procedural that made me think about what would I do by hand to accomplish this.

select path, name, concat(path,'/', name)
from media
where id in (
select media_id
from media_tags
where tag_id in (
select id
from tags
where name in ( 'Texas', '18th Century', 'photos', 'video', 'html' ) )
group by media_id
having count( media_id ) = 5 );