SQLite User Forum

Creating hierarchical tags
Login

Creating hierarchical tags

(1) By whereswaldon on 2022-08-09 20:51:25 [source]

I apologize if this has been discussed before (I couldn't find it) or is a common-knowledge SQL thing that I'm just ignorant of.

I'd like to create an SQL schema that implements tagging, but the tags can imply other tags. I'm not sure of the best way to implement this efficiently in SQLite.

Consider the basic schema:

CREATE TABLE entity (
  id integer primary key autoincrement not null,
  name text not null
);
CREATE TABLE tag (
  id integer primary key autoincrement not null,
  name text not null
);
CREATE TABLE entity_tags (
  entity integer references entity(id),
  tag integer references tag(id)
);

This enables creating named entities and named tags, and entities can be given multiple tags. However if tags happened to represent related concepts like a lizard tag and a reptile tag, I'd like to be able to express that things tagged as lizard are also tagged as reptile.

The strategy that occurred to me to represent this was:

CREATE TABLE tag_implies (
  subject_tag references tag(id) NOT NULL,
  object_tag references tag(id) NOT NULL
);

This enables tags referencing other tags. I can query for the tags of an entity (just using the second one in the example) with something like:

select distinct object_tag from tag_implies where subject_tag in (select tag from entity_tags where entity=2);

However, this doesn't show the actual tags directly attached to the entity in the query results, just the tags they imply. It also only shows one layer of implication, when in reality many are possible.

So my question for any generous readers who care to lend their expertise is: is there an efficient way to model this kind of data relationship in SQLite? It seems like it's kinda recursive, and I don't really know how to handle that.

(2) By PChemGuy on 2022-08-09 21:45:21 in reply to 1 [link] [source]

Actually, I have posted a question here a couple days ago about an advanced SQL/SQLite tutorial. The last part of this tutorial focused on the exact same problem, hierarchical category/tagging system. Take a look at it. This tutorial is a work in progress, but it has all key pieces in place. The associated GitHub repository (the link is at the top of the tutorial) contains a model database with pretty much same three tables with tags called "categories" and entities called "files". I have developed SQL code for a set of standard operations. This set is not complete, but, I believe, it contains the most complicated parts. You may wanna go through the entire tutorial, as I build-up material gradually. And if you find it useful, I would greatly appreciate your feedback via discussions/issues in the GitHub repo.

(3) By KIT.james (kjames3411) on 2022-08-10 15:57:44 in reply to 2 [link] [source]

Quite interested, but the link is dead

(4.1) By Keith Medcalf (kmedcalf) on 2022-08-10 19:29:18 edited from 4.0 in reply to 1 [link] [source]

Firstly, it would be perspicacious to declare your "name" columns as UNIQUE. Surely the entity/tag "Red" is the same as the entity/tag "Red" and you only want that tag/entity recorded once? (or how, pray tell, do you intend to tell them apart?)

Also, should they have the attribute COLLATE NOCASE? If "Red" is the same as "reD" then you certainly want this.

Also, an "INTEGER PRIMARY KEY" is never null. Why do you use autoincrement?

There is also no need of ROWID organization for the N:M linkage tables, though entries that are null should not permitted (they are useless), and all you really need is the linkage indexes, not multiple copies of the data.

CREATE TABLE entity (
  id integer primary key,
  name text not null unique collate nocase
);
CREATE TABLE tag (
  id integer primary key,
  name text not null unique collate nocase
);
CREATE TABLE entity_tags (
  entity integer not null references entity(id),
  tag integer not null references tag(id),
  primary key (entity, tag),
  unique (tag, entity)
) WITHOUT ROWID;
CREATE TABLE tag_implies (
  subject_tag integer not null references tag(id),
  object_tag integer not null references tag(id),
  primary key (subject_tag, object_tag),
  unique (object_tag, subject_tag)
) WITHOUT ROWID;

insert into entity values (1, 'Test');
insert into tag values (1, 'Green'), ('2', 'Lizard'), (3, 'Reptile');
insert into entity_tags values (1,1), (1,2);
insert into tag_implies values (2,3);

with directtags(entity, tag) as
     (
        select entity.id,
               entity_tags.tag
          from entity, entity_tags
         where entity.name == 'Test'
           and entity_tags.entity == entity.id
     ),
     alltags(entity, tag, isImplied) as
     (
        select entity,
               tag,
               False
          from directtags
     union
        select entity,
               object_tag,
               True
          from alltags, tag_implies
         where tag == subject_tag
     )
select entity,
       entity.name,
       tag,
       tag.name,
       isImplied
  from alltags, entity, tag
 where alltags.entity == entity.id
   and alltags.tag == tag.id
;
┌────────┬────────┬─────┬───────────┬───────────┐
│ entity │  name  │ tag │   name    │ isImplied │
├────────┼────────┼─────┼───────────┼───────────┤
│ 1      │ 'Test' │ 1   │ 'Green'   │ 0         │
│ 1      │ 'Test' │ 2   │ 'Lizard'  │ 0         │
│ 1      │ 'Test' │ 3   │ 'Reptile' │ 1         │
└────────┴────────┴─────┴───────────┴───────────┘

(5) By PChemGuy on 2022-08-10 19:22:09 in reply to 3 [link] [source]

GitHub decided to block my account. They say it may take a week before they fix it.

(6.1) By Keith Medcalf (kmedcalf) on 2022-08-10 19:40:19 edited from 6.0 in reply to 4.1 [link] [source]

Note that the query can be simplified:

with alltags(entity, tag, isImplied) as
     (
        select entity.id,
               entity_tags.tag,
               False
          from entity, entity_tags
         where entity.name == 'Test'
           and entity_tags.entity == entity.id
     union
        select entity,
               object_tag,
               True
          from alltags, tag_implies
         where tag == subject_tag
     )
select entity,
       entity.name,
       tag,
       tag.name,
       isImplied
  from alltags, entity, tag
 where alltags.entity == entity.id
   and alltags.tag == tag.id
;

(7.3) By beetlejuice (coleifer) on 2022-08-11 18:49:32 edited from 7.2 in reply to 1 [link] [source]

Expanding on Keith's answer, you could use a recursive CTE. Here's an example with multiple levels of implication:

CREATE TABLE entity (
  id integer primary key,
  name text not null unique collate nocase
);
CREATE TABLE tag (
  id integer primary key,
  name text not null unique collate nocase
);
CREATE TABLE entity_tags (
  entity integer not null references entity(id),
  tag integer not null references tag(id),
  primary key (entity, tag),
  unique (tag, entity)
) WITHOUT ROWID;
CREATE TABLE tag_implies (
  subject_tag integer not null references tag(id),
  object_tag integer not null references tag(id),
  primary key (subject_tag, object_tag),
  unique (object_tag, subject_tag)
) WITHOUT ROWID;

insert into entity values (1, 'test');
insert into entity values (2, 'test2');
insert into tag values (1, 'green'), (2, 'lizard'), (3, 'reptile'), (4, 'animal'), (5, 'living thing');
insert into entity_tags values (1, 1), (1, 2); 
insert into entity_tags values (2, 4);
-- lizard implies reptile, reptile implies animal, animal implies living thing.
insert into tag_implies values (2, 3), (3, 4), (4, 5);

with recursive all_tags (ename, tid, tname) as (
    select e.name, t.id, t.name from tag as t 
    inner join entity_tags as et on (et.tag = t.id)
    inner join entity as e on (et.entity = e.id)
    union all
    select all_tags.ename, tr.id, tr.name
    from tag as tr
    inner join tag_implies as ti on (tr.id = ti.object_tag)
    inner join all_tags on (ti.subject_tag = all_tags.tid))
select at.ename, group_concat(at.tname)
from all_tags as at
group by at.ename;

-- OUTPUT:
test|green,lizard,reptile,animal,living thing
test2|animal,living thing

Depending on whether the CTE is an optimization fence and how that's handled, you may want to move a where clause into the CTE to avoid generating the full list only to filter it later, e.g.:

with recursive all_tags (ename, tid, tname) as (
    select e.name, t.id, t.name from tag as t 
    inner join entity_tags as et on (et.tag = t.id)
    inner join entity as e on (et.entity = e.id)
    where e.name = 'test' -- Restrict our base-case here!
    union all
    select all_tags.ename, tr.id, tr.name
    from tag as tr
    inner join tag_implies as ti on (tr.id = ti.object_tag)
    inner join all_tags on (ti.subject_tag = all_tags.tid))
select at.ename, group_concat(at.tname)
from all_tags as at
group by at.ename

(8) By whereswaldon on 2022-08-12 02:34:22 in reply to 6.1 [link] [source]

Thanks so much for your advice! I've learned a lot from reading it and playing with the queries. Here's where I'm at:

CREATE TABLE entities (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL COLLATE NOCASE
);

CREATE TABLE tags (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL COLLATE NOCASE
);

CREATE TABLE entity_tags (
    entity INTEGER NOT NULL REFERENCES entities (id),
    tag INTEGER NOT NULL REFERENCES tags (id),
    PRIMARY KEY (entity, tag),
    UNIQUE (entity, tag)
) WITHOUT ROWID;

CREATE TABLE tag_implies (
    subject INTEGER NOT NULL REFERENCES tags (id),
    object INTEGER NOT NULL REFERENCES tags (id),
    PRIMARY KEY (subject, object),
    UNIQUE (subject, object)
) WITHOUT ROWID;

INSERT INTO entities (id, name) VALUES (1,'ant'),(2,'beetle'),(3,'lizard'),(4,'turtle'),(5,'cat'),(6,'rabbit'),(7,'human');
INSERT INTO tags (id, name) VALUES (1,'creature'),(2,'insect'),(3,'reptile'),(4,'mammal'),(5,'furry');
-- Insect, reptile, and mammal imply creature.
INSERT INTO tag_implies (subject,object) VALUES (2,1),(3,1),(4,1);
-- Only cats and rabbits are furry, everything else is intuitive.
INSERT INTO entity_tags (entity,tag) VALUES (1,2),(2,2),(3,3),(4,3),(5,4),(5,5),(6,4),(6,5),(7,4);

-- Find the tags for a particular entity. 
WITH RECURSIVE full_tags(entity, tag, implied, implied_by) AS (
    SELECT entity, tag, False, NULL FROM entity_tags WHERE entity_tags.entity = @entity
    UNION
    SELECT full_tags.entity, tag_implies.object, True, tag_implies.subject
    FROM tag_implies JOIN full_tags ON full_tags.tag = tag_implies.subject
) SELECT * FROM full_tags;

-- Reverse-lookup all descendant tags for a tag set.
WITH RECURSIVE rev(tag, ancestor, implied) as (
    SELECT id, id, False FROM tags WHERE id IN @tag_set
    UNION
    SELECT subject, rev.ancestor, True FROM tag_implies JOIN rev ON rev.tag=tag_implies.object
) SELECT ancestor, group_concat(tag) FROM rev GROUP BY ancestor;

-- Find all entity ids matching a set of tags. @tag_set is a group of tags, and @num_tags is
-- the number of tags in the group.
WITH RECURSIVE full_tags(entity, tag, implied, implied_by) AS (
    SELECT entity, tag, False, NULL FROM entity_tags
    UNION
    SELECT full_tags.entity, tag_implies.object, True, tag_implies.subject
    FROM tag_implies JOIN full_tags ON full_tags.tag = tag_implies.subject
)
SELECT entities.*
FROM full_tags JOIN entities ON entities.id = full_tags.entity
WHERE tag IN @tag_set
GROUP BY entities.id
HAVING count(entities.id) = count(@num_tags);

I'm able to look up the tags for a given entity, and I can query for entities matching a set of tags. However, the entity matching is pretty inefficient because it is forced to compute the entire full_tags CTE every time. I have a suspicion that one can use a variant of my descendant lookup query to do it more efficiently. Something like:

  • resolve all descendant tags for each tag of interest
  • query only direct tags on entities, searching for entities with a tag matching the descendant sets of each tag of interest

I'm not sure it can be expressed in pure SQL though. Does that seem like a sane approach?

I'm also currently trying to figure out how to prevent the user from creating cycles in the tags. Is it possible to enforce that as an SQL constraint, or do I need to enforce it at the application layer?

(9) By whereswaldon on 2022-08-12 02:36:52 in reply to 7.3 [link] [source]

Thank you very much for the advice and especially the discussion of handling optimization fences! I definitely think I'm hitting some of those, mostly around querying for entities matching sets of tags (instead of resolving tags for a given entity). I've written up the current state of things here if you have advice on how to tackle that side of the problem.

(10) By PChemGuy on 2022-08-15 05:38:43 in reply to 3 [link] [source]

My account is fixed, the link is alive.