Breadth-first graph traversal

(1.1) By abetlen on 2020-10-19 23:17:49 edited from 1.0 [link] [source]

Say I have a graph like:

1
|\
2 3
|/
4

5
|
6

stored in a sqlite database as:

create table edges(fromNode integer, toNode integer);
insert into edges(fromNode, toNode) values
 (1, 2),
 (1, 3),
 (2, 4),
 (3, 4),
 (5, 6);

I would like to generate a table of the breadth-first traversal of this graph from a specific starting node.

I can currently do this using a recursive cte and then performing an aggregate query grouping on the id outside of the recursive cte

with recursive
 bfs_tree(id, parent, distance) as (
  select 1, NULL, 0
  union all
  select edges.toNode, bfs_tree.id, bfs_tree.distance + 1
  from edges, bfs_tree
  where edges.fromNode = bfs_tree.id
  order by 3
 )
select id, parent, min(distance) from bfs_tree
group by id;

but this is very slow for larger graphs because the same nodes are being visited multiple times when there are multiple paths to a given node.

Is there any way to avoid revisiting nodes inside the recursive cte?

Thanks in advance

(2) By Keith Medcalf (kmedcalf) on 2020-10-20 05:16:43 in reply to 1.1 [link] [source]

See https://sqlite.org/forum/forumpost/b36527c43

(3) By abetlen on 2020-10-20 05:59:41 in reply to 2 [link] [source]

I actually did read through that discussion but the solution is not what I'm looking for. Using the path and instr method you showed there will cut off paths that become cycles but it won't cut off paths that revisit nodes.

Take for example the directed-acyclic graph:

1
|\
2 3
|/
4
|\
5 6
|/
7

$ cat bfs_test.sql
create table edges(fromNode integer, toNode integer);
insert into edges(fromNode, toNode) values
    (1, 2),
    (1, 3),
    (2, 4),
    (3, 4),
    (4, 5),
    (4, 6),
    (5, 7),
    (6, 7);

with recursive
    bfs_tree(id, visited, distance) as (
        select 1, '/' || 1 || '/', 0
        union all
        select toNode, visited || toNode || '/', distance + 1
        from edges, bfs_tree
        where fromNode = id and instr(visited, '/' || toNode || '/') == 0
        order by 3 asc
    )
select * from bfs_tree;

$ sqlite3 < bfs_test.sql
1|/1/|0
2|/1/2/|1
3|/1/3/|1
4|/1/2/4/|2
4|/1/3/4/|2
5|/1/2/4/5/|3
6|/1/2/4/6/|3
5|/1/3/4/5/|3
6|/1/3/4/6/|3
7|/1/2/4/5/7/|4
7|/1/2/4/6/7/|4
7|/1/3/4/5/7/|4
7|/1/3/4/6/7/|4

(4.1) By Keith Medcalf (kmedcalf) on 2020-10-20 10:17:28 edited from 4.0 in reply to 3 [link] [source]

You cannot do that because the reference to the recursive table in the recursive select is a reference to the singleton row being recursed. You do not have access to other rows in the recursive table.

That is, the path 1-2-4 and 1-3-4 are distinct recursion results and both are valid paths from node 1 to node 4.

(6) By Keith Medcalf (kmedcalf) on 2020-10-20 09:56:53 in reply to 4.0 [source]

You will note that the current (next release) version of SQLite3 allows more than one "recursive select" so you can now do something like this:

create table edges(fromNode integer, toNode integer);
insert into edges(fromNode, toNode) values
    (1, 2),
    (1, 3),
    (2, 4),
    (3, 4),
    (4, 5),
    (4, 6),
    (5, 7),
    (6, 7);

with bfs_tree(startat, endat, visited, distance)
  as (
         select 1, 1, '/' || 1 || '/', 0
      union all
         select startat, toNode, visited || toNode || '/', distance + 1
           from edges, bfs_tree
          where fromNode == endat
            and instr(visited, '/' || toNode || '/') == 0
      union all
         select startat, fromNode, visited || fromNode || '/', distance + 1
           from edges, bfs_tree
          where toNode == endat
            and instr(visited, '/' || fromNode || '/') == 0
    )
  select *
    from bfs_tree
order by startat, endat, distance;

so that your edges become bi-directional with the following results:

┌─────────┬───────┬───────────────┬──────────┐
│ startat │ endat │    visited    │ distance │
├─────────┼───────┼───────────────┼──────────┤
│ 1       │ 1     │ /1/           │ 0        │
│ 1       │ 2     │ /1/2/         │ 1        │
│ 1       │ 2     │ /1/3/4/2/     │ 3        │
│ 1       │ 3     │ /1/3/         │ 1        │
│ 1       │ 3     │ /1/2/4/3/     │ 3        │
│ 1       │ 4     │ /1/2/4/       │ 2        │
│ 1       │ 4     │ /1/3/4/       │ 2        │
│ 1       │ 5     │ /1/2/4/5/     │ 3        │
│ 1       │ 5     │ /1/3/4/5/     │ 3        │
│ 1       │ 5     │ /1/2/4/6/7/5/ │ 5        │
│ 1       │ 5     │ /1/3/4/6/7/5/ │ 5        │
│ 1       │ 6     │ /1/2/4/6/     │ 3        │
│ 1       │ 6     │ /1/3/4/6/     │ 3        │
│ 1       │ 6     │ /1/2/4/5/7/6/ │ 5        │
│ 1       │ 6     │ /1/3/4/5/7/6/ │ 5        │
│ 1       │ 7     │ /1/2/4/5/7/   │ 4        │
│ 1       │ 7     │ /1/2/4/6/7/   │ 4        │
│ 1       │ 7     │ /1/3/4/5/7/   │ 4        │
│ 1       │ 7     │ /1/3/4/6/7/   │ 4        │
└─────────┴───────┴───────────────┴──────────┘

However, the reference to the recursion table in the recursion parts is still a reference to the current singleton being recursed.

(7) By abetlen on 2020-10-20 13:26:03 in reply to 6 [link] [source]

Thank you for the answer! At least now I know it can't be done this way.

After some searching I was able to find this extension ext/misc/closure.c which implements transitive closures using a virtual table interface.

I'll spend some time reading through the source and see if I can implement something similar for this problem.

(5) By Warren Young (wyoung) on 2020-10-20 09:28:28 in reply to 3 [link] [source]

FYI, this forum now has a feature called Pikchr which lets you create inline diagrams with an easily-learned language. Except for very simple cases, it's easier than working out the ASCII art equivalent.

In your particular case, the result actually communicates what you're after better, because your ASCII DAG isn't "directed":

    scale = 0.6
C1: circle "1" fit
    arrow right 50%
C2: circle same "2"
    arrow same
C4: circle same "4"
    arrow same
C5: circle same "5"
    arrow same
C7: circle same "7"
C3: circle same "3" at 1cm above C2
C6: circle same "6" at 1cm above C5
    arrow from C1 to C3 chop
    arrow from C3 to C4 chop
    arrow from C4 to C6 chop
    arrow from C6 to C7 chop

There's a feature on every Fossil repository called /pikchrshow that provides a sandbox where you can construct these things before copy-pasting them into your post.

Click the diagram to see what the source text for the diagram looks like, or click the [source] link above to see the raw text of the whole post, so you can see how I injected the diagram into the post.

That diagram is based on the first one in this article. Point being, if you know where you can find a suitably-close Pikchr, you can rework it to purpose faster than crafting it from scratch.

(8) By abetlen on 2020-10-20 14:21:59 in reply to 5 [link] [source]

That's very handy, thank you!

(9.1) By tom (younique) on 2020-10-21 18:08:54 edited from 9.0 in reply to 1.1 [link] [source]

The documentation about https://sqlite.org/lang_with.html instructs (as you are doing here, too) to use ORDER BY to control whether to do a breadth-first or depth-first search. Is there a specific reason why the SQL keywords "SEARCH DEPTH/BREADTH FIRST BY ... SET ..." cannot be used?

(10) By Richard Hipp (drh) on 2020-10-21 18:42:29 in reply to 9.1 [link] [source]

I'm not aware of any such SQL keywords. What other SQL database engine supports this syntax?

(11) By tom (younique) on 2020-10-21 20:07:56 in reply to 10 [link] [source]

I've learned it from a book about database concepts (chapter 11.5.2):

Traversing order

The SQL standard provides an optional traversing clause, which determines the order in which the the nodes of the graph are visited. This is possible via the clause

SEARCH BREADTH FIRST BY column SET pseudo-column

the width search is possible, through which first all directly connected nodes (in in our example the places that can be reached directly from the starting point) then the nodes are output over two edges (places marked with (which can be reached with a single change), then the nodes over three edges etc.

On the other hand, by specifying the clause

SEARCH DEPTH FIRST BY column SET pseudo-column

forced a depth search. At first a path is searched until the end before the next path is visited.

In both cases, column specifies the attribute to be used for navigation - so it must be an attribute from the condition of the recursive compound be. The clause generates a pseudo column in the query result, in the non-recursive part of the request by an ORDER BY clause can be evaluated.

(Translated with www.DeepL.com/Translator (free version))

So, to me it looked as if it were part of SQL standard. Appearently, it's supported in Oracle and IBM DB2. The latter contains the following example:

WITH destinations (departure, arrival, connections, cost) AS 
    (SELECT f.departure, f.arrival, 0, price
            FROM flights f 
            WHERE f.departure = 'Chicago' 
     UNION ALL 
     SELECT r.departure, b.arrival, r.connections + 1,
                r.cost + b.price 
            FROM destinations r, flights b 
            WHERE r.arrival = b.departure) 
    SEARCH BREADTH FIRST BY arrival SET ordcol 
SELECT * 
  FROM destinations 
  ORDER BY ordcol

(12) By Simon Slavin (slavin) on 2020-10-22 14:04:38 in reply to 11 [link] [source]

As far as I can tell the SEARCH … BY construction was used in DB2 and copied by Oracle. I can't find it in any other implementation. SQL Server instead implements breadth-first searches using a construction CREATE CLUSTERED INDEX ….

It's possible that SEARCH … BY was introduced in SQL:2016. However, copies of that (now outdated) specification are expensive and can't be reproduced, so documenting code written to that standard is awkward.

SQLite, PostgreSQL, MySQL, and some other implementations of SQL, allow breadth-first graph traversal using a recursive Common Table Expression. You can see how it's done here

https://sqlite.org/lang_with.html#controlling_depth_first_versus_breadth_first_search_of_a_tree_using_order_by

However if you're not familiar with CTEs I suggest you start at the top of the page and work down, because they're one of the most complicated things SQLite does.

(13) By abetlen on 2020-11-17 21:58:54 in reply to 1.1 [link] [source]

I was able to solve this in the end by writing a virtual table extension. I've published that extension here.

It uses the AVL tree implementation from ext/misc/closure.c to keep an in-memory set of visited nodes. Currently, only integer node IDs are supported, there's also no filtering / ordering on the node neighbour query in the breadth-first search, but I'm planning to include both soon.

If anyone has any thoughts / suggestions, please let me know.