Representing Trees in a relational DB


Abstract

This document describes an approach for handling hierarchical data in a relational database.

 

Definitions

A hierarchical data structure (tree) is a set of nodes. Each node is a 2-tupel consisting of an id and a parent id. We therefore define two functions that operates on nodes as:

 

                     

 

The id of a node is unique for a tree. More formally we can express that as:

 

                     

 

We also define the root of a tree to be a node having itself as parent:

 

                     

 

Each tree is defined to have at least one root node. Given this we can easily extend our set of definitions to include path and sub tree. We can think of the path as if we pick a node from a tree and recursively pick the parent until we reach a root node. The sub tree of a node x is any node having x in its path.

 

                     

 

                     

 

Since the id of a node is unique we will sometimes use the abbreviations:

 

 and

 


Operations

If we assume that one of the most common operation in a database that stores hierarchical data, is to query for related data ­– for example, sub tree or path - it is not a very convenient thing to recursively ask for data until some condition is fulfilled. In the rest of this document we will focus on an approach that uses a table to remember the path of a node, and thereby also keeps information regarding sub tree. We start by defining the table that contains the nodes.

 

Figure 1: Table containing data.

 

In any sensible situation this table should contain some data as well, but that is of no concern here. We can then add the table that “remembers” the path.

Figure 2:Remember table and relation to data table.

 

In figure 3 a tiny tree and the resulting tables are shown.

 

Data

1

1

2

1

3

1

4

2

5

2

6

5

Path

2

1

3

1

4

2

4

1

5

2

5

1

6

5

6

2

6

1

Figure 3: A small tree, resulting Data and Path tables

 

To ask for the path of a given node is straightforward thing. Lets say we would like to retrieve the path of the node with id 5:

 

                      “Select AncestorId from Path where Id = 5”

 

The result of this query is 1 and 2. Asking for the sub tree of a given node is almost as easy. Assume we would like to know the sub tree of the node with id 2:

 

                      “Select Id from Path where AncestorId = 2”

 

The result of this question is as expected 4, 5 and 6. For more complicated questions we probably need to join more tables in the query, but the general idea remains the same.

 

Thus, given a remember table we can quite easily retrieve hierarchically information. The trouble starts when we change the content of the data table. When this happens we must ensure that the Path table is also changed to reflect the new content of the Data table.

 

Add node

If we add a new node (x,y) to Data, we must also add the following nodes to Path:

 

                     

 

*   denotes the Cartesian product, and path and parent functions are defined in the previous paragraph. In SQL we can define this operation as:

 

Insert into Path (values (x,y) union Select x, AncestorId from Path where Id = y”

 

Delete node

When deleting nodes we assume that nodes that act as parents for other nodes cant be deleted. Thus it is only leafs that can be deleted from the Data table. If we delete the leaf (x,y) from the Data table, we need to delete the set:

 

                     

 

from the Path table. This is easily expressed in SQL as:

 

      Delete from Path where Id = x”

 

So inserting and deleting (with the limitations given) is fairly simple. Now we will have a look on moving a set of nodes.

 


Move node

A naïve approach would be to just delete all nodes that are about to move and then insert them in their new position one by one. This is however most inefficient since there is no need to change the structure within the sub tree we are moving. If we for example move node 5 so that it becomes a child of node 3 instead of 2, there is no need to change the relation between 5 and 6.

 

Data

1

1

2

1

3

1

4

2

5

3

6

5

Path

2

1

3

1

4

2

4

1

5

3

5

1

6

5

6

3

6

1

Figure 4: Node 5 moved from 2 to 3 and resulting tables.

 

If we look in the Path table above we see that there are only two rows that are changed compared to the Path table in figure 3.        

 

A more efficient approach is to delete only those rows where Id belongs to the sub tree of x, and AncestorId belongs to the path of x. Formally we can express that as:

 

                     

 

or simplified as:

 

                     

 

The same goes for insertion. If we assume that we are moving x so that it becomes a child of y, it is only necessary to insert x and the sub tree of x, together with y and the path of y. The expression goes:

 

                     

 

The delete statement is rather straightforward to transform to SQL:

 

delete from Path where

            ((Id = x) or Id in (select Id from Path where AncestorId = x)) and

            AncestorId in (select AncestorId from Path where Id = x)

 


The insert part is a bit trickier, and we will do the transformation in two steps. First we will construct some pseudo SQL using the functions path, and subtree:

 

insert into Path (

            (values x, y)

union

            (select x, path(y) from Path)

union

            (select subtree(x), y from Path)

union

            (select subtree(x), path(y) from Path)

)

 

which when expanded becomes:

 

insert into Path (

            (values x, y)

union

            (select x, AncestorId from Path where Id = y)

union

            (select Id, y from Path where AncestorId = x)

union

            (select a.Id, b.AncestorId from Path a, Path b where

                      a.AncestorId = x and b.Id = y)

)

 

One can argue that it is not necessary to change the intersection between the old path and the new path.

 

                     

 

But it is left as an exercise for the reader to include this in the SQL statements :-)


Extensions

In the previous sections we have assumed that there is a single path – that might be empty - from any node. If we remove this assumption we are dealing with a Directed Acyclic Graph (DAG) instead of a tree. How does this affect our implementation?

 

Lets start by stating that we can’t insert a new element in the Data table containing the same Id and another ParentId as another row. Doing so would violate the primary key constraint of the Data table. Hence we will need some kind of “Link” table between nodes. Furthermore lets limit the functionality of links so that the FromId and ToId may not belong to the same tree. That is, if we link a node from x to y, the intersection of path(x) and path(y) must be the empty set. By disallowing links within a tree we avoids the problem with cycles.

Figure 5 By allowing links we end up with a Graph.

 

Since a node can be linked to several other nodes, we will need an n-n relation to represent this structure.

 

Figure 6 Link table and relations to other tables involved.

 

From now on we will denote a link from x to y with:

 

                     

 

x is the node we are linking and y is the position we link it to. While we are at it we might define functions to handle links.

 

                                                                                 

 

Since the id of a node is unique we will sometimes use the abbreviation:

 

                     

 

Lets start by looking at what will happen when we apply our new set of operations.

 

Add Link

We saw earlier that inserting a node in the Data table should be reflected in the Path table as:

 

 

Without defining the specific operations we need, it is clear that adding a link between x and y means that we should add not only the link, but also the path from the target to the Path table. Thus adding a link means that we should insert

 

                     

 

into the Path table. Note that we contrary to the situation in “add node”, also must take into consideration the sub tree of x. Furthermore we cant just insert (x,y), instead we have to insert x and its sub tree together with y. Thus, we end up with the insert part of the move node operation. This makes perfect sense since we are sort of copying the node. We must also remember to exclude elements already in the Path table. It is probably easiest to exclude the Path table itself from the insert expression. Transformation to SQL is relatively simple and goes like:

 

insert into Path (

            select * from (

                      (values x, y)

            union

                      (select x, AncestorId from Path where Id = y)

            union

                      (select Id, y from Path where AncestorId = x)

            union

                      (select a.Id, b.AncestorId from Path a, Path b where

                                 a.AncestorId = x and b.Id = y)

            )  as T (ActorId, AncestorId)

 where not exists (

            select * from Path P where

 P.ActorId = T.ActorId and P.AncestorId = T.AncestorId

)

 

One might note how an extra select clause is added in order to be able to refer to the union in the “exists” clause.

 

 


Remove Link

This is the time when we need to put our thinking hats back on. In fact, this operation is the one to blame for this document. Lets start with a simple case like the one in Figure 7

 

Figure 7

 

If we look at the Path table for node 6 we have the elements:

 

                      {(6,5),(6,2), (6,1), (6,14), (6,12), (6,11)}

 

It is clear that we should remove the elements (6,14), (6,12), (6,11) when we remove the link [6,4]. Naively we can assume that the expression

 

 

satisfies our needs. However, a quick glance in Figure 8 reveals that this is not true. If we would delete this expression from the Path table, we would also delete the nodes (6,12) and (6,11). Since node 6 should still be linked to 16, and 12 and 11 lies in the path of 16 we can’t remove those nodes. Do note that we have included the sub tree of x in our set.             

 

Figure 8 Links from node 6 to 14 and 16

 


The solution is to exclude other link paths. In order to do this we need to introduce a new function complement. We define the complement of a link to be all links but the link itself.

 

 

With the help of this function we can define the elements that should be deleted as the closure of the link except the closure of all other links. For simplicity we will also define the closure of a link [x,y] as:

 

                                        

and the closure of a set of links as:

 

                  

 

Given this we are now ready to deal with the removal of a link. The elements that shall be removed is:

 

                     

 

This is perhaps not trivial to transform to SQL (it wasn’t for me anyhow). We already know how express the closure of [x,y], thus we need to think of second part of the expression. It is clear that we must use the same kind of union construction to express the closure, but in extent we must also join in the remaining links in each sub expression.

 

select g.ActorId, g.GroupId from DataLink g where

                      (g.ActorId <> x or g.GroupId <> y)

union

select g.ActorId, b.AncestorId from Path b, ActorLink g where

                                                                 (g.GroupId <> y) and

                                                                 (b.ActorId = g.GroupId)

union

select a.ActorId, g.GroupId from Path a, ActorLink g where

                                                                 (g.ActorId <> x) and

                                                                 (a.AncestorId = g.ActorId)

union

select a.ActorId, b.AncestorId from Path a, Path b, DataLink g where

                                                                 (g.ActorId <> x or g.GroupId <> y) and

                                                                 (a.AncestorId = g.ActorId) and

                                                                 (b.ActorId = g.GroupId)

 


Therefore the final expression to remove a link becomes:

 

delete from Path where (ActorId, AncestorId) in

(

            select * from (

                      (values x, y)

                      union

                      (select x, AncestorId from Path where Id = y)

                      union

                      (select Id, y from Path where AncestorId = x)

                      union

                      (select a.Id, b.AncestorId from Path a, Path b where

                                 a.AncestorId = x and b.Id = y)

            ) except select * from (

select g.ActorId, g.GroupId from DataLink g where

                                                                 (g.ActorId <> x or g.GroupId <> y)

union

select g.ActorId, b.AncestorId from Path b, ActorLink g where

                                                                 (g.GroupId <> y) and

                                                                 (b.ActorId = g.GroupId)

union

select a.ActorId, g.GroupId from Path a, ActorLink g where

                                                                 (g.ActorId <> x) and

                                                                 (a.AncestorId = g.ActorId)

union

select a.ActorId, b.AncestorId from Path a, Path b, DataLink g where

                                                                 (g.ActorId <> x or g.GroupId <> y) and

                                                                 (a.AncestorId = g.ActorId) and

                                                                 (b.ActorId = g.GroupId)

            )

)               
Summary

This section is a summary of how different operations on a tree should be reflected in the Path table.

 

Add node

 

Insert:

 

Insert into Path (values (x,y) union Select x, AncestorId from Path where Id = y”

 

Delete node

 

Delete:

 

Delete from Path where Id = x”

 

Move node

 

Delete:

Insert:  

 

delete from Path where

            ((Id = x) or Id in (select Id from Path where AncestorId = x)) and

            AncestorId in (select AncestorId from Path where Id = x)

 

insert into Path (

            (values x, y)

union

            (select x, AncestorId from Path where Id = y)

union

            (select Id, y from Path where AncestorId = x)

union

            (select a.Id, b.AncestorId from Path a, Path b where

                      a.AncestorId = x and b.Id = y)

)

 

Optionally one might exclude the set:

 

 

from both operations.


Add Link

 

Insert:  

 

insert into Path (

            select * from (

                      (values x, y)

            union

                      (select x, AncestorId from Path where Id = y)

            union

                      (select Id, y from Path where AncestorId = x)

            union

                      (select a.Id, b.AncestorId from Path a, Path b where

                                 a.AncestorId = x and b.Id = y)

            )  as T (ActorId, AncestorId)

 where not exists (

            select * from Path P where

 P.ActorId = T.ActorId and P.AncestorId = T.AncestorId

)

 

 

Remove Link

 

Delete:

 

                  

 

 


delete from Path where (ActorId, AncestorId) in

(

            select * from (

                      (values x, y)

                      union

                      (select x, AncestorId from Path where Id = y)

                      union

                      (select Id, y from Path where AncestorId = x)

                      union

                      (select a.Id, b.AncestorId from Path a, Path b where

                                 a.AncestorId = x and b.Id = y)

            ) except select * from (

select g.ActorId, g.GroupId from DataLink g where

                                                                 (g.ActorId <> x or g.GroupId <> y)

union

select g.ActorId, b.AncestorId from Path b, ActorLink g where

                                                                 (g.GroupId <> y) and

                                                                 (b.ActorId = g.GroupId)

union

select a.ActorId, g.GroupId from Path a, ActorLink g where

                                                                 (g.ActorId <> x) and

                                                                 (a.AncestorId = g.ActorId)

union

select a.ActorId, b.AncestorId from Path a, Path b, DataLink g where

                                                                 (g.ActorId <> x or g.GroupId <> y) and

                                                                 (a.AncestorId = g.ActorId) and

                                                                 (b.ActorId = g.GroupId)

            )

)