Representing Trees in
a relational DB
This
document describes an approach for handling hierarchical data in a relational
database.
A
hierarchical data structure (tree) is a set of nodes. Each node is a 2tupel
consisting of an id and a parent id. We therefore define two functions that
operates on nodes as:
_{}
The id of a
node is unique for a tree. More formally we can express that as:
_{}
We also
define the root of a tree to be a node having itself as parent:
_{}
Each tree
is defined to have at least one root node. Given this we can easily extend our
set of definitions to include path and sub tree. We can think of the path as if
we pick a node from a tree and recursively pick the parent until we reach a
root node. The sub tree of a node x is any node having x in its path.
_{}
_{}
Since the
id of a node is unique we will sometimes use the abbreviations:
_{} and _{}
If we
assume that one of the most common operation in a database that stores
hierarchical data, is to query for related data – for example, sub tree or
path  it is not a very convenient thing to recursively ask for data until some
condition is fulfilled. In the rest of this document we will focus on an
approach that uses a table to remember the path of a node, and thereby also
keeps information regarding sub tree. We start by defining the table that
contains the nodes.
Figure 1: Table containing data.
In any sensible situation this table should contain some data as well, but that is of no concern here. We can then add the table that “remembers” the path.
Figure
2:Remember table and relation to data table.
In figure 3
a tiny tree and the resulting tables are shown.



Figure
3: A small tree, resulting Data and Path tables
To ask for
the path of a given node is straightforward thing. Lets say we would like to
retrieve the path of the node with id 5:
“Select AncestorId from
Path where Id = 5”
The result
of this query is 1 and 2. Asking for the sub tree of a given node is almost as
easy. Assume we would like to know the sub tree of the node with id 2:
“Select Id from Path
where AncestorId = 2”
The result
of this question is as expected 4, 5 and 6. For more complicated questions we
probably need to join more tables in the query, but the general idea remains
the same.
Thus, given
a remember table we can quite easily retrieve hierarchically information. The
trouble starts when we change the content of the data table. When this happens
we must ensure that the Path table is also changed to reflect the new content
of the Data table.
If we add a
new node (x,y) to Data, we must also add the following nodes to Path:
_{}
_{} denotes the Cartesian product, and path and parent functions are defined in the previous paragraph. In SQL we can define this operation as:
“Insert into Path (values (x,y)
union Select x, AncestorId from Path where Id = y”
When deleting nodes we assume that nodes that
act as parents for other nodes cant be deleted. Thus it is only leafs that can
be deleted from the Data table. If we delete the leaf (x,y) from the Data
table, we need to delete the set:
_{}
from the Path table. This is easily expressed in SQL as:
“Delete from Path where Id = x”
So inserting and deleting (with the limitations
given) is fairly simple. Now we will have a look on moving a set of nodes.
A naïve approach would be to just delete all
nodes that are about to move and then insert them in their new position one by
one. This is however most inefficient since there is no need to change the structure
within the sub tree we are moving. If we for example move node 5 so that it
becomes a child of node 3 instead of 2, there is no need to change the relation
between 5 and 6.



Figure 4: Node 5 moved from 2 to 3 and resulting tables.
If we look in the Path table above we see that
there are only two rows that are changed compared to the Path table in figure
3.
A more efficient approach is to delete only
those rows where Id belongs to the sub tree of x, and AncestorId belongs to the
path of x. Formally we can express that as:
_{}
or simplified as:
_{}
The same goes for insertion. If we assume that we are moving x so that it becomes a child of y, it is only necessary to insert x and the sub tree of x, together with y and the path of y. The expression goes:
_{}
The delete statement is rather straightforward
to transform to SQL:
delete from Path where
((Id
= x) or Id in (select Id from Path where AncestorId = x)) and
AncestorId
in (select AncestorId from Path where Id = x)
The insert part is a bit trickier, and we will
do the transformation in two steps. First we will construct some pseudo SQL
using the functions path, and subtree:
insert into Path (
(values
x, y)
union
(select
x, path(y) from Path)
union
(select
subtree(x), y from Path)
union
(select
subtree(x), path(y) from Path)
)
which when expanded becomes:
insert into Path (
(values
x, y)
union
(select
x, AncestorId from Path where Id = y)
union
(select
Id, y from Path where AncestorId = x)
union
(select
a.Id, b.AncestorId from Path a, Path b where
a.AncestorId
= x and b.Id = y)
)
One can argue that it is not necessary to
change the intersection between the old path and the new path.
_{}
But it is left as an exercise for the reader to
include this in the SQL statements :)
In the
previous sections we have assumed that there is a single path – that might be
empty  from any node. If we remove this assumption we are dealing with a
Directed Acyclic Graph (DAG) instead of a tree. How does this affect our
implementation?
Lets start
by stating that we can’t insert a new element in the Data table containing the
same Id and another ParentId as another row. Doing so would violate the primary
key constraint of the Data table. Hence we will need some kind of “Link” table
between nodes. Furthermore lets limit the functionality of links so that the
FromId and ToId may not belong to the same tree. That is, if we link a node
from x to y, the intersection of path(x) and path(y) must be the empty set. By
disallowing links within a tree we avoids the problem with cycles.
Figure 5 By allowing links we end up with a Graph.
Since a node can be linked to several other nodes, we will need an nn relation to represent this structure.
Figure 6 Link table and relations to other tables involved.
From now on we will denote a link from x to y with:
_{}
x is the node we are linking and y is the position we link it to. While we are at it we might define functions to handle links.
_{}
Since the
id of a node is unique we will sometimes use the abbreviation:
_{}
Lets start by looking at what will happen when we apply our new set of operations.
We saw earlier that inserting a node in the Data table should be reflected in the Path table as:
_{}
Without defining the specific operations we need, it is clear that adding a link between x and y means that we should add not only the link, but also the path from the target to the Path table. Thus adding a link means that we should insert
_{}
into the Path table. Note that we contrary to the situation in “add node”, also must take into consideration the sub tree of x. Furthermore we cant just insert (x,y), instead we have to insert x and its sub tree together with y. Thus, we end up with the insert part of the move node operation. This makes perfect sense since we are sort of copying the node. We must also remember to exclude elements already in the Path table. It is probably easiest to exclude the Path table itself from the insert expression. Transformation to SQL is relatively simple and goes like:
insert into Path (
select * from (
(values
x, y)
union
(select
x, AncestorId from Path where Id = y)
union
(select
Id, y from Path where AncestorId = x)
union
(select
a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
) as T (ActorId,
AncestorId)
where not exists (
select * from Path P where
P.ActorId = T.ActorId and P.AncestorId =
T.AncestorId
)
One might note how an extra select clause is added in order to be able to refer to the union in the “exists” clause.
This is the time when we need to put our thinking hats back on. In fact, this operation is the one to blame for this document. Lets start with a simple case like the one in Figure 7
Figure 7
If we look at the Path table for node 6 we have the elements:
{(6,5),(6,2), (6,1), (6,14), (6,12), (6,11)}
It is clear that we should remove the elements (6,14), (6,12), (6,11) when we remove the link [6,4]. Naively we can assume that the expression
_{}
satisfies our needs. However, a quick glance in Figure 8 reveals that this is not true. If we would delete this expression from the Path table, we would also delete the nodes (6,12) and (6,11). Since node 6 should still be linked to 16, and 12 and 11 lies in the path of 16 we can’t remove those nodes. Do note that we have included the sub tree of x in our set.
Figure 8 Links from node 6 to 14 and 16
The solution is to exclude other link paths. In order to do this we need to introduce a new function complement. We define the complement of a link to be all links but the link itself.
_{}
With the help of this function we can define the elements that should be deleted as the closure of the link except the closure of all other links. For simplicity we will also define the closure of a link [x,y] as:
_{}
and the closure of a set of links as:
_{}
Given this we are now ready to deal with the removal of a link. The elements that shall be removed is:
_{}
This is perhaps not trivial to transform to SQL (it wasn’t for me anyhow). We already know how express the closure of [x,y], thus we need to think of second part of the expression. It is clear that we must use the same kind of union construction to express the closure, but in extent we must also join in the remaining links in each sub expression.
select g.ActorId, g.GroupId from DataLink g where
(g.ActorId <> x or g.GroupId <> y)
union
select g.ActorId, b.AncestorId from Path b, ActorLink g where
(g.GroupId <> y) and
(b.ActorId = g.GroupId)
union
select a.ActorId, g.GroupId from Path a, ActorLink g where
(g.ActorId <> x) and
(a.AncestorId = g.ActorId)
union
select a.ActorId, b.AncestorId from Path a, Path b, DataLink g where
(g.ActorId <> x or g.GroupId <> y) and
(a.AncestorId = g.ActorId) and
(b.ActorId = g.GroupId)
Therefore the final expression to remove a link becomes:
delete from Path where (ActorId, AncestorId)
in
(
select * from (
(values
x, y)
union
(select
x, AncestorId from Path where Id = y)
union
(select
Id, y from Path where AncestorId = x)
union
(select
a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
)
except select * from (
select
g.ActorId, g.GroupId from DataLink g where
(g.ActorId
<> x or g.GroupId <> y)
union
select
g.ActorId, b.AncestorId from Path b, ActorLink g where
(g.GroupId
<> y) and
(b.ActorId
= g.GroupId)
union
select a.ActorId,
g.GroupId from Path a, ActorLink g where
(g.ActorId
<> x) and
(a.AncestorId
= g.ActorId)
union
select
a.ActorId, b.AncestorId from Path a, Path b, DataLink g where
(g.ActorId
<> x or g.GroupId <> y) and
(a.AncestorId
= g.ActorId) and
(b.ActorId
= g.GroupId)
)
This
section is a summary of how different operations on a tree should be reflected
in the Path table.
Insert: _{}
“Insert into Path (values (x,y) union Select
x, AncestorId from Path where Id = y”
Delete: _{}
“Delete from Path where Id = x”
Delete: _{}
Insert: _{}
delete from Path where
((Id
= x) or Id in (select Id from Path where AncestorId = x)) and
AncestorId
in (select AncestorId from Path where Id = x)
insert into Path (
(values
x, y)
union
(select
x, AncestorId from Path where Id = y)
union
(select
Id, y from Path where AncestorId = x)
union
(select
a.Id, b.AncestorId from Path a, Path b where
a.AncestorId
= x and b.Id = y)
)
Optionally one might exclude the set:
_{}
from both operations.
Insert: _{}
insert into Path (
select * from (
(values
x, y)
union
(select
x, AncestorId from Path where Id = y)
union
(select
Id, y from Path where AncestorId = x)
union
(select
a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
) as T (ActorId,
AncestorId)
where not exists (
select * from Path P where
P.ActorId = T.ActorId and P.AncestorId =
T.AncestorId
)
Delete:
_{}
_{}
_{}
_{}
delete from Path where (ActorId,
AncestorId) in
(
select * from (
(values
x, y)
union
(select
x, AncestorId from Path where Id = y)
union
(select
Id, y from Path where AncestorId = x)
union
(select
a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
)
except select * from (
select g.ActorId,
g.GroupId from DataLink g where
(g.ActorId
<> x or g.GroupId <> y)
union
select
g.ActorId, b.AncestorId from Path b, ActorLink g where
(g.GroupId
<> y) and
(b.ActorId
= g.GroupId)
union
select
a.ActorId, g.GroupId from Path a, ActorLink g where
(g.ActorId
<> x) and
(a.AncestorId
= g.ActorId)
union
select
a.ActorId, b.AncestorId from Path a, Path b, DataLink g where
(g.ActorId
<> x or g.GroupId <> y) and
(a.AncestorId
= g.ActorId) and
(b.ActorId
= g.GroupId)
)
)