The nested set model is a convenient way to represent hierarchies in SQL. By taking advantage of the set nature of relations, it provides quick and relatively simple queries to determine, for example, all descendants of a given node. Such queries are more computationally difficult using adjacency lists.
The nested set however, has performance limitations on insert, update, and delete. This arises because the nested set model encodes the hierarchy based on what the set contains: when a node is inserted or deleted, all of the sets containing the node must be updated as well. In other words, the encoding depends on the current state of the database. This behavior makes the nested set model less desirable for large, dynamic hierarchies.
An alternative to a data-dependent encoding is a static hierarchy. An example of this is a materialized path, where nodes 1.1, 1.2, and 1.3 are the first, second, and third children of node 1. Node 1.4 is still the fourth child of node 1 even if there don’t happen to be four (or more) children assigned to node 1 in the database. The encoding value, 1.4, is waiting to be used when the time arises.
With all hierarchies, there are four basic queries:
And there are the three additional operations of inserting a node, deleting a node, or moving a node.
With a static hierarchy, encoding values are preselected. The node’s position relative to all another nodes is known directly from the encoding value. Given node 1.2.2, we know its parent is node 1.2, and its ancestors are nodes 1.2 and 1. We know its children will all have an encoding of 1.2.1.x, and all of its descendants will be prefixed by 1.2.1.
This makes inserts trivial. Inserting a child of a node requires only finding out how many children the node currently has, and giving the new node the encoding appropriate for the next child. If node 1 currently has three children, the new child will have encoding 1.4. No nodes need to be updated when a new node is inserted.
Deleting a node requires finding all of the nodes younger siblings: nodes with the same parent and greater sibling number. Continuing the previous example, if we delete node 1.2, nodes 1.3 and 1.4 become 1.2 and 1.3 respectively. If 1.3 and 1.4 have descendants, we need a way of translating the subtrees of 1.3 and 1.4 to their new positions. This makes translation a fundamental operation when working with hierarchies.
If materialized paths are used as the encoding, simply find all of the current descendants of 1.3, and replace the 1.3 with 1.2. This translation is repeated for each younger sibling. Updating is simply translating a node and its subtree to a new inserted position and translating its former younger siblings to fill the hole it left. If the database model requires inserting a node between two existing nodes, just translate the younger siblings to an even younger position.
Materialized paths represented as text in the database makes all hierarchy operations simply a matter of parsing and manipulating strings. In PostgreSQL, one can even create a domain to make sure all paths are properly represented.
create domain ni_path as
text
not null
check (value ~ '^([1-9]\d*)(\.[1-9]\d*)*$');
One criticism of materialized paths is that string manipulation is undesirable. An encoding that is based on a mathematical algorithm might be better. One possible algorithm is to use binary fractions, which is explained further below. Other algorithms can be used to map the hierarchy as well, such as Farey fractions or continued fractions. Vadim Tropashko has written a number of articles on various nested interval implementations for Oracle using PL/SQL. My implementation is largely a port of his work for PostgreSQL using PL/pgSQL. The point of any of these alogorithms is to generate a static hierarchy. The only difference is how the hierarchy is encoded and manipulated.
The model is called the nested interval model, which implies one dimensional intervals each bounded by two points. However, I think the math is easier to follow if I use two dimensions. If you’re not interested in the math, feel free to skip down to the functions.
Take the unit square described by 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. The goal is to divide this square into progressively smaller, non-overlapping domains, nodes being points represented by (x,y) coordinates. Their position, i.e., which domain they lie within, determines their ancestors, and also the domain of their descendants.
Following Tropashko’s example, let’s place the first node—node 1—at (1,1⁄2). Where do we place this node’s children? Well, the x-coordinate of the first child of node 1—node 1.1—is the same as its parent, so x1.1 = 1. The y-coordinate of the first child is located at the midpoint of the y-position of the parent (y = 1⁄2) and the depth-first convergence point, where the x-position of the parent meets the diagonal x = y. For node 1, the diagonal intercepts x = 1 at y = 1, so the y-coordiate of the first child is y = 3⁄4: Node 1.1 is located at (1,3⁄4). You can see that node 1.1.1, the first child of node 1.1, is at (1,7⁄8), and that 1.1.1.1 is located at (1,15⁄16).
So how about siblings? Siblings are located along the line between the first child and the breadth-first convergence point, where the y-position of the parent intercepts the diagonal x = y. The second child—node 1.2—is located at (3⁄4, 5⁄8). Node 1.3 (the third child) is half-way between node 1.2 and the breadth-first convergence point, at (5⁄8, 9⁄16).
The domain of node 1 is shown in red, and the domains of nodes 2 and 3 are yellow and green, respectively. The depth-first and breadth-first convergence points are simple the points where the lines x = x1 and y = y1 intersect the diagonal: the other two points of the domain triangle.
Also, note that all immediate children of a node z are located vertically half-way between the diagonal and the line y = yz; and horizontally half-way between their next-eldest sibling and the diagonal. This will be useful when finding the parent and sibling number of a given node.
Following this algorithm, we can divide the unit square into smaller and smaller domains, determined by the location of a parent node. In the figure, all children of node 1 are found within the right triangle formed by the diagonal x = y, y = y1, x = x1. In turn, all children of node 1.1 are within the right triangle bounded by the x = y, x = x1.1, y = y1.1.
This fractal pattern continues to infinity, or to the precision of the computer system. As we move deeper into the tree, numbers get big pretty quickly, and quite close together. Accuracy is important, and floating point math is not sufficiently accurate to safely represent the hierarchy. A rational number type which accurately represents integer numerators and denominators is desirable, and something I’d like to implement in PostgreSQL as a base type, but for the time being I’ll represent rational numbers as numerator-denominator integer pairs. Each point is represented by four integers. For example, node 1 is xnum = 1, xden = 1, ynum = 1, yden = 2. Functions will likewise have numerator and denominator variants.
node | x | y | sum |
---|---|---|---|
1 | 1 | 1⁄2 | 3⁄2 |
1.1 | 1 | 3⁄4 | 7⁄4 |
1.1.1 | 1 | 7⁄8 | 15⁄8 |
1.1.1.1 | 1 | 15⁄16 | 31⁄16 |
1.1.2 | 7⁄8 | 13⁄16 | 27⁄16 |
1.1.3 | 13⁄16 | 25⁄32 | 51⁄32 |
1.2 | 3⁄4 | 5⁄8 | 11⁄8 |
1.2.1 | 3⁄4 | 11⁄16 | 23⁄16 |
1.3 | 5⁄8 | 9⁄16 | 19⁄16 |
2 | 1⁄2 | 1⁄4 | 3⁄4 |
2.1 | 1⁄2 | 3⁄8 | 7⁄8 |
2.1.1 | 1⁄2 | 7⁄16 | 15⁄16 |
2.2 | 3⁄8 | 5⁄16 | 11⁄16 |
2.3 | 5⁄16 | 9⁄32 | 19⁄32 |
3 | 1⁄4 | 1⁄8 | 3⁄8 |
3.1 | 1⁄4 | 3⁄16 | 7⁄16 |
3.2 | 3⁄16 | 5⁄32 | 11⁄32 |
It should also be obvious from the fact that the algorithm involves repeated halving that these are binary fractions: the denominator is always a power of 2.
Tropashko points out that the x and y values of each node are related and can be equivalently represented as the sum of x and y. Inspection shows that xnum × yden = ynum + 1. For node 1, num1 = 3, den1 = 2, or 3⁄2. These sums will be the basis of the encoding.
Given their sum, we can derive the x and y numerators and denominators using functions. I prefixed my functions with ni for nested interval. The functions take advantage of binary fractions in two ways:
create function ni_x_numer (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
begin
-- add one to the numerator so the numerator becomes 2x
ret_numer := numer + 1;
ret_denom := denom * 2;
-- reduce to simplest form
while floor(ret_numer / 2.0) = ret_numer / 2.0 loop
ret_numer := ret_numer / 2;
ret_denom := ret_denom / 2;
end loop;
return ret_numer;
end;
';
create function ni_x_denom (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
begin
ret_numer := numer + 1;
ret_denom := denom * 2;
while floor(ret_numer / 2.0) = ret_numer / 2.0 loop
ret_numer := ret_numer / 2;
ret_denom := ret_denom / 2;
end loop;
return ret_denom;
end;
';
create function ni_y_numer (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
begin
ret_numer := ni_x_numer(numer,denom);
ret_denom := ni_x_denom(numer,denom);
while ret_denom < denom loop
ret_numer := ret_numer * 2;
ret_denom := ret_denom * 2;
end loop;
ret_numer = numer - ret_numer;
-- reduce to simplest form
while floor(ret_numer / 2.0) = ret_numer / 2.0 loop
ret_numer = ret_numer / 2;
ret_denom = ret_denom / 2;
end loop;
return ret_numer;
end;
';
create function ni_y_denom (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
begin
ret_numer := ni_x_numer(numer,denom);
ret_denom := ni_x_denom(numer,denom);
while ret_denom < denom loop
ret_numer := ret_numer * 2;
ret_denom := ret_denom * 2;
end loop;
ret_numer = numer - ret_numer;
while floor(ret_numer / 2.0) = ret_numer / 2.0 loop
ret_numer = ret_numer / 2;
ret_denom = ret_denom / 2;
end loop;
return ret_denom;
end;
';
The two ni_x
functions
So, let’s see if it works. I set up a table of the node x-y-sum data.
create table ni_data (
node ni_path not null unique
, numer integer not null
, denom integer not null
, unique (numer,denom)
);
insert into ni_data (node, numer, denom) values ('1',3,2);
insert into ni_data (node, numer, denom) values ('1.1',7,4);
insert into ni_data (node, numer, denom) values ('1.1.1',15,8);
insert into ni_data (node, numer, denom) values ('1.1.1.1',31,16);
insert into ni_data (node, numer, denom) values ('1.1.2',27,16);
insert into ni_data (node, numer, denom) values ('1.1.3',51,32);
insert into ni_data (node, numer, denom) values ('1.2',11,8);
insert into ni_data (node, numer, denom) values ('1.2.1',23,16);
insert into ni_data (node, numer, denom) values ('1.3',19,16);
insert into ni_data (node, numer, denom) values ('2',3,4);
insert into ni_data (node, numer, denom) values ('2.1',7,8);
insert into ni_data (node, numer, denom) values ('2.1.1',15,16);
insert into ni_data (node, numer, denom) values ('2.2',11,16);
insert into ni_data (node, numer, denom) values ('2.3',19,32);
insert into ni_data (node, numer, denom) values ('3',3,8);
insert into ni_data (node, numer, denom) values ('3.1',7,16);
insert into ni_data (node, numer, denom) values ('3.2',11,32);
select
node
, ni_x_numer(numer,denom) || '/' || ni_x_denom(numer,denom) as x
, ni_y_numer(numer,denom) || '/' || ni_y_denom(numer,denom) as y
, numer || '/' || denom as sum
from ni_data
order by node;
node | x | y | sum
---------+-------+-------+-------
1 | 1/1 | 1/2 | 3/2
1.1 | 1/1 | 3/4 | 7/4
1.1.1 | 1/1 | 7/8 | 15/8
1.1.1.1 | 1/1 | 15/16 | 31/16
1.1.2 | 7/8 | 13/16 | 27/16
1.1.3 | 13/16 | 25/32 | 51/32
1.2 | 3/4 | 5/8 | 11/8
1.2.1 | 3/4 | 11/16 | 23/16
1.3 | 5/8 | 9/16 | 19/16
2 | 1/2 | 1/4 | 3/4
2.1 | 1/2 | 3/8 | 7/8
2.1.1 | 1/2 | 7/16 | 15/16
2.2 | 3/8 | 5/16 | 11/16
2.3 | 5/16 | 9/32 | 19/32
3 | 1/4 | 1/8 | 3/8
3.1 | 1/4 | 3/16 | 7/16
3.2 | 3/16 | 5/32 | 11/32
(17 rows)
So far, so good!
As I mentioned before, the key to nested intervals is that the entire tree is determined by the algorithm used to generate it. I plotted the children of each node using two rules:
These rules can also be expressed as functions. The pow
function in PostgreSQL is based on floating point math. Manfred Koizar posted a function to compute powers of 2 using integers to the pgsql-general mailing list. I have reproduced his function (with minor editorializing) here and use it in the functions to calculate the sibling number and parent of a node.
create or replace function pow2(bigint)
returns bigint
language plpgsql as '
declare
e bigint;
ret bigint;
begin
e := $1;
if (e < 0) then
raise exception ''2 ^ % does not fit into a bigint'', e;
end if;
if (e > 62) then
raise exception ''2 ^ % does not fit into a bigint'', e;
end if;
ret = 1;
while (e > 0) loop
ret = 2 * ret;
e = e - 1;
end loop;
return ret;
end;
';
create function ni_child_numer (
integer -- numer
, integer -- denom
, integer -- child
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
child alias for $3;
begin
return numer * pow2(child) + 3 - pow2(child);
end;
';
create function ni_child_denom (
integer -- numer
, integer -- denom
, integer -- child
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
child alias for $3;
begin
return denom * pow2(child);
end;
';
Let’s see if these worked.
-- node 1, sum = 3/2. first child (1.1)
select ni_child_numer(3,2, 1) || '/' || ni_child_denom(3,2, 1) as child;
child
-------
7/4
(1 row)
-- node 1.1, sum = 7/4, first child (1.1.1)
select ni_child_numer(7,4, 1) || '/' || ni_child_denom(7,4, 1) as child;
child
-------
15/8
(1 row)
-- node 1.1, third child (1.1.3)
select ni_child_numer(7,4, 3) || '/' || ni_child_denom(7,4, 3) as child;
child
-------
51/32
(1 row)
-- node 3, sum = 3/8, second child (3.2)
select ni_child_numer(3,8, 2) || '/' || ni_child_denom(3,8, 2) as child;
child
-------
11/32
(1 row)
Given the child node, we can find its parent and sibling number. These functions are based on the earlier observation of a child’s position relative to its parent:
create function ni_parent_numer (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
begin
-- special case where node is the root
if numer = 3 then
return null;
end if;
ret_numer := (numer - 1) / 2 ;
ret_denom := denom / 2 ;
while floor (( ret_numer - 1) / 4.0) = (ret_numer - 1) / 4.0 loop
ret_numer := (ret_numer + 1) / 2;
ret_denom := ret_denom / 2;
end loop;
return ret_numer;
end;
';
create function ni_parent_denom (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
begin
if numer = 3 then
return null;
end if;
ret_numer := (numer - 1) / 2 ;
ret_denom := denom / 2 ;
while floor (( ret_numer - 1) / 4.0) = (ret_numer - 1) / 4.0 loop
ret_numer := (ret_numer + 1) / 2;
ret_denom := ret_denom / 2;
end loop;
return ret_denom;
end;
';
create function ni_sibling_number (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
ret integer;
begin
ret_numer := (numer - 1) / 2;
ret_denom := denom / 2;
ret := 1;
while floor ((ret_numer - 1) / 4.0) = (ret_numer - 1) / 4.0 loop
if ret_numer = 1 and ret_denom = 1 then
return ret;
end if;
ret_numer := (ret_numer + 1) / 2;
ret_denom := ret_denom / 2;
ret := ret + 1;
end loop;
return ret;
end;
';
-- node 3.2, sum 11/32
select ni_parent_numer(11,32) || '/' || ni_parent_denom(11,32) as parent
, ni_sibling_number(11,32) as child_number;
parent | child_number
--------+--------------
3/8 | 2
(1 row)
-- node 2.1.1, sum 15/16
select ni_parent_numer(15,16) || '/' || ni_parent_denom(15,16) as parent
, ni_sibling_number(15,16) as child_number;
parent | child_number
--------+--------------
7/8 | 1
(1 row)
Comparing with our previous data, we can verify that this is correct.
The dotted path notation is a much more intuitive way to represent the position of the node on the tree than the sum of its x and y coordinates. Here are some functions to convert between the sum and the path and back again.
Note. These implementations rely on the instr
function which is ported from Oracle. It, along with its support functions, can be found in the PostgreSQL 7.4 Documentation in the appendix of Porting from Oracle PL/SQL. There is an error in the functions as they’re given in the 7.4 docs. The instr
function with three arguments, instr(varchar,varchar,varchar)
, should be instr(varchar,varchar,integer)
. Just change the datatype of the last parameter from varchar
to integer
and it should work. This should be fixed in the 7.5 docs.
create function ni_path_inner (
integer -- numer
, integer -- denom
) returns text
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
path_string text;
begin
if numer is null then
return '''';
end if;
return
ni_path_inner(
ni_parent_numer(numer,denom)
, ni_parent_denom(numer,denom)
)
|| ''.''
|| ni_sibling_number(numer,denom)
;
end;
';
create function ni_path (
integer -- numer
, integer -- denom
) returns ni_path
language sql as '
select substr(ni_path_inner($1,$2),2,length(ni_path_inner($1,$2)) - 1)::ni_path
';
The ni_path_inner
function returns a string with a leading dot. The ni_path
function is a wrapper which calls ni_path_inner
and removes the leading dot. This is not only for form’s sake: it’s necessary to close the conversion between path and sum. We want x / y
to be equal to ni_path_numer(ni_path(x,y)) / ni_path_denom(ni_path(x,y))
.
-- node 1
select ni_path(3,2);
ni_path
---------
1
(1 row)
-- node 1.1.3
select ni_path(51,32);
ni_path
---------
1.1.3
(1 row)
-- node 2.1.1
select ni_path(15,16);
ni_path
---------
2.1.1
(1 row)
create function ni_path_numer (
ni_path -- x.x.x path notation
) returns integer
language plpgsql as '
declare
path alias for $1;
numer integer;
denom integer;
postfix text;
sibling text;
begin
numer := 1;
denom := 1;
postfix := ''.'' || path || ''.'';
while length(postfix) > 1 loop
sibling := substr(
postfix
, 2
, instr(postfix,''.'',2) - 2
);
postfix := substr(
postfix
, instr (postfix, ''.'', 2)
, length(postfix) - instr(postfix,''.'',2) + 1
);
numer := ni_child_numer(
numer
, denom
, to_number(
sibling
, repeat(''9'',length(sibling))
)::integer
);
denom := ni_child_denom(
numer
, denom
, to_number(
sibling
, repeat(''9'',length(sibling))
)::integer
);
end loop;
return numer;
end;
';
create or replace function ni_path_denom (
ni_path -- x.x.x path notation
) returns integer
language plpgsql as '
declare
path alias for $1;
numer integer;
denom integer;
postfix text;
sibling text;
begin
numer := 1;
denom := 1;
postfix := ''.'' || path || ''.'';
while length(postfix) > 1 loop
sibling := substr(
postfix
, 2
, instr(postfix,''.'',2) - 2
);
postfix := substr(
postfix
, instr (postfix, ''.'', 2)
, length(postfix) - instr(postfix,''.'',2) + 1
);
numer := ni_child_numer(
numer
, denom
, to_number(
sibling
, repeat(''9'',length(sibling))
)::integer
);
denom := ni_child_denom(
numer
, denom
, to_number(
sibling
, repeat(''9'',length(sibling))
)::integer
);
end loop;
return denom;
end;
';
select ni_path_numer('2.1.1') || '/' || ni_path_denom('2.1.1') as sum;
sum
-------
15/16
(1 row)
select ni_path_numer('3.2') || '/' || ni_path_denom('3.2') as sum;
sum
-------
11/32
(1 row)
select ni_path_numer('1') || '/' || ni_path_denom('1') as sum;
sum
-----
3/2
(1 row)
select ni_path_numer('3') || '/' || ni_path_denom('3') as sum;
sum
-----
3/8
(1 row)
And checking that the conversion functions are inverses of each other.
select ni_path(ni_path_numer('1.1.3'),ni_path_denom('1.1.3'));
ni_path
---------
1.1.3
(1 row)
select ni_path_numer(ni_path(51,32)) || '/' || ni_path_denom(ni_path(51,32)) as sum;
sum
-------
51/32
(1 row)
Distance between two nodes is also useful. The ni_distance
function takes two numerator/denominator pairs in the order distance node1 from ancestor node2. It recursively walks from node1 to the root, checking to see if each subsequent parent node2.
Note. I have implemented this straight from Tropashko, including his self-described kludge which returns a large negative number if node1 is not a descendant of node2. This of course would fail to properly return a negative number for deeply nested nodes; in this case if node1 has a depth of 1 million or more. (Though we’ll see later this is unlikely when using this particular implementation.) A more elegant and robust solution is desirable.
create function ni_distance (
integer -- numer_1
, integer -- denom_1
, integer -- numer_2
, integer -- denom_2
) returns integer
language plpgsql as '
declare
numer_1 alias for $1;
denom_1 alias for $2;
numer_2 alias for $3;
denom_2 alias for $4;
begin
if numer_1 is null then
return -999999;
end if;
-- check if they are the same point
if numer_1 = numer_2
and denom_1 = denom_2 then
return 0;
end if;
return 1 + ni_distance (
ni_parent_numer(numer_1, denom_1)
, ni_parent_denom(numer_1, denom_1)
, numer_2
, denom_2);
end;
';
-- distance between 1.1 and 1
select ni_distance (7,4, 3,2);
ni_distance
-------------
1
(1 row)
-- distance between 1.1.1.1 and 1.1
select ni_distance(31,16, 7,4);
ni_distance
-------------
2
(1 row)
-- distance between 1.1 and 1.1.1.1
select ni_distance(7,4, 31,16);
ni_distance
-------------
-999997
(1 row)
-- distance between 2.1.1 and 3
select ni_distance(15,16, 3,8);
ni_distance
-------------
-999996
(1 row)
With the functions we have so far, it’s possible to build and query a table representing a hierarchy with nested intervals. Let’s take a look at an example of a virtual file system.
create table folders (
folder_name text not null unique
, num integer not null
, den integer not null
, unique (num,den)
);
Note. These ni_
functions work for trees with roots n = 1, 2, 3,… but fail for root 0 at (1,0) with sum = 1⁄1. I’ll choose node 1 as my root, so num/den = 3⁄2.
-- initialize the tree
insert into folders (folder_name, num, den) values ('root', 3,2);
Now, I could calculate the current tree and insert the next folder manually. But a function can do this for us, given the parent folder (node) we want to insert it in. For housekeeping purposes, we’ll fill in the tree sequentially; for example, insert child 1 before child 2. The idea is to count the number of children in of the parent folder n, and insert the new folder as child = n + 1. This won’t work if the tree isn’t filled sequentially.
create or replace function insert_folder (
text -- parent_folder_name
, text -- new_folder_name
) returns void
language plpgsql as '
declare
parent_folder_name alias for $1;
new_folder_name alias for $2;
parent_folder_rational record;
new_child_number integer;
begin
select num, den
into parent_folder_rational
from folders where folder_name = parent_folder_name;
select (count(folder_name) + 1)::integer into new_child_number
from folders
where ni_parent_numer(num, den) = parent_folder_rational.num
and ni_parent_denom(num,den) = parent_folder_rational.den;
insert into folders (folder_name, num, den) values (
new_folder_name
, ni_child_numer(
parent_folder_rational.num
, parent_folder_rational.den
, new_child_number
)
, ni_child_denom(
parent_folder_rational.num
, parent_folder_rational.den
, new_child_number
)
);
return;
end;
';
As a convenience, I’ve declared a parent_folder_rational
record to hold the numerator and denominator of the parent folder. I could have used two integer variables such as parent_folder_num
and parent_folder_den
instead.
select insert_folder('root','usr');
select insert_folder('usr','local');
select insert_folder('local','src');
select insert_folder('local','pgsql');
select insert_folder('pgsql','bin');
select insert_folder('pgsql','data');
select insert_folder('pgsql','doc');
select insert_folder('doc','html');
select insert_folder('pgsql','include');
select insert_folder('pgsql','lib');
select insert_folder('pgsql','share');
And let’s create a view to display the folder hierarchy.
create or replace view folder_hierarchy as
select
folder_name
, num
, den
, num || '/' || den as folder_rational
, repeat(' ',ni_distance(num,den, 3,2))
|| folder_name as folder_name_
, ni_path(num,den) as folder_path
from folders
order by folder_path;
select * from folder_hierarchy;
folder_name | num | den | folder_rational | folder_name_ | folder_path
-------------+------+------+-----------------+-----------------+-------------
root | 3 | 2 | 3/2 | root | 1
usr | 7 | 4 | 7/4 | usr | 1.1
local | 15 | 8 | 15/8 | local | 1.1.1
src | 31 | 16 | 31/16 | src | 1.1.1.1
pgsql | 59 | 32 | 59/32 | pgsql | 1.1.1.2
bin | 119 | 64 | 119/64 | bin | 1.1.1.2.1
data | 235 | 128 | 235/128 | data | 1.1.1.2.2
doc | 467 | 256 | 467/256 | doc | 1.1.1.2.3
html | 935 | 512 | 935/512 | html | 1.1.1.2.3.1
include | 931 | 512 | 931/512 | include | 1.1.1.2.4
lib | 1859 | 1024 | 1859/1024 | lib | 1.1.1.2.5
share | 3715 | 2048 | 3715/2048 | share | 1.1.1.2.6
(12 rows)
Finding the path to a particular folder is the same as finding the ancestors of the node.
-- finding the path to data
select
f1.folder_name
, ni_path(f1.num,f1.den) as folder_path
from folders f1
, folders f2
where ni_distance(f2.num,f2.den, f1.num,f1.den) >= 0
and f2.folder_name = 'data'
order by ni_distance(f2.num,f2.den, f1.num,f1.den) desc;
folder_name | folder_path
-------------+-------------
root | 1
usr | 1.1
local | 1.1.1
pgsql | 1.1.1.2
data | 1.1.1.2.2
(5 rows)
Likewise, finding the subfolders of a particular folder is the same as finding all of the descendants of the node.
-- finding the subfolders of pgsql
select
f1.folder_name
, f1.folder_name_
, f1.folder_path
from folder_hierarchy f1
, folders f2
where ni_distance (f1.num,f1.den, f2.num,f2.den) >= 0
and f2.folder_name = 'pgsql'
order by f1.folder_path;
folder_name | folder_name_ | folder_path
-------------+-----------------+-------------
pgsql | pgsql | 1.1.1.2
bin | bin | 1.1.1.2.1
data | data | 1.1.1.2.2
doc | doc | 1.1.1.2.3
html | html | 1.1.1.2.3.1
include | include | 1.1.1.2.4
lib | lib | 1.1.1.2.5
share | share | 1.1.1.2.6
(8 rows)
Updating and deleting nodes requires a bit more work. To delete a node we need to close the hole left by the deleted node if it’s not the last child, which requires moving all of the “younger” children up, and updating all of their children to reflect their new position in the hierarchy. Moving a node involves filling the hole from its old position and inserting it in its new position. If one wants to insert it between two existing children, all of the “younger” children need to be moved further down the line, and their descendants updated. My folder model depends only on parent-child relationships, not child position, so I didn’t implement inserting a node between existing children, but it’s quite straightforward to do so. One reason to do this might be to keep folders in alphabetical order.
The fundamental operation in these updates is relocating a node and its descendants. Before finding the article where Tropashko describes updating the hierarchy, I made my own implementation using paths, being unable to figure out a more elegant method which I knew must exist.
The two functions ni_reroot_numer
and ni_reroot_denom
simply take the path to the node to be updated, truncate the path from the old root on up and attach it to the path to the new root, and then return the numerator and denominator of the new path. This is where having ni_path
and ni_path_numer/ni_path_denom
being proper inverses pays off. This is not the way I’d recommend doing this, but it does work.
create function ni_reroot_numer (
integer -- $1 old_numer
, integer -- $2 old_denom
, integer -- $3 old_root_numer
, integer -- $4 old_root_denom
, integer -- $5 new_root_numer
, integer -- $6 new_root_denom
) returns integer
stable
language sql as '
select ni_path_numer (
ni_path($5,$6) || substr(
ni_path($1,$2)
, length(ni_path($3,$4)) + 1
, length(ni_path($1,$2)) - length(ni_path($3,$4))
)
);
';
create function ni_reroot_denom (
integer -- $1 old_numer
, integer -- $2 old_denom
, integer -- $3 old_root_numer
, integer -- $4 old_root_denom
, integer -- $5 new_root_numer
, integer -- $6 new_root_demon
) returns integer
stable
language sql as '
select ni_path_denom (
ni_path($5,$6) || substr(
ni_path($1,$2)
, length(ni_path($3,$4)) + 1
, length(ni_path($1,$2)) - length(ni_path($3,$4))
)
);
';
After finding Tropashko’s follow-up article, I ported his translation functions to PL/pgSQL. The two complementary functions, ni_new_numer
and ni_new_denom
, use two helper functions, ni_normalize_numer
and ni_normalize_denom
, to reduce the fractions to simplest form. Functions described earlier such as the ni_x_
and ni_y
functions could be rewritten to call the normalization functions rather than include the code inline.
create function ni_normalize_numer (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
begin
ret_numer := numer;
ret_denom := denom;
while floor (ret_numer / 2.0) = ret_numer / 2.0 loop
ret_numer := ret_numer / 2;
ret_denom := ret_denom / 2;
end loop;
return ret_numer;
end;
';
create function ni_normalize_denom (
integer -- numer
, integer -- denom
) returns integer
language plpgsql as '
declare
numer alias for $1;
denom alias for $2;
ret_numer integer;
ret_denom integer;
begin
ret_numer := numer;
ret_denom := denom;
while floor (ret_numer / 2.0) = ret_numer / 2.0 loop
ret_numer := ret_numer / 2;
ret_denom := ret_denom / 2;
end loop;
return ret_denom;
end;
';
And the ni_new_numer
and ni_new_denom
functions themselves. They take advantage of the fact that each subtree is a scaled version of any other, with the scaling factor k being simply the ratio of the denominators of the old and new nodes. There are three basic parts to the functions.
create or replace function ni_new_numer (
integer -- old_numer
, integer -- old_denom
, integer -- old_origin_numer
, integer -- old_origin_denom
, integer -- new_origin_numer
, integer -- new_origin_denom
) returns integer
language plpgsql as '
declare
old_numer alias for $1;
old_denom alias for $2;
old_origin_numer alias for $3;
old_origin_denom alias for $4;
new_origin_numer alias for $5;
new_origin_denom alias for $6;
zoom_factor integer;
ret_numer integer;
ret_denom integer;
ret_numer_temp integer;
begin
-- old is the old_origin
if old_numer = old_origin_numer
and old_denom = old_origin_denom
then
return new_origin_numer;
end if;
if old_origin_denom < new_origin_denom then
zoom_factor := new_origin_denom / old_origin_denom;
new_origin_denom
, old_origin_denom
, zoom_factor;
ret_numer := old_numer
- old_origin_numer * old_denom / old_origin_denom;
ret_denom := old_denom * zoom_factor;
else
zoom_factor := old_origin_denom / new_origin_denom;
ret_numer := (old_numer
- old_origin_numer * old_denom / old_origin_denom
) * zoom_factor;
ret_denom := old_denom;
end if;
ret_numer_temp := ret_numer;
ret_numer := ni_normalize_numer (ret_numer_temp, ret_denom);
ret_denom := ni_normalize_denom (ret_numer_temp, ret_denom);
if ret_denom < new_origin_denom then
ret_numer := new_origin_numer
+ ret_numer * new_origin_denom / ret_denom;
ret_denom := new_origin_denom;
else
ret_numer := ret_numer
+ new_origin_numer * ret_denom / new_origin_denom;
end if;
ret_numer_temp := ret_numer;
ret_numer := ni_normalize_numer (ret_numer_temp, ret_denom);
ret_denom := ni_normalize_denom (ret_numer_temp, ret_denom);
return ret_numer;
end;
';
create or replace function ni_new_denom (
integer -- old_numer
, integer -- old_denom
, integer -- old_origin_numer
, integer -- old_origin_denom
, integer -- new_origin_numer
, integer -- new_origin_denom
) returns integer
language plpgsql as '
declare
old_numer alias for $1;
old_denom alias for $2;
old_origin_numer alias for $3;
old_origin_denom alias for $4;
new_origin_numer alias for $5;
new_origin_denom alias for $6;
zoom_factor integer;
ret_numer integer;
ret_denom integer;
ret_numer_temp integer;
begin
-- old is the old_origin
if old_numer = old_origin_numer
and old_denom = old_origin_denom
then
return new_origin_denom;
end if;
if old_origin_denom < new_origin_denom then
zoom_factor := new_origin_denom / old_origin_denom;
ret_numer := old_numer
- old_origin_numer * old_denom / old_origin_denom;
ret_denom := old_denom * zoom_factor;
else
zoom_factor := old_origin_denom / new_origin_denom;
ret_numer := (old_numer
- old_origin_numer * old_denom / old_origin_denom
) * zoom_factor;
ret_denom := old_denom;
end if;
ret_numer_temp := ret_numer;
ret_numer := ni_normalize_numer (ret_numer_temp, ret_denom);
ret_denom := ni_normalize_denom (ret_numer_temp, ret_denom);
if ret_denom < new_origin_denom then
ret_numer := new_origin_numer
+ ret_numer * new_origin_denom / ret_denom;
ret_denom := new_origin_denom;
else
ret_numer := ret_numer
+ new_origin_numer * ret_denom / new_origin_denom;
end if;
ret_numer_temp := ret_numer;
ret_numer := ni_normalize_numer (ret_numer_temp, ret_denom);
ret_denom := ni_normalize_denom (ret_numer_temp, ret_denom);
return ret_denom;
end;
';
Now we can use ni_new_numer
and ni_new_denom
as part of a function that deletes folders and performs the necessary housekeeping operations. It first checks to make sure the folder to be deleted has no subfolders (this is not a recursive delete), and that it is not the root node. Then, it finds number of siblings the folder has and how many are younger. The folder is deleted, and each younger sibling and its subtree is translated to the next eldest position.
create or replace function delete_folder (
text -- folder_name
) returns boolean
language plpgsql as '
declare
old_folder_name alias for $1;
old_folder_child_number integer;
sibling_count integer;
old_folder record;
parent_folder record;
old_folder_child_count integer;
old_folder_document_count integer;
old_child record;
new_child record;
begin
select num, den
into old_folder
from folders where folder_name = old_folder_name;
select (count(folder_name))::integer
into old_folder_child_count
from folders
where ni_parent_numer(num, den) = old_folder.num
and ni_parent_denom(num, den) = old_folder.den;
-- check to make sure the folder is empty (it has no children)
-- and it is not the root folder, raising exceptions in either case
if old_folder_child_count = 0
and old_folder.num <> 3
and old_folder.den <> 2 then
select
ni_parent_numer(old_folder.num,old_folder.den) as num
, ni_parent_denom(old_folder.num,old_folder.den) as den
into parent_folder;
select (count(folder_name))::integer into sibling_count
from folders
where ni_parent_numer(num, den) = parent_folder.num
and ni_parent_denom(num,den) = parent_folder.den;
old_folder_child_number :=
ni_sibling_number(old_folder.num, old_folder.den);
delete from folders where folder_name = old_folder_name;
if (sibling_count - old_folder_child_number) > 1 then
-- There are younger siblings of the deleted folder.
-- Promote each by one child number to fill the
-- hole left by the deleted folder
for i in (old_folder_child_number + 1)..sibling_count loop
select
ni_child_numer(parent_folder.num, parent_folder.den, i) as num
, ni_child_denom (parent_folder.num, parent_folder.den, i) as den
into old_child;
select
ni_child_numer(parent_folder.num, parent_folder.den, i - 1) as num
, ni_child_denom (parent_folder.num, parent_folder.den, i - 1) as den
into new_child;
update folders set
num = ni_new_numer(
num, den
, old_child.num, old_child.den
, new_child.num, new_child.den
)
, den = ni_new_denom(
num, den
, old_child.num, old_child.den
, new_child.num, new_child.den
)
where ni_distance( num,den, old_child.num,old_child.den) >= 0;
end loop;
end if;
return true;
elsif (old_folder_child_count) <> 0
then raise exception ''Cannot delete non-empty folder.'';
elsif (old_folder.num = 3 and old_folder.den = 2)
then raise exception ''Cannot delete root folder.'';
end if;
end;
';
And to try it out. I’ll delete the data folder, as it has younger siblings with subtrees which will test to see if these parts of the function work.
select delete_folder('data');
delete_folder
---------------
t
(1 row)
select * from folder_hierarchy;
folder_name | num | den | folder_rational | folder_name_ | folder_path
-------------+-----+-----+-----------------+-----------------+-------------
root | 3 | 2 | 3/2 | root | 1
usr | 7 | 4 | 7/4 | usr | 1.1
local | 15 | 8 | 15/8 | local | 1.1.1
src | 31 | 16 | 31/16 | src | 1.1.1.1
pgsql | 59 | 32 | 59/32 | pgsql | 1.1.1.2
bin | 119 | 64 | 119/64 | bin | 1.1.1.2.1
doc | 235 | 128 | 235/128 | doc | 1.1.1.2.2
html | 471 | 256 | 471/256 | html | 1.1.1.2.2.1
include | 467 | 256 | 467/256 | include | 1.1.1.2.3
share | 931 | 512 | 931/512 | share | 1.1.1.2.4
(10 rows)
The folder data
is gone, and doc
, html
, include
, and share
have all been appropriately updated. The delete_folder
function can be modified to handle moving a folder as well by adding a loop that inserts the folder and its subfolders into the new location before promoting the siblings.
The denominator of binary fractions gets large very quickly, moving through the available integers at an astonishing rate. Using int4
, we’re limited to 29 children for node 1 (1.29 is the youngest possible child), and 29 generations after node 1 (1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1 is the most deeply nested node possible), with the maximum nesting depth and the number of children decreasing the with child number and depth: the sum of the numbers in the path must be less than n − 2, where n is the number of bits avaible. Using int8
, we can have 62 children of node 1, and 62 generations. This limitation means there should be additional code which checks the size and shape of the tree when it’s translated to make sure it will fit in its new position. This could be done by finding the maximum sum of materialized path values for each point of the subtree to be moved, and check to see if this and the sum of the sum of the path values for the new root is less than the maximum.
Tropashko’s nested interval work has been brought up at least twice on the PostgreSQL mailing lists. In March 2004 Manfred Koizar—referring to nested intervals as the Obfuscated Materialized Path Method, or OMPM—wrote he feels nested intervals is “irreparably broken”, its primary failing being that “If you try to store materialized path information in a fixed number of integers you run out of bits very soon.” Tropashko responded (in a post that appears to only have hit the newsgroups) but did not directly address the overflow issue. Looking at other encoding algorithms that may be more efficient with integer use, or at other data types, perhaps there’s still hope. What is needed is an encoding schema that encompasses the range of hierarchies you want to encode. All datatypes and encoding schemes have limitations. The question is whether the limitations restrict what you’re trying to do.
Performance issues have been discussed on the list as well. The ni_distance
function, used to calculated children and parents requires a full table scan as an index cannot use this information. Alternatives such as Teodor Sigaev and Oleg Bartunov’s contrib/ltree, Joe Conway’s connectby()
(in contrib/tablefunc), and storing paths directly using strings and ASCII (as well as the possibility of using bytea) were mentioned.
There are at least two methods of working around the datatype limitation. Developing a PostgreSQL base type for rational numbers, perhaps based on Perl’s BigRational or other implementation would allow a much larger range of numbers to be used. The availability of a rational data type would also eliminate the necessity of numerator and denominator variants and their accompanying duplicated code. Another alternative is to use a different algorithm for generating the tree, such as Farey fractions or continued fractions, both of which have been explored by Tropashko, who has mentioned in a recent newsgroup point that he is rewriting his article on using the latter. Continued fractions looks particularly promising, and I hope to explore these functions sometime soon.
For your convenience, here are the ni_
functions and the example data. You will also need to have PL/pgSQL installed in your database and create the instr
functions (with the appropriate modifictions). The code was developed against PostgreSQL 7.4.1.