July 10, 2004

Static Hierarchies and Binary Fractions in PostgreSQL

The nested set model is a convenient way to represent hierarchies in SQL. By taking advantage of the set nature of relations, it provides quick and relatively simple queries to determine, for example, all descendants of a given node. Such queries are more computationally difficult using adjacency lists.

The nested set however, has performance limitations on insert, update, and delete. This arises because the nested set model encodes the hierarchy based on what the set contains: when a node is inserted or deleted, all of the sets containing the node must be updated as well. In other words, the encoding depends on the current state of the database. This behavior makes the nested set model less desirable for large, dynamic hierarchies.

An alternative to a data-dependent encoding is a static hierarchy. An example of this is a materialized path, where nodes 1.1, 1.2, and 1.3 are the first, second, and third children of node 1. Node 1.4 is still the fourth child of node 1 even if there don’t happen to be four (or more) children assigned to node 1 in the database. The encoding value, 1.4, is waiting to be used when the time arises.

With all hierarchies, there are four basic queries:

  • What are the children of this node?
  • What are the descendants of this node?
  • What is the parent of this node?
  • What are the ancestors of this node?

And there are the three additional operations of inserting a node, deleting a node, or moving a node.

With a static hierarchy, encoding values are preselected. The node’s position relative to all another nodes is known directly from the encoding value. Given node 1.2.2, we know its parent is node 1.2, and its ancestors are nodes 1.2 and 1. We know its children will all have an encoding of 1.2.1.x, and all of its descendants will be prefixed by 1.2.1.

This makes inserts trivial. Inserting a child of a node requires only finding out how many children the node currently has, and giving the new node the encoding appropriate for the next child. If node 1 currently has three children, the new child will have encoding 1.4. No nodes need to be updated when a new node is inserted.

Deleting a node requires finding all of the nodes younger siblings: nodes with the same parent and greater sibling number. Continuing the previous example, if we delete node 1.2, nodes 1.3 and 1.4 become 1.2 and 1.3 respectively. If 1.3 and 1.4 have descendants, we need a way of translating the subtrees of 1.3 and 1.4 to their new positions. This makes translation a fundamental operation when working with hierarchies.

If materialized paths are used as the encoding, simply find all of the current descendants of 1.3, and replace the 1.3 with 1.2. This translation is repeated for each younger sibling. Updating is simply translating a node and its subtree to a new inserted position and translating its former younger siblings to fill the hole it left. If the database model requires inserting a node between two existing nodes, just translate the younger siblings to an even younger position.

Materialized paths represented as text in the database makes all hierarchy operations simply a matter of parsing and manipulating strings. In PostgreSQL, one can even create a domain to make sure all paths are properly represented.

create domain ni_path as 
    text
    not null
    check (value ~ '^([1-9]\d*)(\.[1-9]\d*)*$');

One criticism of materialized paths is that string manipulation is undesirable. An encoding that is based on a mathematical algorithm might be better. One possible algorithm is to use binary fractions, which is explained further below. Other algorithms can be used to map the hierarchy as well, such as Farey fractions or continued fractions. Vadim Tropashko has written a number of articles on various nested interval implementations for Oracle using PL/SQL. My implementation is largely a port of his work for PostgreSQL using PL/pgSQL. The point of any of these alogorithms is to generate a static hierarchy. The only difference is how the hierarchy is encoded and manipulated.

Binary Fractions

The model is called the nested interval model, which implies one dimensional intervals each bounded by two points. However, I think the math is easier to follow if I use two dimensions. If you’re not interested in the math, feel free to skip down to the functions.

Take the unit square described by 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. The goal is to divide this square into progressively smaller, non-overlapping domains, nodes being points represented by (x,y) coordinates. Their position, i.e., which domain they lie within, determines their ancestors, and also the domain of their descendants.

Following Tropashko’s example, let’s place the first node—node 1—at (1,12). Where do we place this node’s children? Well, the x-coordinate of the first child of node 1—node 1.1—is the same as its parent, so x1.1 = 1. The y-coordinate of the first child is located at the midpoint of the y-position of the parent (y = 12) and the depth-first convergence point, where the x-position of the parent meets the diagonal x = y. For node 1, the diagonal intercepts x = 1 at y = 1, so the y-coordiate of the first child is y = 34: Node 1.1 is located at (1,34). You can see that node 1.1.1, the first child of node 1.1, is at (1,78), and that 1.1.1.1 is located at (1,1516).

The unit square divided into domains.

So how about siblings? Siblings are located along the line between the first child and the breadth-first convergence point, where the y-position of the parent intercepts the diagonal x = y. The second child—node 1.2—is located at (34, 58). Node 1.3 (the third child) is half-way between node 1.2 and the breadth-first convergence point, at (58, 916).

The domain of node 1 is shown in red, and the domains of nodes 2 and 3 are yellow and green, respectively. The depth-first and breadth-first convergence points are simple the points where the lines x = x1 and y = y1 intersect the diagonal: the other two points of the domain triangle.

Also, note that all immediate children of a node z are located vertically half-way between the diagonal and the line y = yz; and horizontally half-way between their next-eldest sibling and the diagonal. This will be useful when finding the parent and sibling number of a given node.

Following this algorithm, we can divide the unit square into smaller and smaller domains, determined by the location of a parent node. In the figure, all children of node 1 are found within the right triangle formed by the diagonal x = y, y = y1, x = x1. In turn, all children of node 1.1 are within the right triangle bounded by the x = y, x = x1.1, y = y1.1.

This fractal pattern continues to infinity, or to the precision of the computer system. As we move deeper into the tree, numbers get big pretty quickly, and quite close together. Accuracy is important, and floating point math is not sufficiently accurate to safely represent the hierarchy. A rational number type which accurately represents integer numerators and denominators is desirable, and something I’d like to implement in PostgreSQL as a base type, but for the time being I’ll represent rational numbers as numerator-denominator integer pairs. Each point is represented by four integers. For example, node 1 is xnum = 1, xden = 1, ynum = 1, yden = 2. Functions will likewise have numerator and denominator variants.

node x y sum
1 1 12 32
1.1 1 34 74
1.1.1 1 78 158
1.1.1.1 1 1516 3116
1.1.2 78 1316 2716
1.1.3 1316 2532 5132
1.2 34 58 118
1.2.1 34 1116 2316
1.3 58 916 1916
2 12 14 34
2.1 12 38 78
2.1.1 12 716 1516
2.2 38 516 1116
2.3 516 932 1932
3 14 18 38
3.1 14 316 716
3.2 316 532 1132

It should also be obvious from the fact that the algorithm involves repeated halving that these are binary fractions: the denominator is always a power of 2.

Tropashko points out that the x and y values of each node are related and can be equivalently represented as the sum of x and y. Inspection shows that xnum × yden = ynum + 1. For node 1, num1 = 3, den1 = 2, or 32. These sums will be the basis of the encoding.

Given their sum, we can derive the x and y numerators and denominators using functions. I prefixed my functions with ni for nested interval. The functions take advantage of binary fractions in two ways:

  • Adding 1 to or subtracting 1 from a numerator of a fraction in reduced form will give a fraction that is no longer in reduced form. It can be reduced by dividing both the numerator and denominator by two.
  • All binary fractions that are not in reduced form can be reduced by repeatedly dividing the numerator and denominator by two until the numerator cannot be evenly divided by two.

create function ni_x_numer (
      integer -- numer
    , integer -- denom
    ) returns integer
    language plpgsql as '
    declare
        numer alias for $1;
        denom alias for $2;
        ret_numer integer;
        ret_denom integer;
    begin
        -- add one to the numerator so the numerator becomes 2x
        ret_numer := numer + 1;
        ret_denom := denom * 2;
        -- reduce to simplest form
        while floor(ret_numer / 2.0) = ret_numer / 2.0 loop
            ret_numer := ret_numer / 2;
            ret_denom := ret_denom / 2;
        end loop;
        return ret_numer;
    end;
    ';

create function ni_x_denom (
      integer -- numer
    , integer -- denom
    ) returns integer
    language plpgsql as '
    declare
        numer alias for $1;
        denom alias for $2;
        ret_numer integer;
        ret_denom integer;
    begin
        ret_numer := numer + 1;
        ret_denom := denom * 2;
        while floor(ret_numer / 2.0) = ret_numer / 2.0 loop
            ret_numer := ret_numer / 2;
            ret_denom := ret_denom / 2;
        end loop;
        return ret_denom;
    end;
    ';

create function ni_y_numer (
      integer -- numer
    , integer -- denom
    ) returns integer
    language plpgsql as '
    declare
        numer alias for $1;
        denom alias for $2;
        ret_numer integer;
        ret_denom integer;
    begin
        ret_numer := ni_x_numer(numer,denom);
        ret_denom := ni_x_denom(numer,denom);
        while ret_denom < denom loop
            ret_numer := ret_numer * 2;
            ret_denom := ret_denom * 2;
        end loop;
        ret_numer = numer - ret_numer;
        -- reduce to simplest form
        while floor(ret_numer / 2.0) = ret_numer / 2.0 loop
            ret_numer = ret_numer / 2;
            ret_denom = ret_denom / 2;
        end loop;
        return ret_numer;
    end;
    ';

create function ni_y_denom (
      integer -- numer
    , integer -- denom
    ) returns integer
    language plpgsql as '
    declare
        numer alias for $1;
        denom alias for $2;
        ret_numer integer;
        ret_denom integer;
    begin
        ret_numer := ni_x_numer(numer,denom);
        ret_denom := ni_x_denom(numer,denom);
        while ret_denom < denom loop
            ret_numer := ret_numer * 2;
            ret_denom := ret_denom * 2;
        end loop;
        ret_numer = numer - ret_numer;
        while floor(ret_numer / 2.0) = ret_numer / 2.0 loop
            ret_numer = ret_numer / 2;
            ret_denom = ret_denom / 2;
        end loop;
        return ret_denom;
    end;
    ';

The two ni_x functions So, let’s see if it works. I set up a table of the node x-y-sum data.

create table ni_data (
      node ni_path not null unique
    , numer integer not null
    , denom integer not null
    , unique (numer,denom)
);

insert into ni_data (node, numer, denom) values ('1',3,2);
insert into ni_data (node, numer, denom) values ('1.1',7,4);
insert into ni_data (node, numer, denom) values ('1.1.1',15,8);
insert into ni_data (node, numer, denom) values ('1.1.1.1',31,16);
insert into ni_data (node, numer, denom) values ('1.1.2',27,16);
insert into ni_data (node, numer, denom) values ('1.1.3',51,32);
insert into ni_data (node, numer, denom) values ('1.2',11,8);
insert into ni_data (node, numer, denom) values ('1.2.1',23,16);
insert into ni_data (node, numer, denom) values ('1.3',19,16);

insert into ni_data (node, numer, denom) values ('2',3,4);
insert into ni_data (node, numer, denom) values ('2.1',7,8);
insert into ni_data (node, numer, denom) values ('2.1.1',15,16);
insert into ni_data (node, numer, denom) values ('2.2',11,16);
insert into ni_data (node, numer, denom) values ('2.3',19,32);

insert into ni_data (node, numer, denom) values ('3',3,8);
insert into ni_data (node, numer, denom) values ('3.1',7,16);
insert into ni_data (node, numer, denom) values ('3.2',11,32);

select
      node
    , ni_x_numer(numer,denom) || '/' || ni_x_denom(numer,denom) as x
    , ni_y_numer(numer,denom) || '/' || ni_y_denom(numer,denom) as y
    , numer || '/' || denom as sum
from ni_data
order by node;

  node   |   x   |   y   |  sum  
---------+-------+-------+-------
 1       | 1/1   | 1/2   | 3/2
 1.1     | 1/1   | 3/4   | 7/4
 1.1.1   | 1/1   | 7/8   | 15/8
 1.1.1.1 | 1/1   | 15/16 | 31/16
 1.1.2   | 7/8   | 13/16 | 27/16
 1.1.3   | 13/16 | 25/32 | 51/32
 1.2     | 3/4   | 5/8   | 11/8
 1.2.1   | 3/4   | 11/16 | 23/16
 1.3     | 5/8   | 9/16  | 19/16
 2       | 1/2   | 1/4   | 3/4
 2.1     | 1/2   | 3/8   | 7/8
 2.1.1   | 1/2   | 7/16  | 15/16
 2.2     | 3/8   | 5/16  | 11/16
 2.3     | 5/16  | 9/32  | 19/32
 3       | 1/4   | 1/8   | 3/8
 3.1     | 1/4   | 3/16  | 7/16
 3.2     | 3/16  | 5/32  | 11/32
(17 rows)

So far, so good!

As I mentioned before, the key to nested intervals is that the entire tree is determined by the algorithm used to generate it. I plotted the children of each node using two rules:

first child
xfirst child is midway between xparent and the depth-first convergence point, yfirst child is midway between yparent and the breadth-first convergence point.
sibling
xnext sibling is midway between xchild and the depth-first convergence point, ynext sibling is midway between ychild and the breadth-first convergence point.

These rules can also be expressed as functions. The pow function in PostgreSQL is based on floating point math. Manfred Koizar posted a function to compute powers of 2 using integers to the pgsql-general mailing list. I have reproduced his function (with minor editorializing) here and use it in the functions to calculate the sibling number and parent of a node.

create or replace function pow2(bigint) 
    returns bigint 
    language plpgsql as '
    declare
        e bigint;
        ret bigint;
    begin
        e := $1;
        if (e < 0) then
            raise exception ''2 ^ % does not fit into a bigint'', e;
        end if;
        if (e > 62) then
            raise exception ''2 ^ % does not fit into a bigint'', e;
        end if;
        ret = 1;
        while (e > 0) loop
            ret = 2 * ret;
            e = e - 1;
        end loop;
        return ret;
    end;
';

create function ni_child_numer (
      integer -- numer
    , integer -- denom
    , integer -- child
    ) returns integer 
    language plpgsql as '
        declare
            numer alias for $1;
            denom alias for $2;
            child alias for $3;
        begin
            return numer * pow2(child) + 3 - pow2(child);
        end;
    ';

create function ni_child_denom (
      integer -- numer
    , integer -- denom
    , integer -- child
    ) returns integer 
    language plpgsql as '
        declare
            numer alias for $1;
            denom alias for $2;
            child alias for $3;
        begin
            return denom * pow2(child);
        end;
    ';

Let’s see if these worked.

-- node 1, sum = 3/2. first child (1.1)
select ni_child_numer(3,2, 1) || '/' || ni_child_denom(3,2, 1) as child;
 child 
-------
 7/4
(1 row)

-- node 1.1, sum = 7/4, first child (1.1.1)
select ni_child_numer(7,4, 1) || '/' || ni_child_denom(7,4, 1) as child;
 child 
-------
 15/8
(1 row)

-- node 1.1, third child (1.1.3)
select ni_child_numer(7,4, 3) || '/' || ni_child_denom(7,4, 3) as child;
 child 
-------
 51/32
(1 row)

-- node 3, sum = 3/8, second child (3.2)
select ni_child_numer(3,8, 2) || '/' || ni_child_denom(3,8, 2) as child;
 child 
-------
 11/32
(1 row)

Given the child node, we can find its parent and sibling number. These functions are based on the earlier observation of a child’s position relative to its parent:

  • ychild less the vertical distance between the child and the diagonal x = y is equal to yparent.
  • Recursively adding the horizontal distance between xchild and the diagonal to xchild will bring you to xparent.
  • the number of times of the recursion to xparent is one less than the child number.
create function ni_parent_numer (
      integer -- numer
    , integer -- denom
    ) returns integer 
    language plpgsql as '
    declare
        numer alias for $1;
        denom alias for $2;
        ret_numer integer;
        ret_denom integer;
    begin
        -- special case where node is the root
        if numer = 3 then
            return null;
        end if;
        ret_numer := (numer - 1) / 2 ;
        ret_denom := denom / 2 ;
        while floor (( ret_numer - 1) / 4.0) = (ret_numer - 1) / 4.0 loop
            ret_numer := (ret_numer + 1) / 2;
            ret_denom := ret_denom / 2;
        end loop;
        return ret_numer;
    end;
    ';

create function ni_parent_denom (
      integer -- numer
    , integer -- denom
    ) returns integer 
    language plpgsql as '
    declare
        numer alias for $1;
        denom alias for $2;
        ret_numer integer;
        ret_denom integer;
    begin
        if numer = 3 then
            return null;
        end if;
        ret_numer := (numer - 1) / 2 ;
        ret_denom := denom / 2 ;
        while floor (( ret_numer - 1) / 4.0) = (ret_numer - 1) / 4.0 loop
            ret_numer := (ret_numer + 1) / 2;
            ret_denom := ret_denom / 2;
        end loop;
        return ret_denom;
    end;
    ';

create function ni_sibling_number (
    integer -- numer
    , integer -- denom
    ) returns integer
    language plpgsql as '
    declare
        numer alias for $1;
        denom alias for $2;
        ret_numer integer;
        ret_denom integer;
        ret integer;
    begin
        ret_numer := (numer - 1) / 2;
        ret_denom := denom / 2;
        ret := 1;

        while floor ((ret_numer - 1) / 4.0) = (ret_numer - 1) / 4.0 loop
            if ret_numer = 1 and ret_denom = 1 then
                return ret;
            end if;
            ret_numer := (ret_numer + 1) / 2;
            ret_denom := ret_denom / 2;
            ret := ret + 1;
        end loop;
        return ret;
    end;
    ';

-- node 3.2, sum 11/32
select ni_parent_numer(11,32) || '/' || ni_parent_denom(11,32) as parent
    , ni_sibling_number(11,32) as child_number;
 parent | child_number 
--------+--------------
 3/8    |            2
(1 row)

-- node 2.1.1, sum 15/16
select ni_parent_numer(15,16) || '/' || ni_parent_denom(15,16) as parent
    , ni_sibling_number(15,16) as child_number;
 parent | child_number 
--------+--------------
 7/8    |            1
(1 row)

Comparing with our previous data, we can verify that this is correct.

The dotted path notation is a much more intuitive way to represent the position of the node on the tree than the sum of its x and y coordinates. Here are some functions to convert between the sum and the path and back again.

Note. These implementations rely on the instr function which is ported from Oracle. It, along with its support functions, can be found in the PostgreSQL 7.4 Documentation in the appendix of Porting from Oracle PL/SQL. There is an error in the functions as they’re given in the 7.4 docs. The instr function with three arguments, instr(varchar,varchar,varchar), should be instr(varchar,varchar,integer). Just change the datatype of the last parameter from varchar to integer and it should work. This should be fixed in the 7.5 docs.

create function ni_path_inner (
      integer -- numer
    , integer -- denom
    ) returns text 
    language plpgsql as '
    declare
        numer alias for $1;
        denom alias for $2;
        path_string text;
    begin
        if numer is null then
            return '''';
        end if;
        return 
            ni_path_inner(
                ni_parent_numer(numer,denom)
                , ni_parent_denom(numer,denom)
                )
            || ''.'' 
            || ni_sibling_number(numer,denom)
            ;
    end;
    ';

create function ni_path (
      integer -- numer
    , integer -- denom
    ) returns ni_path
    language sql as '
    select substr(ni_path_inner($1,$2),2,length(ni_path_inner($1,$2)) - 1)::ni_path
    ';

The ni_path_inner function returns a string with a leading dot. The ni_path function is a wrapper which calls ni_path_inner and removes the leading dot. This is not only for form’s sake: it’s necessary to close the conversion between path and sum. We want x / y to be equal to ni_path_numer(ni_path(x,y)) / ni_path_denom(ni_path(x,y)).

-- node 1
select ni_path(3,2);
 ni_path 
---------
 1
(1 row)

-- node 1.1.3
select ni_path(51,32);
 ni_path 
---------
 1.1.3
(1 row)

-- node 2.1.1
select ni_path(15,16);
 ni_path 
---------
 2.1.1
(1 row)

create function ni_path_numer (
    ni_path -- x.x.x path notation
    ) returns integer
    language plpgsql as '
    declare 
        path alias for $1;
        numer integer;
        denom integer;
        postfix text;
        sibling text;
    begin
        numer := 1;
        denom := 1;
        postfix := ''.'' || path || ''.'';

        while length(postfix) > 1 loop
            sibling := substr(
                    postfix
                    , 2
                    , instr(postfix,''.'',2) - 2
                );
            postfix := substr(
                    postfix
                    , instr (postfix, ''.'', 2)
                    , length(postfix) - instr(postfix,''.'',2) + 1
                );
            numer := ni_child_numer(
                        numer
                        , denom
                        , to_number(
                            sibling
                            , repeat(''9'',length(sibling))
                            )::integer
                        );
            denom := ni_child_denom(
                          numer
                        , denom
                        , to_number(
                              sibling
                            , repeat(''9'',length(sibling))
                            )::integer
                        );
        end loop;
        return numer;
    end;
    ';

create or replace function ni_path_denom (
    ni_path -- x.x.x path notation
    ) returns integer
    language plpgsql as '
    declare 
        path alias for $1;
        numer integer;
        denom integer;
        postfix text;
        sibling text;
    begin
        numer := 1;
        denom := 1;
        postfix := ''.'' || path || ''.'';

        while length(postfix) > 1 loop
            sibling := substr(
                      postfix
                    , 2
                    , instr(postfix,''.'',2) - 2
                );
            postfix := substr(
                      postfix
                    , instr (postfix, ''.'', 2)
                    , length(postfix) - instr(postfix,''.'',2) + 1
                );
            numer := ni_child_numer(
                          numer
                        , denom
                        , to_number(
                            sibling
                            , repeat(''9'',length(sibling))
                            )::integer
                        );
            denom := ni_child_denom(
                        numer
                        , denom
                        , to_number(
                            sibling
                            , repeat(''9'',length(sibling))
                            )::integer
                        );
        end loop;
        return denom;
    end;
    ';


select ni_path_numer('2.1.1') || '/' || ni_path_denom('2.1.1') as sum;
  sum  
-------
 15/16
(1 row)

select ni_path_numer('3.2') || '/' || ni_path_denom('3.2') as sum;
  sum  
-------
 11/32
(1 row)

select ni_path_numer('1') || '/' || ni_path_denom('1') as sum;
 sum 
-----
 3/2
(1 row)

select ni_path_numer('3') || '/' || ni_path_denom('3') as sum;
 sum 
-----
 3/8
(1 row)

And checking that the conversion functions are inverses of each other.

select ni_path(ni_path_numer('1.1.3'),ni_path_denom('1.1.3'));
 ni_path 
---------
 1.1.3
(1 row)

select ni_path_numer(ni_path(51,32)) || '/' || ni_path_denom(ni_path(51,32)) as sum;
  sum  
-------
 51/32
(1 row)

Distance between two nodes is also useful. The ni_distance function takes two numerator/denominator pairs in the order distance node1 from ancestor node2. It recursively walks from node1 to the root, checking to see if each subsequent parent node2.

Note. I have implemented this straight from Tropashko, including his self-described kludge which returns a large negative number if node1 is not a descendant of node2. This of course would fail to properly return a negative number for deeply nested nodes; in this case if node1 has a depth of 1 million or more. (Though we’ll see later this is unlikely when using this particular implementation.) A more elegant and robust solution is desirable.

create function ni_distance (
      integer -- numer_1
    , integer -- denom_1
    , integer -- numer_2
    , integer -- denom_2
    ) returns integer
    language plpgsql as '
    declare
        numer_1 alias for $1;
        denom_1 alias for $2;
        numer_2 alias for $3;
        denom_2 alias for $4;
    begin
        if numer_1 is null then
            return -999999;
        end if;
        -- check if they are the same point
        if numer_1 = numer_2 
            and denom_1 = denom_2 then
            return 0;
        end if;
        return 1 + ni_distance (
            ni_parent_numer(numer_1, denom_1)
            , ni_parent_denom(numer_1, denom_1)
            , numer_2
            , denom_2);
    end;
    ';

-- distance between 1.1 and 1
select ni_distance (7,4, 3,2);
 ni_distance 
-------------
           1
(1 row)

-- distance between 1.1.1.1 and 1.1
select ni_distance(31,16, 7,4);
 ni_distance 
-------------
           2
(1 row)

-- distance between 1.1 and 1.1.1.1
select ni_distance(7,4, 31,16);
 ni_distance 
-------------
     -999997
(1 row)

-- distance between 2.1.1 and 3
select ni_distance(15,16, 3,8);
 ni_distance 
-------------
     -999996
(1 row)

With the functions we have so far, it’s possible to build and query a table representing a hierarchy with nested intervals. Let’s take a look at an example of a virtual file system.

create table folders (
    folder_name text not null unique
    , num integer not null
    , den integer not null
    , unique (num,den)
);

Note. These ni_ functions work for trees with roots n = 1, 2, 3,… but fail for root 0 at (1,0) with sum = 11. I’ll choose node 1 as my root, so num/den = 32.

-- initialize the tree
insert into folders (folder_name, num, den) values ('root', 3,2);

Now, I could calculate the current tree and insert the next folder manually. But a function can do this for us, given the parent folder (node) we want to insert it in. For housekeeping purposes, we’ll fill in the tree sequentially; for example, insert child 1 before child 2. The idea is to count the number of children in of the parent folder n, and insert the new folder as child = n + 1. This won’t work if the tree isn’t filled sequentially.

create or replace function insert_folder (
      text -- parent_folder_name
    , text -- new_folder_name
    ) returns void
    language plpgsql as '
    declare
        parent_folder_name alias for $1;
        new_folder_name alias for $2;

        parent_folder_rational record;
        new_child_number integer;
    begin
        select num, den
            into parent_folder_rational
        from folders where folder_name = parent_folder_name;

        select (count(folder_name) + 1)::integer into new_child_number
        from folders
        where ni_parent_numer(num, den) = parent_folder_rational.num
            and ni_parent_denom(num,den) = parent_folder_rational.den;

        insert into folders (folder_name, num, den) values (
            new_folder_name
            , ni_child_numer(
                parent_folder_rational.num
                , parent_folder_rational.den
                , new_child_number
                )
            , ni_child_denom(
                parent_folder_rational.num
                , parent_folder_rational.den
                , new_child_number
                )
            );
       return;
    end;
    ';

As a convenience, I’ve declared a parent_folder_rational record to hold the numerator and denominator of the parent folder. I could have used two integer variables such as parent_folder_num and parent_folder_den instead.

select insert_folder('root','usr');
select insert_folder('usr','local');
select insert_folder('local','src');
select insert_folder('local','pgsql');
select insert_folder('pgsql','bin');
select insert_folder('pgsql','data');
select insert_folder('pgsql','doc');
select insert_folder('doc','html');
select insert_folder('pgsql','include');
select insert_folder('pgsql','lib');
select insert_folder('pgsql','share');

And let’s create a view to display the folder hierarchy.

create or replace view folder_hierarchy as 
select 
      folder_name
    , num
    , den
    , num || '/' || den as folder_rational
    , repeat('  ',ni_distance(num,den, 3,2)) 
        || folder_name as folder_name_
    , ni_path(num,den) as folder_path
from folders 
order by folder_path;

select * from folder_hierarchy;
 folder_name | num  | den  | folder_rational |  folder_name_   | folder_path 
-------------+------+------+-----------------+-----------------+-------------
 root        |    3 |    2 | 3/2             | root            | 1
 usr         |    7 |    4 | 7/4             |   usr           | 1.1
 local       |   15 |    8 | 15/8            |     local       | 1.1.1
 src         |   31 |   16 | 31/16           |       src       | 1.1.1.1
 pgsql       |   59 |   32 | 59/32           |       pgsql     | 1.1.1.2
 bin         |  119 |   64 | 119/64          |         bin     | 1.1.1.2.1
 data        |  235 |  128 | 235/128         |         data    | 1.1.1.2.2
 doc         |  467 |  256 | 467/256         |         doc     | 1.1.1.2.3
 html        |  935 |  512 | 935/512         |           html  | 1.1.1.2.3.1
 include     |  931 |  512 | 931/512         |         include | 1.1.1.2.4
 lib         | 1859 | 1024 | 1859/1024       |         lib     | 1.1.1.2.5
 share       | 3715 | 2048 | 3715/2048       |         share   | 1.1.1.2.6
(12 rows)

Finding the path to a particular folder is the same as finding the ancestors of the node.

-- finding the path to data
select 
    f1.folder_name
    , ni_path(f1.num,f1.den) as folder_path
from folders f1
    , folders f2
where ni_distance(f2.num,f2.den, f1.num,f1.den) >= 0
    and f2.folder_name = 'data'
order by ni_distance(f2.num,f2.den, f1.num,f1.den) desc;
 folder_name | folder_path 
-------------+-------------
 root        | 1
 usr         | 1.1
 local       | 1.1.1
 pgsql       | 1.1.1.2
 data        | 1.1.1.2.2
(5 rows)

Likewise, finding the subfolders of a particular folder is the same as finding all of the descendants of the node.

-- finding the subfolders of pgsql
select
    f1.folder_name
    , f1.folder_name_
    , f1.folder_path
from folder_hierarchy f1
    , folders f2
where ni_distance (f1.num,f1.den, f2.num,f2.den) >= 0
    and f2.folder_name = 'pgsql'
order by f1.folder_path;
 folder_name |  folder_name_   | folder_path 
-------------+-----------------+-------------
 pgsql       |       pgsql     | 1.1.1.2
 bin         |         bin     | 1.1.1.2.1
 data        |         data    | 1.1.1.2.2
 doc         |         doc     | 1.1.1.2.3
 html        |           html  | 1.1.1.2.3.1
 include     |         include | 1.1.1.2.4
 lib         |         lib     | 1.1.1.2.5
 share       |         share   | 1.1.1.2.6
(8 rows)

Updating and deleting nodes requires a bit more work. To delete a node we need to close the hole left by the deleted node if it’s not the last child, which requires moving all of the “younger” children up, and updating all of their children to reflect their new position in the hierarchy. Moving a node involves filling the hole from its old position and inserting it in its new position. If one wants to insert it between two existing children, all of the “younger” children need to be moved further down the line, and their descendants updated. My folder model depends only on parent-child relationships, not child position, so I didn’t implement inserting a node between existing children, but it’s quite straightforward to do so. One reason to do this might be to keep folders in alphabetical order.

The fundamental operation in these updates is relocating a node and its descendants. Before finding the article where Tropashko describes updating the hierarchy, I made my own implementation using paths, being unable to figure out a more elegant method which I knew must exist.

The two functions ni_reroot_numer and ni_reroot_denom simply take the path to the node to be updated, truncate the path from the old root on up and attach it to the path to the new root, and then return the numerator and denominator of the new path. This is where having ni_path and ni_path_numer/ni_path_denom being proper inverses pays off. This is not the way I’d recommend doing this, but it does work.

create function ni_reroot_numer (
      integer  -- $1 old_numer
    , integer  -- $2 old_denom

    , integer  -- $3 old_root_numer
    , integer  -- $4 old_root_denom

    , integer  -- $5 new_root_numer
    , integer  -- $6 new_root_denom
    ) returns integer
    stable
    language sql as '
    select ni_path_numer (
        ni_path($5,$6) || substr(
            ni_path($1,$2)
            , length(ni_path($3,$4)) + 1
            , length(ni_path($1,$2)) - length(ni_path($3,$4))
            )
        );
    ';

create function ni_reroot_denom (
      integer  -- $1 old_numer
    , integer  -- $2 old_denom

    , integer  -- $3 old_root_numer
    , integer  -- $4 old_root_denom

    , integer  -- $5 new_root_numer
    , integer  -- $6 new_root_demon
    ) returns integer
    stable
    language sql as '
    select ni_path_denom (
        ni_path($5,$6) || substr(
            ni_path($1,$2)
            , length(ni_path($3,$4)) + 1
            , length(ni_path($1,$2)) - length(ni_path($3,$4))
            )
        );
    ';

After finding Tropashko’s follow-up article, I ported his translation functions to PL/pgSQL. The two complementary functions, ni_new_numer and ni_new_denom, use two helper functions, ni_normalize_numer and ni_normalize_denom, to reduce the fractions to simplest form. Functions described earlier such as the ni_x_ and ni_y functions could be rewritten to call the normalization functions rather than include the code inline.

create function ni_normalize_numer (
      integer -- numer
    , integer -- denom
    ) returns integer
    language plpgsql as '
    declare
    numer alias for $1;
    denom alias for $2;
    ret_numer integer;
    ret_denom integer;
    begin
        ret_numer := numer;
        ret_denom := denom;
        while floor (ret_numer / 2.0) = ret_numer / 2.0 loop
            ret_numer := ret_numer / 2;
            ret_denom := ret_denom / 2;
        end loop;
        return ret_numer;
    end;
    ';

create function ni_normalize_denom (
      integer -- numer
    , integer -- denom
    ) returns integer
    language plpgsql as '
    declare
    numer alias for $1;
    denom alias for $2;
    ret_numer integer;
    ret_denom integer;
    begin
        ret_numer := numer;
        ret_denom := denom;
        while floor (ret_numer / 2.0) = ret_numer / 2.0 loop
            ret_numer := ret_numer / 2;
            ret_denom := ret_denom / 2;
        end loop;
        return ret_denom;
    end;
    ';

And the ni_new_numer and ni_new_denom functions themselves. They take advantage of the fact that each subtree is a scaled version of any other, with the scaling factor k being simply the ratio of the denominators of the old and new nodes. There are three basic parts to the functions.

  • If the node happens to be the old origin, return the new origin.
  • For k ≥ 1, i.e., the subtree is “enlarging”, proceed normally. For k < 1, change the operation slightly to use the reciprocal of k, as we need an integer zoom factor for our integer math.
create or replace function ni_new_numer (
      integer -- old_numer
    , integer -- old_denom

    , integer -- old_origin_numer
    , integer -- old_origin_denom

    , integer -- new_origin_numer
    , integer -- new_origin_denom
    ) returns integer
    language plpgsql as '
    declare
        old_numer alias for $1;
        old_denom alias for $2;

        old_origin_numer alias for $3;
        old_origin_denom alias for $4;

        new_origin_numer alias for $5;
        new_origin_denom alias for $6;

        zoom_factor integer;

        ret_numer integer;
        ret_denom integer;

        ret_numer_temp integer;
    begin
        -- old is the old_origin
        if old_numer = old_origin_numer
            and old_denom = old_origin_denom
            then
            return new_origin_numer;
        end if;

        if old_origin_denom < new_origin_denom then
            zoom_factor := new_origin_denom / old_origin_denom;
                new_origin_denom
                , old_origin_denom
                , zoom_factor;
            ret_numer   := old_numer 
                    - old_origin_numer * old_denom / old_origin_denom;
            ret_denom   := old_denom * zoom_factor;
        else
            zoom_factor := old_origin_denom / new_origin_denom;
            ret_numer   := (old_numer 
                    - old_origin_numer * old_denom / old_origin_denom
                            ) * zoom_factor;
            ret_denom   := old_denom;
        end if;
        ret_numer_temp := ret_numer;
        ret_numer := ni_normalize_numer (ret_numer_temp, ret_denom);
        ret_denom := ni_normalize_denom (ret_numer_temp, ret_denom);
        if ret_denom < new_origin_denom then
            ret_numer := new_origin_numer 
                + ret_numer * new_origin_denom / ret_denom;
            ret_denom := new_origin_denom;
        else
            ret_numer := ret_numer 
                + new_origin_numer * ret_denom / new_origin_denom;
        end if;
        ret_numer_temp := ret_numer;
        ret_numer := ni_normalize_numer (ret_numer_temp, ret_denom);
        ret_denom := ni_normalize_denom (ret_numer_temp, ret_denom);

        return ret_numer;
    end;
    ';

create or replace function ni_new_denom (
      integer -- old_numer
    , integer -- old_denom

    , integer -- old_origin_numer
    , integer -- old_origin_denom

    , integer -- new_origin_numer
    , integer -- new_origin_denom
    ) returns integer
    language plpgsql as '
    declare
        old_numer alias for $1;
        old_denom alias for $2;

        old_origin_numer alias for $3;
        old_origin_denom alias for $4;

        new_origin_numer alias for $5;
        new_origin_denom alias for $6;

        zoom_factor integer;

        ret_numer integer;
        ret_denom integer;

        ret_numer_temp integer;
    begin
        -- old is the old_origin
        if old_numer = old_origin_numer
            and old_denom = old_origin_denom
            then
            return new_origin_denom;
        end if;

        if old_origin_denom < new_origin_denom then
            zoom_factor := new_origin_denom / old_origin_denom;
            ret_numer   := old_numer 
                    - old_origin_numer * old_denom / old_origin_denom;
            ret_denom   := old_denom * zoom_factor;
        else
            zoom_factor := old_origin_denom / new_origin_denom;
            ret_numer   := (old_numer 
                    - old_origin_numer * old_denom / old_origin_denom
                            ) * zoom_factor;
            ret_denom   := old_denom;
        end if;
        ret_numer_temp := ret_numer;
        ret_numer := ni_normalize_numer (ret_numer_temp, ret_denom);
        ret_denom := ni_normalize_denom (ret_numer_temp, ret_denom);
        if ret_denom < new_origin_denom then
            ret_numer := new_origin_numer 
                + ret_numer * new_origin_denom / ret_denom;
            ret_denom := new_origin_denom;
        else
            ret_numer := ret_numer 
                + new_origin_numer * ret_denom / new_origin_denom;
        end if;
        ret_numer_temp := ret_numer;
        ret_numer := ni_normalize_numer (ret_numer_temp, ret_denom);
        ret_denom := ni_normalize_denom (ret_numer_temp, ret_denom);

        return ret_denom;
    end;
    ';

Now we can use ni_new_numer and ni_new_denom as part of a function that deletes folders and performs the necessary housekeeping operations. It first checks to make sure the folder to be deleted has no subfolders (this is not a recursive delete), and that it is not the root node. Then, it finds number of siblings the folder has and how many are younger. The folder is deleted, and each younger sibling and its subtree is translated to the next eldest position.

create or replace function delete_folder (
    text -- folder_name
    ) returns boolean
    language plpgsql as '
    declare
        old_folder_name alias for $1;
        old_folder_child_number integer;
        sibling_count integer;
        old_folder record;
        parent_folder record;

        old_folder_child_count integer;
        old_folder_document_count integer;

        old_child record;
        new_child record;

    begin
        select num, den
            into old_folder
        from folders where folder_name = old_folder_name;

        select (count(folder_name))::integer 
            into old_folder_child_count
        from folders
        where ni_parent_numer(num, den) = old_folder.num
            and ni_parent_denom(num, den) = old_folder.den;

        -- check to make sure the folder is empty (it has no children)
        -- and it is not the root folder, raising exceptions in either case

        if old_folder_child_count = 0 
            and old_folder.num <> 3 
            and old_folder.den <> 2 then
            select 
                ni_parent_numer(old_folder.num,old_folder.den) as num
                , ni_parent_denom(old_folder.num,old_folder.den) as den
                into parent_folder;

            select (count(folder_name))::integer into sibling_count
            from folders
            where ni_parent_numer(num, den) = parent_folder.num
                and ni_parent_denom(num,den) = parent_folder.den;

            old_folder_child_number := 
                ni_sibling_number(old_folder.num, old_folder.den);

            delete from folders where folder_name = old_folder_name;

            if (sibling_count - old_folder_child_number) > 1 then  
                -- There are younger siblings of the deleted folder.
                -- Promote each by one child number to fill the
                -- hole left by the deleted folder
                for i in (old_folder_child_number + 1)..sibling_count loop

                    select 
                        ni_child_numer(parent_folder.num, parent_folder.den, i) as num
                        , ni_child_denom (parent_folder.num, parent_folder.den, i) as den
                    into old_child;

                    select 
                        ni_child_numer(parent_folder.num, parent_folder.den, i - 1) as num
                        , ni_child_denom (parent_folder.num, parent_folder.den, i - 1) as den
                    into new_child;

                    update folders set 
                        num = ni_new_numer(
                            num, den 
                            , old_child.num, old_child.den
                            , new_child.num, new_child.den
                            )
                        , den = ni_new_denom(
                            num, den 
                            , old_child.num, old_child.den
                            , new_child.num, new_child.den
                            )
                    where ni_distance( num,den, old_child.num,old_child.den) >= 0;
                end loop;
            end if;
            return true;
        elsif (old_folder_child_count) <> 0
            then raise exception ''Cannot delete non-empty folder.'';
        elsif (old_folder.num = 3 and old_folder.den = 2) 
            then raise exception ''Cannot delete root folder.'';
        end if;
    end;
    ';

And to try it out. I’ll delete the data folder, as it has younger siblings with subtrees which will test to see if these parts of the function work.

select delete_folder('data');
 delete_folder 
---------------
 t
(1 row)
select * from folder_hierarchy;
 folder_name | num | den | folder_rational |  folder_name_   | folder_path 
-------------+-----+-----+-----------------+-----------------+-------------
 root        |   3 |   2 | 3/2             | root            | 1
 usr         |   7 |   4 | 7/4             |   usr           | 1.1
 local       |  15 |   8 | 15/8            |     local       | 1.1.1
 src         |  31 |  16 | 31/16           |       src       | 1.1.1.1
 pgsql       |  59 |  32 | 59/32           |       pgsql     | 1.1.1.2
 bin         | 119 |  64 | 119/64          |         bin     | 1.1.1.2.1
 doc         | 235 | 128 | 235/128         |         doc     | 1.1.1.2.2
 html        | 471 | 256 | 471/256         |           html  | 1.1.1.2.2.1
 include     | 467 | 256 | 467/256         |         include | 1.1.1.2.3
 share       | 931 | 512 | 931/512         |         share   | 1.1.1.2.4
(10 rows)

The folder data is gone, and doc, html, include, and share have all been appropriately updated. The delete_folder function can be modified to handle moving a folder as well by adding a loop that inserts the folder and its subfolders into the new location before promoting the siblings.

Limitations

The denominator of binary fractions gets large very quickly, moving through the available integers at an astonishing rate. Using int4, we’re limited to 29 children for node 1 (1.29 is the youngest possible child), and 29 generations after node 1 (1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1 is the most deeply nested node possible), with the maximum nesting depth and the number of children decreasing the with child number and depth: the sum of the numbers in the path must be less than n − 2, where n is the number of bits avaible. Using int8, we can have 62 children of node 1, and 62 generations. This limitation means there should be additional code which checks the size and shape of the tree when it’s translated to make sure it will fit in its new position. This could be done by finding the maximum sum of materialized path values for each point of the subtree to be moved, and check to see if this and the sum of the sum of the path values for the new root is less than the maximum.

Tropashko’s nested interval work has been brought up at least twice on the PostgreSQL mailing lists. In March 2004 Manfred Koizar—referring to nested intervals as the Obfuscated Materialized Path Method, or OMPM—wrote he feels nested intervals is “irreparably broken”, its primary failing being that “If you try to store materialized path information in a fixed number of integers you run out of bits very soon.” Tropashko responded (in a post that appears to only have hit the newsgroups) but did not directly address the overflow issue. Looking at other encoding algorithms that may be more efficient with integer use, or at other data types, perhaps there’s still hope. What is needed is an encoding schema that encompasses the range of hierarchies you want to encode. All datatypes and encoding schemes have limitations. The question is whether the limitations restrict what you’re trying to do.

Performance issues have been discussed on the list as well. The ni_distance function, used to calculated children and parents requires a full table scan as an index cannot use this information. Alternatives such as Teodor Sigaev and Oleg Bartunov’s contrib/ltree, Joe Conway’s connectby() (in contrib/tablefunc), and storing paths directly using strings and ASCII (as well as the possibility of using bytea) were mentioned.

There are at least two methods of working around the datatype limitation. Developing a PostgreSQL base type for rational numbers, perhaps based on Perl’s BigRational or other implementation would allow a much larger range of numbers to be used. The availability of a rational data type would also eliminate the necessity of numerator and denominator variants and their accompanying duplicated code. Another alternative is to use a different algorithm for generating the tree, such as Farey fractions or continued fractions, both of which have been explored by Tropashko, who has mentioned in a recent newsgroup point that he is rewriting his article on using the latter. Continued fractions looks particularly promising, and I hope to explore these functions sometime soon.

Code

For your convenience, here are the ni_ functions and the example data. You will also need to have PL/pgSQL installed in your database and create the instr functions (with the appropriate modifictions). The code was developed against PostgreSQL 7.4.1.

References

Posted by Michael at July 10, 2004 12:06 AM