Tree edge list expansion

As part of several projects, I need to expand a tree structure, e.g.:

  • node 0
    • node 1
    • node 2
      • node 4
      • node 5
    • node 3
      • node 6
        • node 7

I use the following terminology:

  • root node: node without parent (0)
  • internal node: node with parent and child(ren) (2, 3, 6)
  • leaf node: node without a child (1, 4, 5, 7)

The tree structure is a deliberate and mostly accurate simplification: in some cases, the structure is actually a directed acyclic graph (DAG), and in other cases, there's not one single tree but a forest.

input

EDITED based on tesing using the Hierarchy path action:

  • added root nodes as child with empty parent
  • added a test case for a singleton tree
  • added a test case for multiple trees (a forest)
  • added a test case for directed acyclic graphs

The tree structure is represented by an edge list, e.g.:

parent child
0
0 1
0 2
0 3
2 4
2 5
3 6
6 7

A single, parentless, childless node is added to the test data to test how the flow deals with multiple roots (a forest instead of a single tree), and to test how the flow deals with singleton trees.

parent child
8

A simple "diamond" structure is added to the test data to test how the flow deals with directed acyclic graphs (trees in which nodes may have more than one parent).

parent child
9
9 10
9 11
10 12
11 12

output

EDITED to reflect added test cases (see input above), and changed the desired output to more closely align with that of the Hierarchy path action.

The output may come in two shapes:

  • paths
  • reachability pairs

and which rows get included in the output is determined by the use case. Options:

  1. where should paths start/end?
    • only from root nodes?
    • only to leaf nodes
    • only to/from specific nodes?
  2. should partial paths be included?
  3. should the path from a node to itself be included?

The below table shows what the output would look like for questions 2 and 3 (3 is always a subset of 2: a self-path is necessarily a partial path). Question 1 is a matter of choosing specific ancestors and descendants.

ancestor path descendant 2. 3.
0 0 0 x x
0 0|1 1
0 0|2 2 x
0 0|2|4 4
0 0|2|5 5
0 0|3 3 x
0 0|3|6 6 x
0 0|3|6|7 7
1 1 1 x x
2 2 2 x x
2 2|4 4 x
2 2|5 5 x
3 3 3 x x
3 3|6 6 x
3 3|6|7 7 x
4 4 4 x x
5 5 5 x x
6 6 6 x x
6 6|7 7 x
7 7 7 x x
8 8 8 x x
9 9 9 x x
9 9|10 10 x
9* 9|10|12 12*
9 9|11 11 x
9* 9|11|12 12*
10 10 10 x x
10 10|12 12 x
11 11 11 x x
11 11|12 12 x
12 12 12 x x

* reachability pairs should not be duplicated, even if multiple paths exist

Hi Laurens,

Did you check the "Hierarchy path" action? It seems like it would be trivial to do what you need with the help of this action, as long as the graph is acyclic.

1 Like

Hi @dgudkov,
Definitely a great action to work with! It needs some extra work to accommodate for self paths, and to get to the desired output, but that should be no problem.

EDITED to reflect added test cases (see input in the original post) and changed desired output (see output in the original post). Then EDITED again to fix a bug and add test data inflation.

Here's the test set I use to verify it all works.

test data - input.csv (88 Bytes)
test data - output.csv (682 Bytes)
testbed.morph (25.4 KB)

dummy.morph (3.7 KB)

I tried implementing this with the Hierarchy path action suggested by @dgudkov.

Hierarchy path.morph (11.4 KB; EDITED to match changed input and output specifications)

findings

  • minor changes to the data required, mainly adding a root node with empty parent.
    This also pointed out a flaw in the original input specification: single nodes without parent or child(ren) could not be represented, but are present in the use cases (see below). I've adjusted the input specification and test data.
  • paths are extracted very easily using the Split delimited text into columns action
  • reachability pairs are extracted fairly easily using a combination of keepbefore() and mirror()
  • the action only generates all partial paths from a root, e.g. 0, 0|3, 0|3|6 and 0|3|6|7
    the action does not generate all partial paths to a leaf, e.g. 3, 3|6, 3|6|7, 6, 6|7 and 7 are missing
    the missing partial paths can be derived fairly easily
  • filters can be applied to specify the options
  • directed acyclic graph (DAG) don't work because the child is a primary key, so duplicates (required for specifying multiple parents) are not allowed
    I have not found a workaround for this and welcome any suggestions

So most of it works well. How important is the directed acyclic graph? Unfortunately, one of the primary use cases can't do without it.

use case

Users are assigned roles:

user role
john 1
jane 2
minnie 3

Roles can be (and often are) nested, e.g.:

  • role 1
    • role 2
      • role 3

Actual permissions can be (and are) assigned to any role (not only the leaf), e.g.:

role permission
1 read invoices
1 create invoices
2 modify invoices
3 approve invoices

In this specific use case inheritance goes up the tree, i.e. role 2 has all permissions of roles 2 and 3, and role 1 has all permissions of roles 1, 2 and 3. A junior would get role 3 (fewest permissions), a senior role 2 (more permissions) and a manager role 1 (most permissions).

This example is still simple and (crucially) a strict tree, not a directed acyclic graph. In reality, the lowest level roles contain permissions for individual tasks, like "view sales orders" or "modify AP master data". These tasks are included in the (job) roles of several employees, and therefore have multiple parent roles.

I implemented this with a pair of Repeat actions, one for the tree expansion, another for partial paths (pretty much the same as with the Hierarchy path.

Repeat.morph (27.1 KB)

findings

  • moving the output even closer to that of Hierarchy path simplified the flow. This is a testament to the design of the I/O formats for Hierarchy path.
  • I didn't find a way of having the input passed to the Repeat action, so I resorted to writing it to disk outside the Repeat action, and reading it from disk inside the Repeat action.
  • The actual tree expansion was the easy bit. My addition of specifications (and guarantees) about partial paths and self-paths added quite a bit of logic.
  • I didn't expect to even get close to the performance of a native easymorph action, but I shouldn't have bothered checking: yes, the native Hierarchy path is probably faster, but it doesn't matter. Both run in milliseconds, even on an artificially inflated dataset of 1 million rows.
  • Fortunately, I did check performance, because it pointed me to memory being a limiting factor. Below is the max memory consumption for the entire flow (the difference between the two options.
    rows max. memory
    0 175MB
    1000 500MB
    10000 740MB
    100000 2.5GB
    1000000 9.5GB

The memory consumption came as somewhat of a surprise to me. The as the data for the largest input (1 million rows, 2 columns with longest total text length of 4 characters) would take up 12MB uncompressed (assuming null-terminated strings and UTF16). This is purely theoretical, of course, and I realise that the data structure would take up extra space. On the other hand, the data should compress well; even a simple integer replacement of unique values would reduce the theoretical data size to 3MB. Saving the tree to disk showed that in practice, the data structure takes up much more space: on disk, the input data takes up 250MB. I might dig into this a bit more.

1 Like

EasyMorph uses columnar compression that doesn't work well on columns with long unique strings. In such cases, columnar compression even increases data, instead of compressing it. This can be particularly noticeable on synthetic datasets.

In real life, datasets typically have mixed column types, so overall columnar compression does compress data.

Besides that, the "Repeat" action keeps all intermediate results of each iteration in memory and that increases memory consumption as well.