Imagine you want to call a module iteratively over a dataset, and in that module you want to update a kind of temporary table that stores data. Because the module is called many times, thousands or even millions of times, you want that storage to be fast and held only in memory. How would you do that?
Today you can iterate and pass a single table, but that table cannot be updated for recursive purposes. I don’t see a way to keep tables in memory just for the duration of the process, except maybe writing datasets from the called module, but that goes through the file system, so it will be slower, and you don’t necessarily want the data to be stored permanently.
What if there were a type of storage that keeps things only in EasyMorph memory? Something like a “virtual dataset”?
To recursively update an in-memory table, iterate with the “Repeat” action instead of “Iterate table”. In the “Repeat” action, the output dataset of one iteration becomes the input dataset of the next iteration, so it can be modified in each iteration.
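For readers who think in code, the hand-off between iterations can be sketched roughly as below. This is only a Python analogy, not how EasyMorph implements the action: `repeat`, `double_until_100`, the `keep_going` flag and the `max_iterations` guard are all made-up names for illustration.

```python
def repeat(step, dataset, max_iterations=1000):
    """Feed the output of each iteration into the next one.

    `step` is a hypothetical stand-in for the called module: it takes the
    current dataset and returns (new_dataset, keep_going).
    """
    for _ in range(max_iterations):
        dataset, keep_going = step(dataset)
        if not keep_going:
            break
    return dataset


def double_until_100(rows):
    # Example step: double every value, stop once any value reaches 100.
    new_rows = [{"value": row["value"] * 2} for row in rows]
    keep_going = all(row["value"] < 100 for row in new_rows)
    return new_rows, keep_going


result = repeat(double_until_100, [{"value": 3}])
print(result)
```

The only state carried from one pass to the next is the single dataset returned by `step`, which is why everything the loop needs has to travel inside that one table.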
I’d like to be able to pass an additional table (or even several, up to 9) to the “Repeat” action, as in the “Iterate table” action.
The “Repeat” action is very powerful and provides graph exploration features (depth first search, breadth first search, reachability, transitive closure, component extraction…). I can conduct fraud detection investigations within a few minutes instead of days.
But most of the time I have to export the datasets I need to a local dset file and load them back inside the repeat loop.
An “iterate tables repeat action” would be fantastic as it would “sandbox” the data exchanges between the calling module and the repeat module.
Most of my work these days is about fraud detection in telco.
This is not classical fraud detection like international revenue sharing or bypass; it’s a little more elaborate.
For that I designed an application that is still evolving but is already quite big (currently 37 modules, 76 groups, 670 tables and 3713 actions).
It relies heavily on graph theory. One twist is that I handle hypergraphs and multigraphs (an edge can link more than 2 vertices, and vertices can be linked by several edges).
So there is no way to rely on any existing Python, Java or R library, and no graph database either.
I’ve created modules based on Repeat or Iterate tables for depth first search, breadth first search, reachability, transitive closure, component extraction…
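To give an idea of what such a module computes, here is a rough Python sketch of reachability by breadth first search over a hypergraph/multigraph stored as a plain two-column table. It is not my EasyMorph workflow: the `(edge_id, vertex)` layout, the function name `reachable` and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict


def reachable(hyperedge_rows, start):
    """Return every vertex reachable from `start`.

    `hyperedge_rows` is an iterable of (edge_id, vertex) pairs (assumed
    layout). One edge_id may list any number of vertices (hypergraph) and
    several edge_ids may link the same vertices (multigraph).
    """
    members = defaultdict(set)   # edge_id -> vertices on that edge
    incident = defaultdict(set)  # vertex  -> edge_ids touching it
    for edge_id, vertex in hyperedge_rows:
        members[edge_id].add(vertex)
        incident[vertex].add(edge_id)

    visited = {start}
    frontier = {start}
    while frontier:
        # Expand the frontier by one step using the fixed edge table.
        next_frontier = set()
        for vertex in frontier:
            for edge_id in incident[vertex]:
                next_frontier |= members[edge_id]
        frontier = next_frontier - visited
        visited |= frontier
    return visited


rows = [
    ("e1", "A"), ("e1", "B"), ("e1", "C"),  # one edge linking three vertices
    ("e2", "C"), ("e2", "D"),
    ("e3", "C"), ("e3", "D"),               # parallel edge between C and D
    ("e4", "X"), ("e4", "Y"),               # separate component
]
print(sorted(reachable(rows, "A")))
```

Note that the loop needs two things on every pass: the changing frontier and the unchanged edge table. That second, fixed table is exactly what a loop that only passes one dataset along cannot carry directly.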
Some of these modules can be used dozens of times in the same workflow.
So I export tables to dset files in the calling module and re-import them in the called modules (or sub-modules, or sub-sub-modules…).
I can’t simply use the result of the previous iteration in each loop: if I join the previous result with itself, I get a graph exploration in relative mode.
Imagine a lift: on each floor, the floor below is -1 and the floor above is +1, and this is updated every time you reach a floor. Now try to get out of the building.
Performance and reliability of the results are amazing, but:
the maintenance of the workflow has to be meticulous: I have to be sure to finish one loop and clean up all the temporary files before starting another one, so each change is a moment of pure anxiety.
I tried to use shared memory instead. It’s very hard to maintain. The same goes for a SQLite export/import.
because I put some data outside the system (disk access in write mode, then in read mode, for the dset files), I have to generate a data lineage report for each run instead of simply validating the workflow. For compliance purposes, I have to prove that these files haven’t been altered between the write and read operations. I use osquery to show that no interfering operations were running on my computer at the same time.
I was considering developing a self-service application for this running on an EM Server, but how do I manage these temporary files if several users launch the project at the same time and potentially mess with each other’s temporary files?
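Outside EasyMorph, the last two concerns (unaltered files and concurrent runs) are usually handled along these lines. This is only a Python sketch under my own assumptions, not something EasyMorph provides: the file name `edges.dset` and the placeholder bytes are invented, and a matching hash only shows that the content is the same at read time as at write time.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so that large exports do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# One private directory per run, so parallel runs never share temporary files.
with tempfile.TemporaryDirectory(prefix="run_") as run_dir:
    export_path = Path(run_dir) / "edges.dset"  # hypothetical file name
    export_path.write_bytes(b"placeholder bytes standing in for an export")
    hash_at_write = sha256_of(export_path)

    # ... the called module would read the file here ...

    unchanged = sha256_of(export_path) == hash_at_write

# The directory and everything in it are removed when the block exits.
print(unchanged, Path(run_dir).exists())
```

Logging the hash at write time and again at read time gives a simple record for the lineage report, and the per-run directory removes the manual cleanup step.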
On my side, I realized I have a workaround: I can use “append another table” before calling the repeat loop to merge all the tables I need into one, then split that table in the first step of the repeat loop and manage the rest with “skip actions on conditions”.
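In code terms the workaround amounts to tagging each row with the table it came from, stacking everything into one table, and splitting on the tag at the start of each iteration. A minimal Python sketch of the idea, where the `_source` column name and the `pack`/`unpack` helpers are hypothetical:

```python
def pack(tables):
    """Stack several tables into one, tagging each row with its origin.

    `tables` maps a table name to a list of row dicts.
    """
    return [
        {"_source": name, **row}
        for name, rows in tables.items()
        for row in rows
    ]


def unpack(packed_rows):
    """Split a packed table back into its original tables using the tag."""
    tables = {}
    for row in packed_rows:
        rest = {key: value for key, value in row.items() if key != "_source"}
        tables.setdefault(row["_source"], []).append(rest)
    return tables


tables = {
    "frontier": [{"vertex": "A"}],
    "edges": [{"edge": "e1", "vertex": "A"}, {"edge": "e1", "vertex": "B"}],
}
packed = pack(tables)
print(len(packed), unpack(packed) == tables)
```

Two caveats with this sketch: a table with no rows disappears in the round trip, and in a real appended table the columns of all sources are merged, so the unused cells are empty rather than absent.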
For the in-memory datasets, I guess we would load them before calling a module in a repeat loop, recall them within the loop and erase them after the repeat action.
This is powerful, very powerful. But how would it be managed if a sub-module is called several times in parallel?
So with the annotation, you may need to store the hierarchy of the ‘generating’ action: master module.table.action number.called module.table.action number.“Remember dataset action”
It would be a very powerful feature but it’s hard to figure out all the possibilities. In a way a dataset could be seen as a global variable of a project.
The “start/finish exclusive access” may need to lock several resources. I think you’ll need to create a specific action dedicated to in-memory datasets, or add “freeze dataset” and “release dataset” to the commands.
“List datasets” introduces the capability to manage conditional workflows.
YES+++ for me