How to store and update data in memory only?

RJO · February 16, 2023, 12:51pm

Hi guys,

Imagine you want to call a module iteratively over a dataset, and in that module you want to update a kind of temporary table that can store data. But as you call the module many times, like thousands or millions of times, you want that storage to be fast and in memory only. How would you do that?

I mean, today you can iterate, passing only one table, but this table cannot be updated for recursive purposes. I don’t see a way to store tables in memory just for the duration of the process, except maybe writing datasets in the called module, but that uses the file system, so it will be slower, and you don’t necessarily want the data to be stored permanently.

What if there was a type of storage that keeps things only in EasyMorph memory? Something like a “virtual dataset”?

dgudkov · February 16, 2023, 1:37pm

It already exists!

To recursively update an in-memory table, use the “Repeat” action for iterations instead of “Iterate table”. In the “Repeat” action, the output dataset of one iteration becomes the input dataset of the next iteration, and therefore it can be modified in each iteration.
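As a rough analogy only (plain Python, not EasyMorph's implementation), the feedback behaviour described above can be sketched like this. The function names and the fixed iteration count are assumptions made for illustration; the real action has its own configuration.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
Table = List[Row]


def repeat(table: Table, step: Callable[[Table], Table], iterations: int) -> Table:
    """Run `step` repeatedly, feeding each output table into the next call."""
    for _ in range(iterations):
        table = step(table)  # output of this iteration is the next input
    return table


def add_running_total(table: Table) -> Table:
    """Hypothetical iteration body: append one row derived from the last row."""
    last = table[-1]
    n = last["n"] + 1
    return table + [{"n": n, "total": last["total"] + n}]


result = repeat([{"n": 0, "total": 0}], add_running_total, 3)
```

The table grows by one row per iteration, and every iteration sees everything the previous ones produced, without touching the disk.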

PS. See the help article on the “Repeat” action to understand better how it works.

cvo · February 16, 2023, 4:04pm

Greetings,

I’d like to be able to add a table (or even several, up to 9) to the “Repeat” action, like in the “Iterate table” action.
The “Repeat” action is very powerful and provides graph exploration features (depth-first search, breadth-first search, reachability, transitive closure, component extraction…). I can conduct fraud detection inquiries within a few minutes instead of days.
But most of the time I have to export the datasets I need to a local .dset file and load them back in the repeat loop.
An “iterate tables repeat action” would be fantastic, as it would “sandbox” the data exchanges between the calling module and the repeat module.

Regards

RJO · February 17, 2023, 8:17am

All right, it exists for one table, but what if you need to store 3 tables?

David · February 17, 2023, 9:45am

I second that! Having a Repeat or Iterate Tables action (with the ability to pass and fetch multiple distinct datasets in memory to and from the submodule) would be great!

dgudkov · February 17, 2023, 11:24am

That’s an interesting twist in the discussion.

What are the downsides of using .dset files? Are the downsides grave enough to avoid using .dset?

RJO · February 17, 2023, 12:44pm

Downside: .dset files use disk storage, which is much slower than live memory. That makes a big difference when you are doing millions of calls.

cvo · February 18, 2023, 7:35pm

Greetings,

Most of my work these days is about fraud detection in telco.
This is not classical fraud detection like international revenue sharing or bypass; it's a little more elaborate.
For that I designed an application that is still evolving but is already quite big (currently 37 modules, 76 groups, 670 tables and 3713 actions).
It makes heavy use of graph theory. One twist is that I handle hypergraphs and multigraphs (an edge can link more than 2 vertices, and vertices can be linked by several edges).
So there is no way to rely on any existing Python, Java or R library, nor on a graph database.

I've created modules based on Repeat or Iterate table for depth-first search, breadth-first search, reachability, transitive closure, component extraction...
Some of these modules can be used dozens of times in the same workflow.
So I export tables to .dset files in the calling module and reimport them in the called modules (or sub-modules, or sub-sub-modules...).
I can't simply use the result of the previous iteration in each loop, because if I join the previous result with itself, I get a graph exploration in relative mode.
Imagine a lift: on each floor, the floor below is -1 and the floor above is +1, and this is updated every time you reach a floor. Now try to get out of the building.
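One possible reading of the lift analogy, sketched in Python rather than EasyMorph: the loop needs a fixed reference table (the edges) next to the state that changes on each iteration (the vertices reached so far). The edge names and data below are made up for illustration, and this is not the poster's actual workflow.

```python
from typing import Dict, FrozenSet, Set


def reachable(hyperedges: Dict[str, FrozenSet[str]], start: str) -> Set[str]:
    """Breadth-first reachability where an edge may link more than two vertices.

    `hyperedges` is the static reference data: it never changes in the loop.
    `reached` and `frontier` are the state carried from one iteration to the next.
    """
    reached = {start}
    frontier = {start}
    while frontier:
        next_frontier: Set[str] = set()
        for members in hyperedges.values():
            if members & frontier:  # this edge touches the current frontier
                next_frontier |= members - reached
        reached |= next_frontier
        frontier = next_frontier
    return reached


# Hypothetical data: edge ids are keys, so parallel edges (multigraph) are allowed.
edges = {
    "e1": frozenset({"a", "b", "c"}),
    "e2": frozenset({"c", "d"}),
    "e3": frozenset({"x", "y"}),
}
component = reachable(edges, "a")
```

Because every iteration looks up the same `edges`, positions stay absolute. If only the loop state were available, each step could describe neighbours only relative to the previous step, which is the problem described above.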

Performance and reliability of the results are amazing, but:

Maintenance of the workflow needs to be meticulous: I have to be sure to finish one loop and clean up all the temporary files before starting another one, so each change is a moment of pure anxiety.
I tried to use shared memory instead. It's very hard to maintain. The same goes for SQLite export/import.
As I put some data outside the system (disk access in write mode, then in read mode, for the .dset files), I have to generate a data lineage report for each run instead of simply validating the workflow. For compliance purposes, I have to prove that these files haven't been altered between the write and read operations. I use osquery to show that no interfering operations were running on my computer at the same time.
I was considering developing a self-service application for that, running on an EM server, but how can the temporary files be managed if several users launch the project at the same time and potentially mess with each other's temporary files?

Regards

dgudkov · February 19, 2023, 11:11am

It sounds like we need to make an in-memory store of datasets with the following commands:

Remember dataset
Recall dataset
Forget dataset
Forget all datasets
List datasets

The in-memory datasets remain accessible from the called modules/projects. But once the main project finishes, they are erased from memory. For persistent storage, .dset files remain available.

I guess it would also help if an in-memory dataset could have an optional annotation. Also, the Desktop should have a dialog to view current in-memory datasets.
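To make the proposed commands concrete, here is a minimal Python sketch of such a store. It is an illustration of the idea only: the class and method names are hypothetical and say nothing about how EasyMorph would implement the feature.

```python
import copy
from typing import Any, Dict, List

Table = List[Dict[str, Any]]


class InMemoryDatasetStore:
    """Hypothetical store that lives only as long as the main process."""

    def __init__(self) -> None:
        self._tables: Dict[str, Table] = {}
        self._annotations: Dict[str, str] = {}

    def remember(self, name: str, table: Table, annotation: str = "") -> None:
        # Store a copy so later changes by the caller don't leak into the store.
        self._tables[name] = copy.deepcopy(table)
        self._annotations[name] = annotation

    def recall(self, name: str) -> Table:
        # Raises KeyError if the dataset was never remembered or was forgotten.
        return copy.deepcopy(self._tables[name])

    def forget(self, name: str) -> None:
        self._tables.pop(name, None)
        self._annotations.pop(name, None)

    def forget_all(self) -> None:
        self._tables.clear()
        self._annotations.clear()

    def list_datasets(self) -> Dict[str, str]:
        # Map each dataset name to its optional annotation.
        return dict(self._annotations)


store = InMemoryDatasetStore()
store.remember("edges", [{"src": "a", "dst": "b"}], annotation="static reference")
edges_copy = store.recall("edges")
```

Copying on both `remember` and `recall` is a design choice of this sketch: a called module can modify what it recalled without silently changing the stored dataset.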

cvo · February 19, 2023, 12:19pm

On my side, I realized I have a workaround: I can use “append another table” before calling the repeat loop to merge all the tables I need in the loop, split the table in the first step of the repeat loop, and manage the rest with “skip actions on conditions”.
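The pack-and-split idea behind this workaround can be sketched in Python as follows. The `_source` tag column is an assumption added for illustration (the post does not say how the tables are told apart after merging), and the table names are made up.

```python
from typing import Any, Dict, List

Table = List[Dict[str, Any]]


def pack(tables: Dict[str, Table]) -> Table:
    """Append several tables into one, tagging each row with its origin."""
    return [
        {"_source": name, **row}
        for name, rows in tables.items()
        for row in rows
    ]


def unpack(packed: Table) -> Dict[str, Table]:
    """Split a packed table back into the original tables using the tag."""
    tables: Dict[str, Table] = {}
    for row in packed:
        data = {key: value for key, value in row.items() if key != "_source"}
        tables.setdefault(row["_source"], []).append(data)
    return tables


original = {
    "frontier": [{"vertex": "a"}],
    "edges": [{"edge": "e1", "vertex": "a"}, {"edge": "e1", "vertex": "b"}],
}
single_table = pack(original)     # what would be passed into the loop
restored = unpack(single_table)   # first step inside the loop
```

One limitation of this sketch: a table with no rows leaves no trace in the packed table, so it cannot be restored.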

For the in-memory datasets, I guess we would load them before calling a module in a repeat loop, recall them within the loop, and erase them after the repeat action.
This is powerful, very powerful. But how do you manage it if a sub-module is called several times in parallel?
So with the annotation, you may need to store the hierarchy of the ‘generating’ action: master module.table.action number.called module.table.action number.“Remember dataset action”

In-memory datasets would change the way I use EM.

dgudkov · February 19, 2023, 12:41pm

You can try using the "Start/finish exclusive access" action. That's the purpose of the action - synchronize simultaneous access to a resource.
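As a general illustration of what synchronizing simultaneous access means (a Python analogy with a lock, not a description of how the EasyMorph action works internally), consider several workers updating one shared dataset. All names below are hypothetical.

```python
import threading
from typing import Dict, List, Tuple

shared: Dict[str, List[Tuple[int, int]]] = {"visited": []}
shared_lock = threading.Lock()


def worker(worker_id: int, count: int) -> None:
    for item in range(count):
        # Only one worker at a time may run the read-modify-write below.
        with shared_lock:
            current = list(shared["visited"])
            current.append((worker_id, item))
            shared["visited"] = current


def run_workers(workers: int, count: int) -> int:
    """Start the workers, wait for all of them, return the number of rows stored."""
    shared["visited"] = []
    threads = [
        threading.Thread(target=worker, args=(worker_id, count))
        for worker_id in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(shared["visited"])


total_rows = run_workers(4, 100)
```

Without the lock, two workers could read the same version of the list and one of the two appended rows would be lost.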

RJO · February 20, 2023, 5:58am

Definitely YES!

cvo · February 20, 2023, 12:55pm

It would be a very powerful feature, but it’s hard to figure out all the possibilities. In a way, a dataset could be seen as a global variable of a project.
The “start/finish exclusive access” action may need to lock several resources. I think you’ll need to create a specific action dedicated to in-memory datasets, or add “freeze dataset” and “release dataset” to the commands.
“List datasets” introduces the capability to manage conditional workflows.
YES+++ for me

Regards