Data lineage for EasyMorph Server jobs

cgr-stez · July 5, 2024, 9:37am

Good morning, I would like to propose a new implementation. When you have many jobs defined for example 100 jobs scheduled on EasyMorph Server, it begins to be essential to understand which path the data takes, which jobs use certain data and how they use it (reading/writing for example) this to determine the dependencies between various projects.
It would be really useful to be able to determine this information from the list of active jobs in EasyMorph Server.
I tried to develop a job that created an output of all the imported and exported tables, but it is not easy to manage as determining the tables from custom sql scripts is not easily manageable. Furthermore, when working on project files it is not possible to insert the parameters set during the execution of the job on the server.
If a job fails for example due to network connection problems, there can often be a cascade of problems, automatic documentation would be needed (so even if the project files are changed it is automatically updated) in such a way as to be able to immediately identify which processes are involved that use data.

olysak · July 5, 2024, 10:16am

Hi!

If I understand correctly, you are suggesting that EasyMorph Server should automatically generate a dependency graph for running tasks. If this is indeed your suggestion, I have a few concerns regarding its implementation in a generic case, given the various methods of data export and import.

While it is feasible to track straightforward .dset file imports and exports, this process becomes more complex when output paths are computed parameters. The complexity further escalates with SQL-related actions, especially those involving custom SQL queries. Finally, generating such a dependency graph seems nearly impossible when considering data transfers involving all actions interacting with external services.

cgr-stez · July 5, 2024, 10:43am

It is certainly not an easy thing but it is undoubtedly a useful thing for those who have many flows running.
When there are many flows running, it becomes very difficult to debug and restore normal functionality if a problem occurs in a specific job which imports data on which other jobs are then based.
Furthermore, it is useful for keeping the documentation updated regardless of who modifies the jobs.

dgudkov · July 5, 2024, 1:20pm

Hi @cgr-stez,

Did you know that you can collect and analyze workflow dependencies with the "Project metadata" action? The action collects all the connectors and file dependencies of a workflow. You can then merge and analyze this information with EasyMorph or a BI tool.

cgr-stez · July 5, 2024, 2:00pm

Yes, I know in fact, as I was saying I made a project that extracts a lot of information, but it is not related to the jobs performed but only to the actions and parameters in the project itself.
In any case, my request arises from the need to have a global vision of the data managed with the EasyMorph tool and therefore of their use flow. An overall vision that is currently missing.

cvo · July 5, 2024, 3:08pm

Greetings,

May I suggest to use the "Workflow run ID", this feature is really powerful.
If the sqlite database that stores the connectors and the shared memory could also store the history of projects execution. (project name, workflow run ID, timestamp, status, failure...)
Also, if we could have the "workflow run ID" in the metadata of generated dset and the capabilities to extract the metadata of the dataset files (like for the QVD files).
This would allow to build a full "compliance chain", knowing when, by who and with which project a dataset has been generated.
currently I add a column with the "workflow run id" in most of the dst files I generate, I also have a database that store all the project history but that's not that easy to maintain.

Regards

reynsnivea · July 7, 2024, 12:43pm

Hi

We also use a metadata table with the run_id of our ETL-flow and store it in a separate metadata table in the database.

What I would like to have is that the current lineage feature in EasyMorph Desktop would be able to track a variable end-to-end. We create our ETL-chain by call different EasyMorph projects (staging layer, ODS layer, presentation layer). Because of that, it is not easy to have complete lineage of a column across all linked projects.

dgudkov · July 7, 2024, 7:15pm

I can't see how absolute column lineage is possible, even in theory. If you have a "Schroedinger column" calculated as:

[C] = IF random() > 0.5 THEN [A] ELSE [B]

What's the lineage here, A -> C or B -> C?

And then we have pivots/unpivots, lookups, conditional branching in workflows, and many other transformations where it's not possible to establish column lineage.

cgr-stez · July 8, 2024, 10:45am

A solution could be to assign tags (manageable as desired) for each action on the project. By manually assigning these tags, the path of the data could then be traced.

Another useful thing, even if off topic, is certainly having the possibility of managing parameters for the project that are global at project level so as not to have to copy and paste them into the submodules of the project itself.

reynsnivea · July 9, 2024, 12:59pm

Hi

In that case for me C depends on both A and B even though the assignment is random.
We would like to be able to see dependencies in case we have to refactor flows for intance.
I understand it is a complex problem.

Kind regards
Nikolaas

cvo · July 9, 2024, 8:14pm

Hi,

I did an application for that last year.
This way you'll get the full lineage inside a project.
You'll need Qlik Sense to read the result.
In the dset files you generate, you'll have to put the module, group and table that generate the file.
then you'll have the full datalineage you are looking for.

Regards