Data validation, profiling, and quality checks using EM

Hi there,

I would like to know whether, besides creating our own logic using expressions and formulas and selecting options in the context menu of each action, there is an easier way to do the following data quality checks in EM (a rough sketch of what I mean follows the list):

  1. data profiling (where the profile can be stored as a result or used downstream in the flow)
  2. data structure checks for incoming files (e.g. that the number of columns and the data types remain the same)
  3. data integrity checks (e.g. detecting an empty input file, which happens with files downloaded automatically from other systems)
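
To illustrate, here is a rough sketch of these three checks in Python/pandas (just to convey the idea; the expected schema and function names are made up, and this is not EasyMorph functionality):

```python
# A rough sketch (Python/pandas, not EasyMorph) of the three checks;
# the expected schema below is a made-up example.
import pandas as pd

EXPECTED_SCHEMA = {"id": "int64", "amount": "float64", "region": "object"}

def profile(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Profiling: per-column null counts, distinct counts, and types,
    #    returned as a table so it can be stored or used downstream.
    return pd.DataFrame({
        "nulls": df.isna().sum(),
        "distinct": df.nunique(),
        "dtype": df.dtypes.astype(str),
    })

def check_structure(df: pd.DataFrame) -> list:
    # 2. Structure: same columns and same data types as expected.
    errors = []
    if list(df.columns) != list(EXPECTED_SCHEMA):
        errors.append(f"unexpected columns: {list(df.columns)}")
    else:
        for col, dtype in EXPECTED_SCHEMA.items():
            if str(df[col].dtype) != dtype:
                errors.append(f"column '{col}' is {df[col].dtype}, expected {dtype}")
    return errors

def check_not_empty(df: pd.DataFrame) -> list:
    # 3. Integrity: an automatically downloaded file may arrive empty.
    return ["input file is empty"] if df.empty else []
```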

For context, we have already created a very complex flow that does the above for data from one system we are integrating. But because of the effort and complexity involved, we would like to know whether these checks can be done in a repeatable way, by just configuring the rules for each system in EM and using them like plug-and-play.

We are also open to considering other software that could complement EM by first doing the data quality checks and then letting EM process the data.

Please share your experience and thoughts on this.

Thank you!

Regards,
Ashish

Hi Ashish,

You're bringing up an interesting topic. As I understand it, the main challenge is re-use and composability, i.e. being able to chain pre-designed data quality checks and produce a summary of all checks. Is that correct? Or are you after the ability to perform specific checks more easily?

Hi Dmitry,

Yes, I am looking for the reusable option that can be configured for each data source.

But to begin with, either option would be good.

Thank you!

We can add a new action, "Call with another table", that should simplify applying re-usable data quality checks. The action would call another (re-usable) project and pass another table to it. The project does whatever data quality check you design and returns a list of errors in whatever format you prefer.

You will be able to put several "Call with another table" actions in a row, where each call does a different data quality check. Each "Call with another table" action will append its result to the result of the previous action, thus building up a list of errors returned by each check. In the end, you will either have a list of errors or an empty list, and you can decide what to do with it next. Basically, you take a table, pass it through a pipeline of checks, and collect the results into one table.

Something like in the sketch below:

With this action, it should be possible to re-use data validation logic and build data quality check pipelines.
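
In code terms, the pattern would behave roughly like the Python sketch below (purely an analogy for the action's behaviour, not EasyMorph syntax; the check function is a hypothetical placeholder):

```python
# Conceptual analogy for the "Call with another table" pipeline:
# each check receives the table, returns a table of errors (possibly
# empty), and the results are appended into one cumulative error list.
from typing import Callable, List
import pandas as pd

Check = Callable[[pd.DataFrame], pd.DataFrame]  # returns an error table

def run_checks(table: pd.DataFrame, checks: List[Check]) -> pd.DataFrame:
    """Pass one table through a pipeline of checks and concatenate
    each check's errors onto the results of the previous checks."""
    results = [check(table) for check in checks]
    if not results:
        return pd.DataFrame(columns=["check", "error"])
    return pd.concat(results, ignore_index=True)

# Hypothetical example check: flag an empty input table.
def check_not_empty(table: pd.DataFrame) -> pd.DataFrame:
    if table.empty:
        return pd.DataFrame([{"check": "not_empty", "error": "input table is empty"}])
    return pd.DataFrame(columns=["check", "error"])

# errors = run_checks(df, [check_not_empty, ...]); an empty result
# means all checks passed, otherwise route the error list downstream.
```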

How does it sound to you?


And maybe call data quality checks, business/data transformation rule checks, etc. from the catalog?

@dgudkov That sounds interesting. But it would be great to see an example in action, if possible.

@jcaseyadams - Thank you for the reply! Can you please elaborate a bit or provide pointers, as the Catalog is new to me?

Regards,
Ashish

@ashish_jain Here you go: EasyMorph | Data Catalog basics
