Data validation, profiling, and quality checks using EM

Hi there,

I would like to know whether, besides creating our own logic using expressions and formulas and selecting options in the context menu of each action, there is an easier way to do the following data quality checks in EM (a rough sketch of what I mean follows the list):

  1. data profiling (where the profile can be stored as a result or used downstream in the flow)
  2. data structure checks for incoming files (e.g. that the number of columns and the data types remain the same)
  3. data integrity checks (e.g. detecting an empty input file, which happens with files downloaded automatically from other systems)
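
To illustrate, here is a rough sketch of these three checks in Python/pandas (just to convey the idea; the expected schema and function names are made up, and this is not EasyMorph functionality):

```python
# A rough sketch (Python/pandas, not EasyMorph) of the three checks;
# the expected schema below is a made-up example.
import pandas as pd

EXPECTED_SCHEMA = {"id": "int64", "amount": "float64", "region": "object"}

def profile(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Profiling: per-column null counts, distinct counts, and types,
    #    returned as a table so it can be stored or used downstream.
    return pd.DataFrame({
        "nulls": df.isna().sum(),
        "distinct": df.nunique(),
        "dtype": df.dtypes.astype(str),
    })

def check_structure(df: pd.DataFrame) -> list:
    # 2. Structure: same columns and same data types as expected.
    errors = []
    if list(df.columns) != list(EXPECTED_SCHEMA):
        errors.append(f"unexpected columns: {list(df.columns)}")
    else:
        for col, dtype in EXPECTED_SCHEMA.items():
            if str(df[col].dtype) != dtype:
                errors.append(f"column '{col}' is {df[col].dtype}, expected {dtype}")
    return errors

def check_not_empty(df: pd.DataFrame) -> list:
    # 3. Integrity: an automatically downloaded file may arrive empty.
    return ["input file is empty"] if df.empty else []
```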

For context, we have already created a very complex flow that does the above for data from one system we are integrating. But because of the effort and complexity involved, we would like to know whether these checks can be done in a repeatable way, by just configuring the rules for each system in EM and using them like plug-and-play.

We are also open to considering other software that could complement EM by first doing the data quality checks and then letting EM process the data.

Please share your experience and thoughts on this.

Thank you!

Regards,
Ashish

Hi Ashish,

You're bringing up an interesting topic. As I understand it, the main challenge is re-use and composability, i.e. being able to chain pre-designed data quality checks and produce a summary of all checks. Is that correct? Or are you after the ability to perform specific checks more easily?

Hi Dmitry,

Yes, I am looking for the reusable option that can be configured for each data source.

But to begin with, either option would be good.

Thank you!

We can add a new action, "Call with another table", that should simplify applying re-usable data quality checks. The action would call another (re-usable) project and pass another table to it. The project does whatever data quality check you design and returns a list of errors in whatever format you prefer.

You will be able to put several "Call with another table" actions in a row, where each call does a different data quality check. Each "Call with another table" action will append its result to the result of the previous action, thus building up a list of errors returned by each check. In the end, you will either have a list of errors or an empty list, and you can decide what to do with it next. Basically, you take a table, pass it through a pipeline of checks, and collect the results into one table.

Something like in the sketch below:

With this action, it should be possible to re-use data validation logic and build data quality check pipelines.
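
In code terms, the pattern would behave roughly like the Python sketch below (purely an analogy for the action's behaviour, not EasyMorph syntax; the check function is a hypothetical placeholder):

```python
# Conceptual analogy for the "Call with another table" pipeline:
# each check receives the table, returns a table of errors (possibly
# empty), and the results are appended into one cumulative error list.
from typing import Callable, List
import pandas as pd

Check = Callable[[pd.DataFrame], pd.DataFrame]  # returns an error table

def run_checks(table: pd.DataFrame, checks: List[Check]) -> pd.DataFrame:
    """Pass one table through a pipeline of checks and concatenate
    each check's errors onto the results of the previous checks."""
    results = [check(table) for check in checks]
    if not results:
        return pd.DataFrame(columns=["check", "error"])
    return pd.concat(results, ignore_index=True)

# Hypothetical example check: flag an empty input table.
def check_not_empty(table: pd.DataFrame) -> pd.DataFrame:
    if table.empty:
        return pd.DataFrame([{"check": "not_empty", "error": "input table is empty"}])
    return pd.DataFrame(columns=["check", "error"])

# errors = run_checks(df, [check_not_empty, ...]); an empty result
# means all checks passed, otherwise route the error list downstream.
```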

How does it sound to you?


And maybe call data quality checks, business/data transformation rule checks, etc. from the catalog?

@dgudkov That sounds interesting. But it would be great to see an example in action, if possible.

@jcaseyadams - Thank you for the reply! Can you please elaborate a bit or provide pointers, as the Catalog is new to me?

Regards,
Ashish

@ashish_jain Here you go: EasyMorph | Data Catalog basics
