Besides creating our own logic with expressions and formulas and selecting options in each action's context menu, is there an easier way to do the data quality checks below in EM?
data profiling (with results that can be stored or used downstream in the flow)
data structure checks for incoming files (e.g. the number of columns and their data types staying the same)
data integrity checks (e.g. detecting an empty input file, which happens with files downloaded automatically from other systems)
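To make the intent concrete, here is a rough Python sketch of the three kinds of checks (the expected schema and column names are made up for illustration; we want the equivalent as configurable, per-system rules in EM):

```python
import csv
import io

# Hypothetical expected schema for one source system (assumption for illustration)
EXPECTED_COLUMNS = ["id", "amount", "date"]

def check_file(text):
    """Run the three checks on a CSV payload; return a list of error strings."""
    errors = []
    rows = list(csv.reader(io.StringIO(text)))

    # Integrity: empty file (common with automatically downloaded exports)
    if not rows:
        errors.append("file is empty")
        return errors

    header, data = rows[0], rows[1:]

    # Structure: same number and names of columns as expected
    if header != EXPECTED_COLUMNS:
        errors.append(f"unexpected columns: {header}")

    # Profiling: row count and per-column fill rate, could be stored or used downstream
    profile = {
        "rows": len(data),
        "fill_rate": {
            col: sum(1 for r in data if i < len(r) and r[i].strip()) / max(len(data), 1)
            for i, col in enumerate(header)
        },
    }
    print(profile)
    return errors
```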
For context: we have already built a very complex flow that does the above for data from one system we are integrating. Given the effort and complexity involved, we would like to know whether these checks can be made repeatable, i.e. configure the rules once per system in EM and reuse them plug-and-play.
We are also open to other software that could complement EM by doing the data quality checks first, with EM then processing the data.
Please share your experience and thoughts on this.
You're bringing up an interesting topic. As I understand it, the main challenge is reuse and composability, i.e. the ability to chain pre-designed data quality checks and produce a summary of all checks. Is that correct? Or are you looking for ways to perform specific checks more easily?
We could add a new action, "Call with another table", that should simplify applying reusable data quality checks. The action would call another (reusable) project and pass a table to it. That project does whatever data quality check you design and returns a list of errors in whatever format you prefer.
You would be able to put several "Call with another table" actions in a row, where each call performs a different data quality check. Each action appends its result to the result of the previous one, building up a list of errors returned by each check. In the end you have either a list of errors or an empty list, and can decide what to do with it next. Essentially, you take a table, pass it through a pipeline of checks, and collect the results into one table.
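The pattern described above, passing one table through a pipeline of checks and collecting all results into one error table, can be sketched in Python (a hypothetical illustration of the chaining idea only; the actual EM actions are configured in the UI, and the check functions below are made up):

```python
from typing import Callable, Dict, List

Table = List[Dict[str, str]]           # a table as a list of rows
Check = Callable[[Table], List[dict]]  # each check returns a list of error rows

def no_empty_amounts(table):
    """One reusable check: flag rows with a blank 'amount' value."""
    return [{"rule": "amount not empty", "row": i}
            for i, row in enumerate(table, start=1) if not row.get("amount")]

def positive_amounts(table):
    """Another reusable check: flag non-positive amounts."""
    return [{"rule": "amount > 0", "row": i}
            for i, row in enumerate(table, start=1)
            if row.get("amount") and float(row["amount"]) <= 0]

def run_pipeline(table: Table, checks: List[Check]) -> List[dict]:
    """Apply each check in turn, appending its errors to one combined list."""
    errors: List[dict] = []
    for check in checks:
        errors.extend(check(table))  # each call appends to the previous result
    return errors
```

An empty result means the table passed every check in the pipeline.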
We addressed the challenge of reusable data validations in the recently released version 5.8.1:
We've added the "Verify data in another table" action. It contains about 40 built-in checks for column values and column names, plus unlimited custom checks using expressions. The action verifies data in another table and returns a list of violated rules, together with the column name and row number where each rule was first violated.
The action can be chained: you can put multiple "Verify" actions one after another and accumulate the verification results in one table.
The "Verify" action is intended for quick, simple checks. For more complex custom checks there is another new action, "Call with another table". It calls another module (or project) with another table and, when chained, can also append its results to one table.
The "Call with another table" action can be mixed with "Verify" actions in the same data validation pipeline; just make sure that "Call with another table" returns columns with the same names as the "Verify" action.
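The column-name requirement can be illustrated with a tiny sketch (hypothetical names; they mirror the rule / column / row output described for "Verify" above, just to show why both kinds of checks must emit the same columns to be appendable):

```python
# Both kinds of checks must emit the same columns so their results
# can be appended into one error table.
VERIFY_COLUMNS = ("rule", "column", "row")

def verify_result(rule, column, row):
    """An error row as a built-in 'Verify'-style check would produce it."""
    return dict(zip(VERIFY_COLUMNS, (rule, column, row)))

def custom_check_result(rule, column, row):
    """A custom module must return the same column names to be appendable."""
    return dict(zip(VERIFY_COLUMNS, (rule, column, row)))

combined = [verify_result("not empty", "amount", 2),
            custom_check_result("valid date", "date", 5)]

# All rows share the same keys, so they append cleanly into one table.
assert all(tuple(r) == VERIFY_COLUMNS for r in combined)
```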