When it comes to data preparation one of the main rules is “garbage in – garbage out”. It means that no matter how correct your data transformation logic is, if the input data is inconsistent then the result of your data preparation workflow can also be inconsistent. In the worst case, it could be incorrect in a way that is not easily observable which would make wrong data propagate to end-user reports and dashboards and cause misleading conclusions (and decisions) before the error is noticed and its cause is identified.
There typically are two ways to deal with data quality problems:
- Fix data
- Abort data preparation workflow
Fixing data problems automatically is typically done when the problems are well know ahead of time and they appear frequently. For instance, if it’s well known that in the source data New York would be sometimes designated as just “NY” or “NYC”.
Aborting data preparation workflow is done when data problems are not generally expected but nevertheless still possible, so it is necessary to verify assumptions about source data in order to guarantee correctness of the result. However, since the probability of an error is low it’s not reasonable to spend time designing data fixing for every possible case that can go wrong.
In an ideal world where everyone has unlimited time, budget and never gets bored, all data preparation processes would have data fixing logic for all possible cases. But since we’re living in a world that is slightly less than ideal, aborting workflows is a reasonable compromise.
The “Halt on Condition” action
In EasyMorph, the “Halt on Condition” action is used to abort project execution when a certain condition is not met.
The action itself is pretty simple. Just specify a condition to check, and optionally provide a custom error message. The screenshot below shows the “Halt on condition” action that would stop project execution if in column “city” in at least one row there is no value. In this case the project will stop and error “A city must be specified!” will be shown to the user or logged in the Server task log.
Do not create too complex conditions in one “Halt…” action. It’s better to show specific error messages for various conditions. Therefore use several “Halt…” actions with simple conditions and different specific error messages, rather than one action with complex condition and a generic error message.
Complex data quality checks
Using an expression in the “Halt…” action helps detecting rather simple data quality problems. For instance:
- Empty values
- Numbers that fall outside of reasonable/acceptable range (e.g. negative age of a person)
- Wrong data types (e.g. numbers in a column where only text values were expected, or text values that look like numbers)
- Unacceptable text values (e.g. with line breaks, trailing spaces, system characters, etc.)
However, it is possible to perform data quality checks of virtually any complexity using a combination of the “Halt…” action and other transformations.
In this case, typically a derived table is derived in which certain calculations are performed and then the “Halt…” action is applied. A few examples:
Check for duplicates
In a tabled derived from the main table:
- Action “Keep duplicates”
- Action “Halt on condition”, mode “Halt if table is NOT empty”
Check for new values
An additional reference table with all possible (acceptable) values in the column of interest.
In a tabled derived from the main table:
- Action “Keep mismatching”
- Action “Halt on condition”, mode “Halt if table is NOT empty”
Check for mandatory values
An additional reference table with all mandatory values required in the column of interest. In that table:
- Action “Keep mismatching” that removes all values found in the column of interest in the main table
- Action “Halt on condition”, mode “Halt if table is NOT empty”
Check that no dates omitted
Assuming that the data is time series where each row has a distinct date in ascending order. (If not, then deduplicate dates and sort them in ascending order).
In a table derived from the main table:
- Action “Shift column” that created a new column [Previous date] with dates shifted down by 1 row
- Action “Calculated new column” that calculates [Date diff] = [Date] - [Previous date]
- Action “Trim table” that removes the top 1 row (for it the previous date is always empty)
- Action “Halt on condition” with condition [Date diff] > 1
Synchronization of data quality checks
Data quality checks must be performed before the result of data preparation is exported for further use in a report or a dashboard. Therefore, you have to ensure that the exporting always happens ONLY when all data quality checks are done. In case of simple checks, when “Halt…” actions are inserted into the main transformation chain this is not a problem. However, if you perform complex data quality calculations in separate derived tables you may run into a case when exporting happened before data quality checks are finished. This can happen because EasyMorph always tries to calculate in parallel all tables derived from the same table. Because such derived tables are calculated in parallel, sometimes one derived table is calculated faster, sometimes another. While it’s good for overall performance, it may lead to nondeterministic behavior.
In order to synchronize all data quality checks before exporting, use the “Synchronize” action to synchronize calculations in several derived tables. This action simply waits until another table is calculated.
In the example below, two data quality checks are made in derived tables, which are then synchronized before exporting data. Without the synchronization, exporting could have started before the checks are finished.
Parameter verification
The “Halt on condition” action can also be used to verify project parameters. If you have a project with parameters that is used by other people, it may be a good idea to built in parameter verification.
One way to do it, is to create a table with parameter values using the “Parameter table” action, and then add a few “Halt on condition” actions to it to verify that project parameters are within a specified range of values, or don’t contradict each other.
It would make sense to perform all other calculations in the project only after parameter verification is passed. In the example below the main calculation didn’t start because parameter verification failed.
Hints and tips
The “Halt on condition” action is a powerful mechanism to create data quality guards and ensure consistent data quality in EasyMorph projects. The more people use EasyMorph projects in an organization the more reasons to start using data quality checks and verification in these projects.
Just a few more tips:
- Check data quality not only for source data, but also for results too
- It’s frequently a good idea to check data for key duplicates before doing a merge (join) and halt before merge if a duplicate is found
- If you want a project to always fail at certain point (e.g. while doing development/debugging) then insert “Halt…” with an impossible condition, e.g.
1=2
, or justfalse()
- Write custom error messages in a way that conveys the expectations, e.g. “There should be no omitted dates in column [Dispatched date]” or “A new value is detected in [Region] but not expected”. In this case the end users will know exactly what to do to resolve the problem.