Data quality is a huge problem for just about every business. Human beings make mistakes. Business systems don’t always validate the data on entry. Data transformation logic might not account for every possible scenario in the data. There are so many factors that can lead to bad data.
If you participated in, or watched the recordings of, our recent series of webinars on Data Quality, you’ll already have heard us talk about what Data Contracts are. You’ll also have seen a preview of a “Toolkit” we’ve been putting together to demonstrate how you can perform checks in your EasyMorph workflows to identify whether the data meets a given Data Contract.
If you missed the webinar, here is the relevant part which I’d highly recommend you watch before diving further into this post.
As outlined in the webinar, we’ve been testing the Toolkit with a handful of customers over the last few months, and we’re ready to open it up to a wider audience. However, this is still a work in progress and there is a lot more we’d like to do to improve it going forward. The end goal would be for us to bake the functionality into EasyMorph so that anyone can easily define a Data Contract and test a dataset against it.
The main reason we want to make this available now is not only to test it more, but also to get your feedback, suggestions and improvements. The collective hive mind that is all of you will no doubt have many ideas we’ve never even considered.
The toolkit is a Zip file containing the following files:
- ExampleDataSet.dset - An example EasyMorph Dataset which contains example data quality problems. The dataset is based on the EasyMorph Inc5000 example that ships with EasyMorph Desktop.
- DataContract-Example.xlsx - An example data contract definition for the example dataset.
- TestDataContract.morph - The main brains of the toolkit. It can be called from any other EasyMorph project. Pass it a table and the path to the data contract file the table should be tested against, and it will perform all of the required checks and pass back a set of results. Think of it like a black box. You don’t need to understand how it works internally, just that it will test your data to see if it meets the requirements in the contract.
- Example-DataTest.morph - A simple project showing how you can test a set of data. It loads in the example dataset, passes it to the TestDataContract morph along with the example data contract file and shows you the results.
- Example-CustomCheck.morph - An example of a simple custom check which can be listed in the Data Contract and will then be called automatically by TestDataContract.morph.
You can download the Toolkit here: EM Data Quality Toolkit v0.4-Beta.zip (271.1 KB)
Currently, it is capable of the following types of data quality checks:
- Data Type - Number, Text, Date Number, Date Text, Timestamp Number, Timestamp Text and Any (AKA Mixed)
- Empty Values - is the field allowed to contain empty values or not?
- Uniqueness - are all the values expected to be unique or are duplicates allowed?
- Numbers within range - if the value is numeric, what range is permitted?
- Integers only - are numeric values expected to be integers only?
- Text length within range
- Dates within range (for both Date Number and Date Text)
- Values only in acceptable list - specify a list of acceptable values and check if any value exists in the data which is not in the list.
- Text format - check that values meet a required format defined using a regular expression (e.g. are my company codes always the correct format?)
- Custom checks - Build your own check .morph files to perform any bespoke checks you might need.
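If it helps to see the idea outside of EasyMorph, here is a rough Python sketch of what a handful of these checks mean in practice (empty values, uniqueness, text format, acceptable list, integers only and number range). To be clear, this is not how TestDataContract.morph works internally and it is not the layout of the Excel contract file. The `contract` keys, the sample rows and the `run_contract` function are all made up purely for illustration.

```python
import re

# Hypothetical contract: these keys are illustrative only and are NOT the
# column names used in DataContract-Example.xlsx.
contract = {
    "Company Code": {"allow_empty": False, "unique": True, "pattern": r"[A-Z]{3}-\d{4}"},
    "Employees": {"allow_empty": True, "integer_only": True, "min": 1, "max": 100000},
    "State": {"allow_empty": False, "allowed": {"CA", "NY", "TX"}},
}

# Made-up sample rows containing a few deliberate problems.
rows = [
    {"Company Code": "ABC-0001", "Employees": 25, "State": "CA"},
    {"Company Code": "ABC-0001", "Employees": 12.5, "State": "ZZ"},
    {"Company Code": "bad", "Employees": None, "State": ""},
]


def is_empty(value):
    """Treat None and the empty string as empty values."""
    return value is None or value == ""


def check_field(field, rule, rows):
    """Return a list of (row_number, field, problem) tuples for one field."""
    problems = []
    seen = set()
    for n, row in enumerate(rows, start=1):
        value = row.get(field)
        if is_empty(value):
            if not rule.get("allow_empty", True):
                problems.append((n, field, "empty value"))
            continue  # the remaining checks only apply to non-empty values
        if rule.get("unique"):
            if value in seen:
                problems.append((n, field, "duplicate value"))
            seen.add(value)
        if "pattern" in rule and not re.fullmatch(rule["pattern"], str(value)):
            problems.append((n, field, "bad format"))
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append((n, field, "value not in list"))
        if rule.get("integer_only") and float(value) != int(value):
            problems.append((n, field, "not an integer"))
        if "min" in rule and value < rule["min"]:
            problems.append((n, field, "below minimum"))
        if "max" in rule and value > rule["max"]:
            problems.append((n, field, "above maximum"))
    return problems


def run_contract(rows, contract):
    """Run every field rule in the contract and collect all problems."""
    results = []
    for field, rule in contract.items():
        results.extend(check_field(field, rule, rows))
    return results


results = run_contract(rows, contract)
for problem in results:
    print(problem)
```

The shape is the same as the Toolkit's: data and a contract go in, and a list of problems comes out, so the caller never needs to know how each individual check is performed.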
Watch the video below, which explains exactly how to use the sheets with data contracts to define data quality checks:
As stated, this is very much an early beta release and so documentation is a little lacking. There are, however, lots of notes in the example data contract Excel file, as well as plenty of comments and annotations in the morph files, which will hopefully help you to understand how you can use it. And of course, if you get stuck, come ask here on the EasyMorph Community and we’ll point you in the right direction.
If you’d like to contribute to improving the Toolkit then we’d love your input. Whether that be sharing your own custom check morphs which others might find useful (e.g. validating US zip codes), making improvements and changes to the “black box” to implement new types of checks, or just suggestions for what you’d like to see added - we’d love to see them.
We’ve lots of ideas for how we can continue to improve the Toolkit over the coming months and I’ll share these with you all soon.
Happy bad data hunting
Matt

