The EasyMorph Data Quality Toolkit (and data contracts)

Data quality is a huge problem for just about every business. Human beings make mistakes. Business systems don’t always validate the data on entry. Data transformation logic might not account for every possible scenario in the data. There are so many factors that can lead to bad data.

If you participated in or watched the recordings of our recent series of webinars on Data Quality, you’ll already have seen us talk about what Data Contracts are and seen a preview of a “Toolkit” we’ve been putting together to demonstrate how you can perform checks in your EasyMorph workflows to identify if the data meets a given Data Contract.

If you missed the webinar, here is the relevant part which I’d highly recommend you watch before diving further into this post.

As outlined in the webinar, we’ve been testing the Toolkit with a handful of customers over the last few months, and we’re ready to open it up to a wider audience. However, this is still a work in progress and there are lots more we’d like to do to improve it going forward. The end goal would be for us to bake the functionality into EasyMorph so that anyone can easily define a Data Contract and test a dataset against it.

The main reason we want to make this available now is to not only test it more, but also to get your feedback, suggestions and improvements. The value of the collective hive mind that is all of you will no doubt have many ideas we’ve never even considered.

The toolkit is a Zip file containing the following files:

  • ExampleDataSet.dset - An example EasyMorph Dataset which contains example data quality problems. The dataset is based on the EasyMorph Inc5000 example that ships with EasyMorph Desktop.
  • DataContract-Example.xlsx - An example data contract definition for the sample data set
  • TestDataContract.morph - The main brains of the toolkit. It can be called from any other EasyMorph project, be passed a table and the path to the data contract file it should be tested against and it will perform all of the required checks and pass back a set of results. Think of it like a black box. You don’t need to understand how it works internally, just that it will test your data to see if it meets the requirements in the contract.
  • Example-DataTest.morph - A simple project showing how you can test a set of data. It loads in the example dataset, passes it to the TestDataContract morph along with the example data contract file and shows you the results.
  • Example-CustomCheck.morph - An example of performing a simple custom check which can then be listed in the Data Contract and which will then be called by the CheckDataContract.morph automatically

You can download the Toolkit here: EM Data Quality Toolkit v0.4-Beta.zip (271.1 KB)

Currently, it is capable of the following types of data quality checks:

  • Data Type - Number, Text, Date Number, Date Text, Timestamp Number, Timestamp Text and Any (AKA Mixed)
  • Empty Values - is the field allowed to contain empty values or not?
  • Uniqueness - are all the values expected to be unique or are duplicates allowed?
  • Numbers within range - if the value is numeric, what range is permitted?
  • Integers only - are numeric values expected to be integers only
  • Text length within range
  • Dates within range (for both Date Number and Date Text)
  • Values only in acceptable list - specify a list of acceptable values and check if any value exists in the data which is not in the list.
  • Text format - check that values meet a required format defined using a regular expression (e.g. are my company codes always the correct format?)
  • Custom checks - Build your own check .morph files to perform any bespoke checks you might need.

Watch the video below that explains how exactly use the sheets with data contracts to define data quality checks:

As stated, this is very much an early beta release and so documentation is a little lacking. There are however lots of notes in the example data contract Excel file which will hopefully help. There are also a lot of comments and annotations in the morph files which will hopefully help you to understand how you can use it. And of course, if you get stuck, come ask here on the EasyMorph Community and we’ll point you in the right direction.

If you’d like to contribute to improving the Toolkit then we’d love your input. Whether that be sharing your own custom check morphs which others might find useful (e.g. validating US zip codes), making improvements and changes to the “black box” to implement new types of checks, or just suggestions for what you’d like to see added - we’d love to see them.

We’ve lots of ideas for how we can continue to improve the Toolkit over the coming months and I’ll share this with you all soon.

Happy bad data hunting

Matt

5 Likes

This functionality is really interesting to me. I started testing today and it seems to work well. I was concerned about entering and maintaining long lists of items like state codes in the 'Possible Values' column. We store validation data in our SQL Server and it can change. What I came to realize is that I could embed a Power Query in the contract workbook to populate the Possible Values with a delimited list. Since we are working with Excel users for the maintenance of the Contract workbook, I thought I would share how I went about it:

  1. Create a Power Query to select the state data as connection only
  2. Create a reference query and then use Text.Combine to convert the column to a single string of data like this: = Text.Combine(Table.Column(Query1,"two_letter_code"), "|")
  3. Output the resulting single row table to a new worksheet in the DataContract workbook.
  4. In the Fields worksheet, use a formula to populate the Possible Values cell from the Table on the new worksheet: =INDEX(StateShort[StateShort],1)

I tried this with state_short and state_long and it worked well.

A periodic refresh will help to keep the Possible Values that are based on Power Queries current.

John

1 Like

Hi John
Glad you are finding it useful.

That's an interesting point I hadn't really considered and a clever solution. You may want to add the "Excel command" action to recalculate the data contract Excel file just before you call the data contract check. That way you can be sure the fields sheet will have been refreshed with the latest list of possible value.

Normally this shouldn't be needed but Excel can sometimes be funny about when it recalculates everything and I've had occasions where an Excel has been modified but it hasn't all the dependant calculations.

Another way to tackle long and changing lists of values would be to create a custom check to handle getting the list of possible values and then checking the data against them.

I wouldn't say either way is better than the other, so its a case of whatever works best for you.

Regards
Matt

1 Like

Great ideas Matt!

I gave the Excel Command - Recalculate a quick test.

To make it work, I had to make the queries refresh when opening the file.

To do that, Data -> Queries & Connections. In the pane that opens on the right side, right click on the query for the query properties.

I set my query properties as shown:

Really nice!

John

1 Like

Good point!

Looking forward to more feedback and ideas as you continue or experiment

Just sharing a few regular expressions which I've created to test certain things which I thought others may benefit from:

  • Email address - ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
  • URL - ^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$
  • GUID - ^(?:\\{{0,1}(?:[0-9a-fA-F]){8}-(?:[0-9a-fA-F]){4}-(?:[0-9a-fA-F]){4}-(?:[0-9a-fA-F]){4}-(?:[0-9a-fA-F]){12}\\}{0,1})$
  • UUID - ^[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[089ab][0-9a-f]{3}-[0-9a-f]{12}$
  • IP address - ^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
  • US Zip Code (5 or 9 character) - ^[0-9]{5}(?:-[0-9]{4})?$

These can be put in the Data Contract Excel file in the "Regex Format" column.

4 Likes

A video with the Data Quality Tookit walkthrough has been added to the opening post.

1 Like