Data quality check

RRB · May 5, 2021, 12:36pm

Hi,

Sorry this might seem like a silly question but I am not able to figure it out.

I am looking for ways to check whether the data received today is similar to the data received yesterday.

To elaborate - we receive several Excel files (.csv, .xlsx, etc.) with 30-40 columns and thousands of rows. I understand that the data received today won’t be exactly the same as the data received yesterday. What I am trying to achieve is to have some checks in place that tell us if any of the columns we would normally expect are missing, or if the total number of records has changed drastically compared to the day before.

Is there any way to do that? If anyone could just guide me on this, that’d be really appreciated.

Thank you.
R

dgudkov · May 5, 2021, 1:00pm

You can start with clear definitions of checks. A clear definition is usually a result of answering questions, for instance:

How would you define a missing column? For example, is it a column name that was present yesterday but not today, or something else? Must the column name have exactly the same spelling, or can there be slight variations? Do you have optional columns that can be missing and that's OK?

What's "drastically" here? Is a 10% change drastic or not? How about 5% or 20%?

And so on.

Once you have the rules clearly defined, make sure that the data required for the rules (e.g. column names, row counts, max values) is stored somewhere you can access it on the next load.
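To illustrate what "storing the data required for the rules" could look like, here is a rough sketch in Python rather than EasyMorph. It assumes the incoming file is a CSV with a header row and that a small JSON file is an acceptable place to keep yesterday's profile. The function names (profile_csv, save_profile, load_profile) are made up for this example.

```python
import csv
import io
import json


def profile_csv(text):
    """Return the column names and data row count of CSV text.

    Assumes the first row is the header. Blank lines after the
    header are counted as rows by csv.reader, so clean them first
    if that matters for your files.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])          # empty list if the file is empty
    row_count = sum(1 for _ in reader)
    return {"columns": header, "row_count": row_count}


def save_profile(profile, path):
    """Store the profile so the next load can be compared against it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(profile, f)


def load_profile(path):
    """Read back a profile written by save_profile."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The same idea applies to any other value a rule needs, such as the max of a numeric column: compute it on each load and store it next to the column names.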

Finally, use actions and expressions in EasyMorph to construct the checks according to the definitions and using the stored data from the previous load.
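For anyone who wants to see the logic of such checks spelled out, below is a rough sketch in Python, not EasyMorph actions. Everything in it is an assumption made for illustration: a profile is a dict with "columns" and "row_count", column names are matched case-insensitively after trimming spaces, and "drastic" means a row count change of more than 10%. Adjust these to whatever your own definitions turn out to be.

```python
def compare_profiles(previous, current, max_row_change=0.10, optional_columns=()):
    """Compare today's profile with the previous one and list problems.

    previous / current: dicts like {"columns": [...], "row_count": int}
    max_row_change: allowed relative change in row count (0.10 = 10%)
    optional_columns: columns that may be absent without raising a problem
    """
    problems = []

    # Normalize names so "ID " and "id" count as the same column.
    prev_cols = {c.strip().lower() for c in previous["columns"]}
    curr_cols = {c.strip().lower() for c in current["columns"]}
    optional = {c.strip().lower() for c in optional_columns}

    missing = sorted(prev_cols - curr_cols - optional)
    if missing:
        problems.append("missing columns: " + ", ".join(missing))

    prev_rows = previous["row_count"]
    curr_rows = current["row_count"]
    if prev_rows == 0:
        # A relative change is undefined, so report any non-empty load.
        if curr_rows != 0:
            problems.append(f"row count changed from 0 to {curr_rows}")
    else:
        change = abs(curr_rows - prev_rows) / prev_rows
        if change > max_row_change:
            problems.append(
                f"row count changed by {change:.0%} ({prev_rows} -> {curr_rows})"
            )

    return problems


yesterday = {"columns": ["ID", "Name", "Amount"], "row_count": 1000}
today = {"columns": ["id", "name"], "row_count": 700}
for problem in compare_profiles(yesterday, today):
    print(problem)
```

An empty result means the load passed both checks. Columns that are new today are deliberately ignored here; if that should also raise a flag, the set difference can be taken in the other direction as well.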

RRB · May 5, 2021, 1:16pm

Hi @dgudkov

Ah ok, thank you for explaining that.

I’ll try that and let you know how it goes.

Thank you so much.
R