Often we do not want to stop the ETL flow because of bad data quality. Some books suggest tagging the bad data in the data warehouse so that the business can review and correct it.
Any ideas on how to create a central table in the data warehouse that could store all sorts of quality checks?
Typically we also want to run checks at several points, e.g. just after import from our staging layer, but also after important transformations/calculations.
Types of checks:
Checks on a single column:
- Check if the data type is correct
- Check if column values fall within a certain range
- Check if the format is correct (e.g. dates)
- Check for invalid values in a column
- More complex rules proposed by the business, e.g.:
  - Sum of amounts
  - Average of amounts
  - Row counts before and after merges
I am looking for a good table structure in the data warehouse that could store all of these checks and work for all datasets.
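To make the question concrete, here is a minimal sketch of one possible generic check-result table, with one row per executed check. All names (`dq_check_result` and its columns) are hypothetical, and expected/actual values are stored as text to stay dataset-agnostic; shown here with Python's built-in sqlite3, but the DDL translates to any warehouse:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical central check-result table: one row per executed check.
# Table and column names are assumptions for illustration, not a standard.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dq_check_result (
        check_run_id   INTEGER PRIMARY KEY,
        check_name     TEXT NOT NULL,    -- e.g. 'amount_in_range'
        check_type     TEXT NOT NULL,    -- 'datatype', 'range', 'format', 'sum', 'rowcount', ...
        checkpoint     TEXT NOT NULL,    -- 'post_staging', 'post_transform', ...
        target_table   TEXT NOT NULL,
        target_column  TEXT,             -- NULL for table-level checks such as rowcounts
        expected_value TEXT,             -- stored as text to stay generic across checks
        actual_value   TEXT,
        passed         INTEGER NOT NULL, -- 0 = failed, 1 = passed
        failed_row_key TEXT,             -- business key of the offending row, if any
        run_ts         TEXT NOT NULL
    )
""")

# Example: a range check on sales.amount failed for one row after staging import.
conn.execute(
    "INSERT INTO dq_check_result "
    "(check_name, check_type, checkpoint, target_table, target_column, "
    " expected_value, actual_value, passed, failed_row_key, run_ts) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("amount_in_range", "range", "post_staging", "sales", "amount",
     "0 <= amount <= 100000", "-12.50", 0, "order_4711",
     datetime.now(timezone.utc).isoformat()),
)

# Review query for the business: all failed checks, newest first.
failed = conn.execute(
    "SELECT check_name, target_table, failed_row_key "
    "FROM dq_check_result WHERE passed = 0 ORDER BY run_ts DESC"
).fetchall()
```

The idea is that the `checkpoint` column covers the "checks at several points" requirement, while the nullable `target_column` lets the same table hold both column-level checks and table-level checks like rowcounts.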
Any ideas or experiences with how organizations typically handle this?