Use AI to clean data

Hi,

We often see that a lot of time is spent cleaning values that should be identical but are not in the source data, for example company names, company numbers, etc.
A lot of manual replacements are made when there is no API available to clean this, or when we do not have the authority to require the source to clean it. Generating regex expressions is an option, but even then it is very difficult to cover a good percentage of the cases.

So it would be great if some text cleaning actions that use AI under the hood could be made available to cope with new cases.

Also, a regex generator for the non-technical user could be an interesting action. By providing some sample data and selecting the parts of the text in several records that the user is interested in, a regex could be generated.
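For illustration, a very naive version of such a generator could look like this in Python (just a sketch I'm making up here; a real action would need much smarter generalization):

```python
import re

def infer_pattern(samples):
    """Infer a naive regex from example substrings the user selected.
    Digits become \\d+, letters become [A-Za-z]+, everything else is
    kept as an escaped literal."""
    token_pattern = re.compile(r"\d+|[A-Za-z]+|.")
    generalized = []
    for sample in samples:
        parts = []
        for token in token_pattern.findall(sample):
            if token.isdigit():
                parts.append(r"\d+")
            elif token.isalpha():
                parts.append(r"[A-Za-z]+")
            else:
                parts.append(re.escape(token))
        generalized.append("".join(parts))
    # If all samples generalize to the same pattern, use it; otherwise combine them
    unique = sorted(set(generalized))
    return unique[0] if len(unique) == 1 else "|".join(unique)

# Example: the user highlights company registration numbers in a few records
samples = ["BE 0123.456.789", "BE 0987.654.321"]
pattern = infer_pattern(samples)
print(pattern)                                          # e.g. [A-Za-z]+\ \d+\.\d+\.\d+
print(bool(re.fullmatch(pattern, "BE 0555.123.987")))   # True
```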

Any ideas on these types of actions?

1 Like

There is a lot of hype about AI these days, so we take a rather careful approach to it. The problem with AI is that it's not deterministic: you don't know how it operates, so you can never be sure what the result will be. ETL, on the other hand, must be deterministic. There is no place for hallucinations in ETL (or in finance, btw). So I don't think it's a good idea to use AI for unsupervised, automated data cleansing. Nevertheless, AI can be used as a helper in a highly curated workflow that assumes manual intervention by a qualified human.

I don't think it's a good idea. I've seen a Tableau demo where they suggest that AI should generate a regex, and, in my opinion, it demonstrates a lack of understanding of how AI can be used, even among large IT vendors.

If a user doesn't understand regex, s/he won't understand how the AI-generated regex works, and that's a straight road to bugs and errors, especially once the data changes slightly. Generative AI is only good when the user is qualified enough to detect when the AI generates bullshit. Otherwise, the consequences will be disastrous. Promoting AI-generated regex to people who don't understand it is nothing short of irresponsible and is clearly done for the wow effect and click generation.
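For example (an illustration of my own, not something from the Tableau demo), a perfectly plausible AI-generated regex breaks silently the moment the data format drifts, and a user who can't read it will never notice:

```python
import re

# A regex an AI might plausibly produce for "match the invoice date"
pattern = re.compile(r"\d{2}/\d{2}/\d{4}")

print(pattern.search("Invoice 123, dated 05/01/2023"))   # matches as expected
print(pattern.search("Invoice 124, dated 2023-01-05"))   # None: the format changed,
                                                          # so these rows are silently dropped
```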

AI can be helpful for data classification, e.g. detecting language or sentiment. This can be helpful in automation rather than in data transformation, and it can potentially work well with boards and issues, as they allow including a human in the automation loop. We're going to add a simple OpenAI/ChatGPT integration soon to see where it goes.
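To be clear about what I mean by classification rather than transformation, conceptually it would do something like this (a rough Python sketch; the model name and prompt are just placeholders, not the actual action):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; use whatever is available
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the user's text as "
                        "positive, negative or neutral. Answer with one word."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

label = classify_sentiment("The delivery was late and the packaging was damaged.")
print(label)  # expected: "negative", but a human still reviews before acting on it
```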

3 Likes

I get this a lot from clients, "Can't you use AI to clean/fix/add the data?"

I use EasyMorph to find the issues and export them, but it's a very dangerous road to ask AI to clean/update data.

If you are certain that some things can be updated because of typos, capitalization, etc., then this can be done in EasyMorph (and other tools), but relying on AI to intelligently add or correct data is asking for trouble, particularly (and ironically) at scale at the moment. How do you check at scale that the AI has got it right? And who is liable if it hasn't?

There's also the issue of data protection/privacy, as the data has to be sent to the AI provider - what actually happens to that data?

1 Like

Let me be clear: it was not my aim to let AI loose and not check its output.
The aim was to clean the values with AI and, for example, store the result in a separate column so that it can be compared with the source data.
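Something like this, conceptually (a pandas sketch; ai_clean is a placeholder for whatever AI call would be used):

```python
import pandas as pd

def ai_clean(value: str) -> str:
    """Placeholder for an AI-based cleaning call (e.g. a ChatGPT prompt).
    Here it just normalizes whitespace and a common suffix, for illustration."""
    return " ".join(value.split()).replace("Limited", "Ltd")

df = pd.DataFrame({"company_name": ["Acme  Limited", "Acme Ltd", "ACME ltd."]})

# Keep the AI suggestion in a separate column instead of overwriting the source
df["company_name_ai"] = df["company_name"].map(ai_clean)
df["needs_review"] = df["company_name"] != df["company_name_ai"]

print(df)  # a human reviews the rows where needs_review is True
```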

I can only observe that businesses spend a lot of time cleaning values. Ideally it is tackled at the source, but we do not always have control over the source, or they are not willing to clean their own databases. And every year new types of cases emerge and the flow has to be updated, which is not something we want to do every time.

I have used ChatGPT to generate some regex expressions for me and then tested the output in https://regexr.com/, and it worked pretty well. One should also ask ChatGPT the right question...
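For example (an illustrative prompt/result pair, not the exact regex I generated), ChatGPT can produce something like this, which is then easy to verify on regexr.com or in Python:

```python
import re

# Prompt: "Give me a regex that strips common legal suffixes from company names."
suffix_pattern = re.compile(r"\s*\b(ltd|limited|inc|gmbh|bv|nv|sa)\.?\s*$", re.IGNORECASE)

for name in ["Acme Limited", "Acme Ltd.", "Acme GmbH", "Acme"]:
    print(suffix_pattern.sub("", name))   # prints "Acme" in all four cases
```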

1 Like

I understand that, but see my other reaction to Dmitry. You should not take the output and apply it without checking.
How do you check that the deterministic rules you use to clean textual data are always correct? I have seen cases where a rule works, but then another case shows up that you don't want to match, and the rule introduces an error. I think that in both cases, deterministic and AI, 100% cleaning cannot be achieved. I just thought AI could save a lot of time instead of me inventing all the rules myself.

1 Like

I personally don't think it can be avoided. Again, the main problem with AI is its non-deterministic behavior.

In any case, try the "Ask ChatGPT" action when we add it in the next version (5.6.1). It would be interesting to see what uses it may find.

We also have a few ideas for experimenting with auto-ML models, but that's not on the roadmap yet. They can be good for data classification and pattern-matching.

1 Like

Hi
Thanks for the responses. I think we all agree there's no magic solution.
I am looking forward to testing those new actions.

1 Like

I have been working extensively with ChatGPT Plus with Advanced Data Analysis enabled, with regard to data. My initial impression is that, specifically for data and data analysis, it is quite good and quite accurate for the tasks I have tried. Essentially, it takes plain English, converts it into Python commands, and runs Python in the background: pandas, NumPy, PyTorch, matplotlib, seaborn, sweetviz, etc. That said, I do not believe it is a solution for ETL. Quick analysis, fast formatting, data profiling, etc., to get a better understanding of data before passing it into a well-governed process, seem to be its strengths. It is very fast, and you can do quite interesting things.
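Roughly the kind of code it writes behind the scenes when you ask it to profile a file (the file and column names below are made up):

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file

print(df.shape)                                        # rows and columns
print(df.dtypes)                                       # inferred types per column
print(df.isna().mean().sort_values(ascending=False))   # share of missing values
print(df.describe(include="all").T)                    # basic statistics per column
print(df["country"].value_counts().head(10))           # top categories, assumed column
```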

@dgudkov You are absolutely correct in your comments and it should not be used for processes that need to have good governance and follow strict business rules. Maybe in the future, but not today.

2 Likes

@jcaseyadams thank you for sharing a practical perspective.

My approach would be to use AI for profiling data sets: assisting in finding patterns and outliers in new/unknown datasets, and then approaching data owners about what is right, wrong, acceptable, or to be investigated further. Then, based on this, build deterministic cleansing or quality rules. Over time it can be used to monitor how known data sets change, to see if new patterns surface.
I see AI more as an assistant in data analysis.
In EasyMorph it could be interesting to add the AI output in a separate column next to a deterministic quality-flag column, where the AI column could flag "something looks different here" / "maybe you should have a look".
I have not tried it yet, but it is on my to-do list.
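Conceptually something like this (a rough pandas sketch; the "AI" flag here is just a crude stand-in for whatever model or prompt would be used):

```python
import pandas as pd

df = pd.DataFrame({"vat_number": ["BE0123456789", "BE0987654321",
                                  "B0123456789", "BE 0123.456.789"]})

# Deterministic quality rule: the format we formally require
df["rule_ok"] = df["vat_number"].str.fullmatch(r"BE\d{10}")

def looks_different(value: str) -> bool:
    """Placeholder for an AI/anomaly score; here just a crude stand-in."""
    return not value.startswith("BE") or not value[2:].isdigit()

# Softer AI-style flag in its own column: "maybe you should have a look"
df["ai_flag"] = df["vat_number"].map(looks_different)

print(df)
```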

1 Like

Thank you, John, and welcome to the Community!

There are several places where ML can be used strategically (stop calling it artificial intelligence; they are all machine learning models). As mentioned earlier, classification of data is growing in accuracy and is very practical in scenarios where 100% accuracy is not required, or at a minimum it can provide the Human in the Loop with top recommendations.

Entity resolution/matching/de-duplication is also an area ripe for ML use. This also works for digital assets, such as identifying duplicate images by more than just file size, etc.
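As a rough illustration (standard-library Python only; a real entity-resolution model would be far more sophisticated), even simple string similarity can surface candidate duplicates for human review:

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["Acme Limited", "ACME Ltd.", "Acme Holdings BV", "Globex Corporation"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag candidate duplicates for human review; 0.7 is an arbitrary threshold
for a, b in combinations(names, 2):
    score = similarity(a, b)
    if score >= 0.7:
        print(f"{a!r} <-> {b!r}  (score {score:.2f})")
```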

Overall, anomaly detection is another area where ML can point the Data Analyst in the right direction, for digital assets as well as for text or numerical data sets.
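For example, something like scikit-learn's IsolationForest can flag the odd rows for the analyst to look at (the data and contamination value below are made up):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up numeric data: invoice amounts with a couple of outliers
amounts = np.array([[120.0], [135.5], [99.0], [110.0], [125.0], [9800.0], [0.01]])

model = IsolationForest(contamination=0.2, random_state=42)
labels = model.fit_predict(amounts)   # -1 = flagged as anomaly, 1 = looks normal

for value, label in zip(amounts.ravel(), labels):
    if label == -1:
        print(f"Have a look at {value}")   # the analyst decides what to do with it
```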

Finally, data extraction into separate columns is ramping up in accuracy, particularly within specific data domains.

The primary viewpoint should be how ML can make you a superhuman data analyst, not how it can automate things directly.

1 Like