Use AI to clean data

dgudkov · October 8, 2023, 7:14pm

There is a lot of hype about AI these days, and we take a rather careful approach to it. The problem with AI is that it's not deterministic: you don't know how it operates, so you can never be sure what the result will be. ETL, on the other hand, must be deterministic. There is no place for hallucinations in ETL (or in finance, btw). So I don't think it's a good idea to use AI for unsupervised, automated data cleansing. Nevertheless, AI can be used as a helper in a highly curated workflow that assumes manual intervention by a qualified human.

I don't think it's a good idea. I've seen a Tableau demo where they suggest that AI should generate a regex and, in my opinion, it demonstrates a lack of understanding of how AI can be used, even among large IT vendors.

If a user doesn't understand regex, they won't understand how the AI-generated regex works, and that's a straight road to bugs and errors, especially once the data changes slightly. Generative AI is only good when the user is qualified enough to detect when the AI generates bullshit. Otherwise, the consequences will be disastrous. Promoting AI-generated regex to people who don't understand it is nothing short of irresponsible and is clearly done for the wow effect and click generation.
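
To make the "data slightly changes" point concrete, here is a minimal Python sketch. The pattern, the function name and the sample strings are all made up for illustration; they are not taken from the Tableau demo or from any real dataset. It shows the kind of regex that looks fine on the sample data and then silently returns a wrong number once a thousands separator appears.

```python
import re

# Hypothetical pattern of the kind a generator might propose for
# "pull the amount out of this text": digits, a dot, two decimals.
AMOUNT_RE = re.compile(r"(\d+\.\d{2})")


def extract_amount(text):
    """Return the first amount found in text as a float, or None."""
    match = AMOUNT_RE.search(text)
    return float(match.group(1)) if match else None


# Works on the data the pattern was built against.
original = extract_amount("Total: 1234.56 USD")

# The source system starts writing thousands separators.
# No exception is raised; the function just returns a wrong value.
changed = extract_amount("Total: 1,234.56 USD")

print(original)  # 1234.56
print(changed)   # 234.56
```

Nothing fails loudly here, so a user who can't read the pattern has no reason to suspect the result.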

AI can be helpful for data classification, e.g. detecting language or sentiment. This can be helpful in automation rather than in data transformation, and it can potentially work well with boards and issues, as they allow including a human in the automation loop. We're going to add a simple OpenAI/ChatGPT integration soon to see where it goes.
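
A rough Python sketch of what "a human in the automation loop" could look like for classification. Everything here is an assumption for illustration: `classify_language` is a toy stand-in for a real AI call, and the confidence threshold and the review list are hypothetical, not a description of the planned integration.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float


def classify_language(text: str) -> Prediction:
    """Hypothetical stand-in for an AI classifier (toy keyword heuristic)."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return Prediction("en", 0.95)
    if words & {"der", "und", "ist"}:
        return Prediction("de", 0.95)
    return Prediction("unknown", 0.30)


def route(texts, threshold=0.8):
    """Accept confident predictions; send everything else to a human."""
    accepted, needs_review = [], []
    for text in texts:
        prediction = classify_language(text)
        if prediction.confidence >= threshold:
            accepted.append((text, prediction.label))
        else:
            # In a real workflow this would become an issue for a person.
            needs_review.append(text)
    return accepted, needs_review


accepted, needs_review = route(
    ["the invoice is late", "der Bericht ist fertig", "12345 ???"]
)
print(accepted)
print(needs_review)
```

The classifier only proposes a label. Anything it is unsure about goes to a person instead of flowing into the data.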