Built-In Data Normalization and Cluster Algorithms

Noah · June 13, 2024, 9:45pm

This feature would allow the user to select one of the common classical text similarity algorithms in order to cluster and ultimately merge/normalize similar values.

Here is a good overview of the classic cluster/text difference methods.

In other data tools I have seen options to have these methods calculate similarity scores for a field/column, create clusters of similar values based on a user-defined threshold, and find the most common version of the values in each cluster. These groups are then presented to the user, who can confirm that they should be normalized to the most prevalent value, select a new final value, or remove values from the clustered group. This can then be re-run on the same dataset as new data is added: if any new data matches the existing approved cluster groups, it is automatically normalized, and any new values get sent to a secondary job for review and addition to the ongoing approved clusters.
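To make the workflow above concrete, here is a minimal Python sketch. It is not how any particular tool implements this. It assumes the standard-library `difflib.SequenceMatcher` ratio as the similarity score, and the function names (`similarity`, `cluster_values`, `apply_approved`) and the 0.8 threshold are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_values(values, threshold=0.8):
    """Propose a canonical value for every distinct input value.

    The most frequent values are visited first, so each cluster is
    seeded by (and normalized to) its most prevalent member.
    """
    counts = Counter(values)
    clusters = []  # each cluster is a list of distinct values; [0] is the seed
    for value, _ in counts.most_common():
        for cluster in clusters:
            if similarity(value, cluster[0]) >= threshold:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return {v: cluster[0] for cluster in clusters for v in cluster}


def apply_approved(values, approved):
    """Re-run step: normalize known values, queue unknown ones for review."""
    normalized, needs_review = [], []
    for v in values:
        if v in approved:
            normalized.append(approved[v])
        else:
            needs_review.append(v)
    return normalized, needs_review


# Hypothetical sample data
proposed = cluster_values(
    ["Acme Corp", "Acme Corp", "Acme Corp.", "acme corp", "Globex"]
)
```

The `proposed` mapping is what would be shown to the user for confirmation. Once approved, `apply_approved` can be used on later loads so that only unseen values go to review.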

I can provide more details on a call as needed.

dgudkov · June 14, 2024, 9:29am

This looks closer to fuzzy text matching. Note that EasyMorph can already do fuzzy text matching using the Damerau-Levenshtein algorithm (also mentioned in the linked article). Adding more algorithms and making it work like clustering rather than matching certainly has value.
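For readers unfamiliar with the algorithm, below is a short Python sketch of the optimal string alignment variant of Damerau-Levenshtein distance (the restricted form, where a substring is never edited more than once). This is only an illustration and is not taken from EasyMorph. The name `osa_distance` is an assumption.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "hp" <-> "ph"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

A swapped pair of letters costs 1 here instead of the 2 that plain Levenshtein distance would give, which is why this family of algorithms suits typo-heavy data.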

How about clustering for numbers? Do you have a use for it?

Noah · June 14, 2024, 6:53pm

Yes, there is value for numbers as well. For example, when trying to normalize measurements:
0.33
.3
1/3
0.3
.33
could be clustered and normalized.
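A rough Python sketch of how such values might be grouped: it parses each string with the standard-library `fractions.Fraction` (which accepts forms like `.3` and `1/3`) and groups values that fall within a tolerance of the smallest value in the cluster. The function names and the 0.05 tolerance are assumptions chosen for the example.

```python
from fractions import Fraction


def parse_number(text: str) -> float:
    """Parse '0.33', '.3', or '1/3' into a float."""
    return float(Fraction(text.strip()))


def cluster_numbers(texts, tolerance=0.05):
    """Group numeric strings whose values lie within `tolerance`
    of the smallest value in their cluster."""
    parsed = sorted((parse_number(t), t) for t in texts)
    clusters = []
    for value, text in parsed:
        if clusters and value - clusters[-1][0][0] <= tolerance:
            clusters[-1].append((value, text))
        else:
            clusters.append([(value, text)])
    # Return the original strings so the user can review them as entered
    return [[text for _, text in cluster] for cluster in clusters]


# The values from the post, plus one hypothetical outlier
groups = cluster_numbers(["0.33", ".3", "1/3", "0.3", ".33", "2.5"])
```

Which value each group should be normalized to (0.3, 0.33, or 1/3) is left as a user decision, in line with the review step described earlier in the thread.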

Oftentimes it is a combination of numeric and text values:
0.3 HP
1/3 Horse Power
.3 hp
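One possible way to handle mixed values like these, sketched in Python: split each string into a numeric part and a unit part, then map the unit through an alias table. The `UNIT_ALIASES` table, the regular expression, and `parse_measurement` are all hypothetical. A real alias table would need to be built and approved by the user.

```python
import re
from fractions import Fraction

# Hypothetical alias table mapping unit spellings to one canonical unit
UNIT_ALIASES = {"hp": "hp", "horse power": "hp", "horsepower": "hp"}

# A leading number (fraction or decimal), then whatever text follows as the unit
PATTERN = re.compile(r"^\s*(\d+/\d+|\d*\.?\d+)\s*(.*?)\s*$")


def parse_measurement(text: str):
    """Return (numeric value, canonical unit), or None if no leading number."""
    match = PATTERN.match(text)
    if not match:
        return None
    number, unit = match.groups()
    unit_key = " ".join(unit.lower().split())  # lowercase, collapse spaces
    return float(Fraction(number)), UNIT_ALIASES.get(unit_key, unit_key)


parsed = [parse_measurement(t) for t in ["0.3 HP", "1/3 Horse Power", ".3 hp"]]
```

After this step all three values share the unit `hp`, so the numeric parts can be clustered separately with whatever tolerance the user approves.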

The algorithms alone are useful and could likely be implemented with calculations today, but clustering and resolving the clusters is where a project could get really complicated without more built-in support.