How could we handle diacritics and map characters to replacement strings in a specific field?
For example, in a text field I would like to replace characters based on a specific mapping:
ISO Latin-1 decimal code   ISO Latin-1 character   ASCII map character   Description
192                        À                       a                     Capital A, grave accent
193                        Á                       a                     Capital A, acute accent
194                        Â                       a                     Capital A, circumflex accent
195                        Ã                       a                     Capital A, tilde
196                        Ä                       a                     Capital A, dieresis or umlaut mark
197                        Å                       a                     Capital A, ring
198                        Æ                       a                     Capital AE diphthong
199                        Ç                       c                     Capital C, cedilla
200                        È                       e                     Capital E, grave accent
201                        É                       e                     Capital E, acute accent
202                        Ê                       e                     Capital E, circumflex accent
203                        Ë                       e                     Capital E, dieresis or umlaut mark
204                        Ì                       i                     Capital I, grave accent
205                        Í                       i                     Capital I, acute accent
…
Nested replace() calls are quite hard to manage when we have files in many different languages.
We may also need to replace one character with several, e.g. œuf => oeuf.
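For comparison, outside EasyMorph a table-driven replacement like this can be sketched in Python with str.translate, which accepts one-to-many mappings. This is only an illustration: the name fold_chars is made up, the table holds just the rows listed above (keyed by their decimal codes, which are also the Unicode code points), and the œ entry (code point 339) is added for the one-to-many case.

```python
# Hypothetical mapping table keyed by decimal character code, copied from
# the rows listed above. Values may be strings of any length.
CHAR_MAP = {
    192: "a", 193: "a", 194: "a", 195: "a", 196: "a", 197: "a",
    198: "a",
    199: "c",
    200: "e", 201: "e", 202: "e", 203: "e",
    204: "i", 205: "i",
    339: "oe",  # œ (U+0153): one character replaced by two
}

# str.maketrans accepts a dict with integer code points as keys.
TRANSLATION = str.maketrans(CHAR_MAP)


def fold_chars(text: str) -> str:
    """Replace every mapped character in text; leave the rest unchanged."""
    return text.translate(TRANSLATION)


print(fold_chars("\u0153uf"))  # the word "œuf"
```

A single table like this replaces the whole chain of nested replace() calls, and adding a language only means adding rows.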
At this point I can only think of splitting all the words using the “Split fixed width text” transformation into 15-20 single-character columns (enough to accommodate the longest word), then doing a Lookup against a lookup table for every column (ugly, I know), then concatenating everything back.
Note taken on a special-purpose transformation for this.
Maybe with the mapping sets stored in a repository, like the connectors:
Mapping 1: Scandinavian
Mapping 2: Eastern European
Mapping 3:…
This way all the projects would have the same mapping.
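As a rough sketch of the idea (in Python, not an EasyMorph feature), named mapping sets could live in one shared place and be selected by name. Everything here is hypothetical: the MAPPING_SETS registry, the apply_mapping function, and the handful of entries in each set, which are illustrative and not complete.

```python
# Hypothetical shared registry of named mapping sets. The set names mirror
# the examples above; the entries are illustrative, not authoritative.
MAPPING_SETS = {
    "scandinavian": {"\u00e5": "aa", "\u00f8": "oe", "\u00e6": "ae"},  # å, ø, æ
    "eastern_european": {"\u0142": "l", "\u010d": "c", "\u0151": "o"},  # ł, č, ő
}


def apply_mapping(text: str, set_name: str) -> str:
    """Apply the named mapping set to text, character by character."""
    mapping = MAPPING_SETS[set_name]  # raises KeyError for an unknown set
    return "".join(mapping.get(ch, ch) for ch in text)


print(apply_mapping("sm\u00f8rrebr\u00f8d", "scandinavian"))  # "smørrebrød"
```

Because every project reads the same registry, a correction to one set is picked up everywhere.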
I haven’t seen this feature in any data wrangling tool, but it’s a must-have in sentiment/survey analysis.
Very nice and useful dataset! @dgudkov: I noticed that some programming languages have a function that converts characters with diacritics into their plain base characters. Maybe that could be added in EasyMorph as well?
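One example of such a function is Python's standard unicodedata module. A minimal sketch, assuming the goal is simply to drop accent marks (the name strip_diacritics is made up for illustration): decompose each character into a base letter plus combining marks, then discard the marks.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining accent marks, keeping the base characters."""
    # NFD splits e.g. "É" into "E" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that is still composable.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("\u00c0 \u00c9 \u00c7"))  # input is "À É Ç"
```

Note that this keeps the original case, and that characters without a Unicode decomposition, such as Æ, œ, ø and ł, pass through unchanged, so an explicit mapping table is still needed for those.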