Map string and diacritical character mapping [DONE]

Hi,

How could we manage diacritics and map strings on a specific field?
For example, in a text field I would like to replace characters based on a specific mapping:
ISO Latin 1 decimal code | ISO Latin 1 character | ASCII map character | Description
192 | À | a | Capital A, grave accent
193 | Á | a | Capital A, acute accent
194 | Â | a | Capital A, circumflex accent
195 | Ã | a | Capital A, tilde
196 | Ä | a | Capital A, dieresis or umlaut mark
197 | Å | a | Capital A, ring
198 | Æ | a | Capital AE diphthong
199 | Ç | c | Capital C, cedilla
200 | È | e | Capital E, grave accent
201 | É | e | Capital E, acute accent
202 | Ê | e | Capital E, circumflex accent
203 | Ë | e | Capital E, dieresis or umlaut mark
204 | Ì | i | Capital I, grave accent
205 | Í | i | Capital I, acute accent

A table of classical diacritical character mappings can be found here:
https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html

Nested replace() calls are quite hard to manage when files contain many different languages.
We may also need to replace one character with several, e.g. œuf => oeuf.
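
To illustrate the kind of table-driven replacement I have in mind, here is a minimal Python sketch (not EasyMorph syntax). The names DIACRITIC_MAP and map_diacritics are made up for the example, and the table holds only a few rows from the list above plus the œ => oe case.

```python
# Hypothetical mapping table: a few rows from the list above plus the
# one-to-many case (œ -> oe). It is not a complete mapping.
DIACRITIC_MAP = {
    "À": "a", "Á": "a", "Â": "a", "Ã": "a", "Ä": "a", "Å": "a",
    "Ç": "c",
    "È": "e", "É": "e", "Ê": "e", "Ë": "e",
    "Ì": "i", "Í": "i",
    "œ": "oe",
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values may be strings of any length.
TRANSLATION_TABLE = str.maketrans(DIACRITIC_MAP)


def map_diacritics(text: str) -> str:
    """Replace every mapped character in one pass; leave the rest untouched."""
    return text.translate(TRANSLATION_TABLE)


print(map_diacritics("œuf"))         # oeuf
print(map_diacritics("Éric Çelik"))  # eric celik
```

The whole mapping lives in one table, so adding a language means adding rows rather than nesting another replace().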

Regards

At this point I can only think of splitting all the words using the “Split fixed width text” transformation into 15-20 single-character columns (enough to accommodate the longest word), then doing a Lookup against a lookup table for every column (ugly, I know), then concatenating everything back.

Note taken on a special-purpose transformation for this.

Perhaps with the mapping sets stored in a repository, like the connectors:
Mapping 1: Scandinavian
Mapping 2: Eastern European
Mapping 3:…
This way all the projects would have the same mapping.
I haven’t seen this feature in any data wrangling tool, but it’s a must-have for sentiment/survey analysis.
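
As a rough sketch of the named mapping sets idea, here it is in Python rather than in any existing EasyMorph feature. The set names, their contents and the function apply_mapping_set are all hypothetical and only illustrate selecting a shared mapping by name.

```python
# Hypothetical named mapping sets; the names and contents are illustrative
# only and are far from complete.
MAPPING_SETS = {
    "scandinavian": {"å": "a", "ä": "a", "ö": "o", "ø": "o", "æ": "ae"},
    "eastern_european": {"ł": "l", "č": "c", "š": "s", "ž": "z", "ő": "o"},
}


def apply_mapping_set(text: str, set_name: str) -> str:
    """Apply one shared mapping set, selected by name, to a text value."""
    mapping = MAPPING_SETS[set_name]  # raises KeyError for an unknown set
    return text.translate(str.maketrans(mapping))


print(apply_mapping_set("smørrebrød", "scandinavian"))  # smorrebrod
print(apply_mapping_set("čaša", "eastern_european"))    # casa
```

Because every project would read the same named set, a correction made to a set once applies everywhere.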

Regards

Now available in the 4.7 version.
A rare feature that makes a real difference compared with competitors.

Thank you

Thank you, Christophe. Does the “Replace with lookup” action work as expected?

As usual, perfect…

Just cleaned a name list of 42,000 employees in a few seconds.
I’ve attached a useful list for diacritic mapping:
diacritic.dset (7.5 KB)

Very nice and useful dataset!
@dgudkov: I noticed that some programming languages have a function that converts characters with diacritics into their plain equivalents. Maybe that could be added to EasyMorph as well?
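
For reference, this is roughly what such a function looks like in Python, using the standard unicodedata module. It is only a sketch: the helper name strip_diacritics is made up, and the approach is Unicode decomposition rather than a lookup table, so it behaves differently from the mapping above.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Éric Çelik"))  # Eric Celik
print(strip_diacritics("œuf"))         # œuf (the ligature has no decomposition)
```

Note that this keeps the original letter case and leaves ligatures such as œ untouched, so a lookup table is still needed for cases like œuf => oeuf.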

Here is a project to inject the dataset into the shared memory repository and to get the list back.

Regards

diacritic shared mem.morph (9.3 KB)

Good idea, but we need to upgrade to that version first :).