Map string and diacritical character mapping [DONE]

cvo · October 10, 2017, 12:57pm

Hi,

How could we manage diacritics and map strings on a specific field?
For example, in a text field I would like to replace characters based on a specific mapping:
ISO Latin 1 decimal code | ISO Latin 1 character | ASCII map character | Description
192 | À | a | Capital A, grave accent
193 | Á | a | Capital A, acute accent
194 | Â | a | Capital A, circumflex accent
195 | Ã | a | Capital A, tilde
196 | Ä | a | Capital A, dieresis or umlaut mark
197 | Å | a | Capital A, ring
198 | Æ | a | Capital AE diphthong
199 | Ç | c | Capital C, cedilla
200 | È | e | Capital E, grave accent
201 | É | e | Capital E, acute accent
202 | Ê | e | Capital E, circumflex accent
203 | Ë | e | Capital E, dieresis or umlaut mark
204 | Ì | i | Capital I, grave accent
205 | Í | i | Capital I, acute accent
…

A table of classical diacritical character mappings can be found here:
https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html

Nested replace() calls are quite hard to manage if we have files in many different languages.
We may also need to replace one character with several, e.g. œuf => oeuf.
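
To illustrate the kind of table-driven replacement being asked for, here is a minimal Python sketch (not EasyMorph syntax). The names `DIACRITIC_MAP` and `map_diacritics` are hypothetical, and the mapping holds only a few entries taken from the table above plus the œ case. It shows that a single lookup table can replace a chain of nested replace() calls and can map one character to several.

```python
# Illustrative mapping only: a few rows from the table above plus "œ".
# A real project would load the full table from a file or dataset.
DIACRITIC_MAP = {
    "À": "a", "Á": "a", "Â": "a", "Ã": "a", "Ä": "a", "Å": "a",
    "Ç": "c",
    "È": "e", "É": "e", "Ê": "e", "Ë": "e",
    "Ì": "i", "Í": "i",
    "œ": "oe",  # one character replaced by two
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values are replacement strings of any length.
TRANSLATION_TABLE = str.maketrans(DIACRITIC_MAP)


def map_diacritics(text: str) -> str:
    """Replace every mapped character in one pass; leave the rest as-is."""
    return text.translate(TRANSLATION_TABLE)


if __name__ == "__main__":
    print(map_diacritics("œuf"))   # oeuf
    print(map_diacritics("Élan"))  # elan
```

Adding a language then means adding rows to the mapping, not another level of nesting.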

Regards

dgudkov · October 10, 2017, 1:42pm

At this point I can only think of splitting all the words with the “Split fixed width text” transformation into 15-20 single-character columns (enough to accommodate the longest word), then using Lookup with a lookup table for every column (ugly, I know), and finally concatenating everything back.
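
For readers who want to see the logic of this workaround outside EasyMorph, the three steps (split, look up, concatenate) can be sketched in Python as follows. This is only an analogy for the workflow described above, not how EasyMorph implements it; the function name and the sample lookup table are hypothetical.

```python
def split_lookup_concat(value, lookup):
    """Mimic the split / lookup / concatenate workaround for one value."""
    # Step 1: split the value into single-character "columns".
    columns = list(value)
    # Step 2: look up each column; characters without a match are kept.
    mapped = [lookup.get(ch, ch) for ch in columns]
    # Step 3: concatenate the columns back into one string.
    return "".join(mapped)


if __name__ == "__main__":
    sample_lookup = {"É": "e", "œ": "oe"}  # illustrative entries only
    print(split_lookup_concat("Éric", sample_lookup))  # eric
    print(split_lookup_concat("œuf", sample_lookup))   # oeuf
```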

Note taken on a special-purpose transformation for this.

cvo · October 10, 2017, 2:14pm

Maybe the mapping sets could be stored in a repository, like the connectors:
Mapping 1: Scandinavian
Mapping 2: Eastern European
Mapping 3:…
This way all projects would share the same mappings.
I haven’t seen this feature in any data wrangling tool, but it’s a must-have in sentiment/survey analysis.
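
As a rough picture of what shared, named mapping sets could look like, here is a small Python sketch. Everything in it is an assumption for illustration: the set names, the handful of entries in each set, and the `apply_mapping_set` helper. It does not describe an EasyMorph repository format.

```python
# Hypothetical named mapping sets; the entries are illustrative, not complete.
MAPPING_SETS = {
    "scandinavian": {"å": "a", "ä": "a", "ö": "o", "ø": "o", "æ": "ae"},
    "eastern_european": {"ł": "l", "č": "c", "š": "s", "ž": "z", "ő": "o"},
}


def apply_mapping_set(text, set_name):
    """Apply one named mapping set to a string."""
    table = str.maketrans(MAPPING_SETS[set_name])
    return text.translate(table)


if __name__ == "__main__":
    print(apply_mapping_set("smørrebrød", "scandinavian"))  # smorrebrod
    print(apply_mapping_set("čaj", "eastern_european"))     # caj
```

Keeping the sets in one place means every project that refers to a set by name gets the same replacements.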

Regards

cvo · April 8, 2021, 10:55am

Now available in version 4.7.
A rare feature that sets it apart from competitors.

Thank you

dgudkov · April 8, 2021, 11:08am

Thank you, Christophe. Does the “Replace with lookup” action work as expected?

cvo · April 9, 2021, 8:46am

As usual, perfect…

Just cleaned a name list of 42,000 employees in a few seconds.
Attached is a useful list for diacritic mapping:
diacritic.dset (7.5 KB)

reynsnivea · November 9, 2021, 7:10am

Very nice and useful dataset!
@dgudkov: I noticed that some programming languages have a function that converts characters with diacritics into their plain equivalents. Maybe that could be added to EasyMorph as well?
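
Python's standard library is one example of the kind of function mentioned here. The sketch below uses `unicodedata.normalize` to decompose characters and then drops the combining marks; the helper name `strip_diacritics` is hypothetical. Unlike the lookup table above, it preserves case, and it leaves characters that have no Unicode decomposition (such as œ or Æ) untouched, so a lookup list is still needed for those.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining accents after canonical decomposition (NFD)."""
    # NFD splits "É" into "E" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("Élodie Müller"))  # Elodie Muller
    print(strip_diacritics("œuf"))            # œuf (no decomposition exists)
```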

cvo · November 9, 2021, 11:15am

Here is a project to inject the dataset into the shared memory repository and to get the list back.

Regards

diacritic shared mem.morph (9.3 KB)

reynsnivea · November 9, 2021, 12:05pm

Good idea, but we need to upgrade to that version first :)