
Map string and diacritical character mapping


#1

Hi,

How can we handle diacritics and map strings in a specific field?
For example, in a text field I would like to replace characters based on a specific mapping:
ISO Latin 1 decimal code | ISO Latin 1 character | ASCII map character | Description
192 | À | a | Capital A, grave accent
193 | Á | a | Capital A, acute accent
194 | Â | a | Capital A, circumflex accent
195 | Ã | a | Capital A, tilde
196 | Ä | a | Capital A, dieresis or umlaut mark
197 | Å | a | Capital A, ring
198 | Æ | a | Capital AE diphthong
199 | Ç | c | Capital C, cedilla
200 | È | e | Capital E, grave accent
201 | É | e | Capital E, acute accent
202 | Ê | e | Capital E, circumflex accent
203 | Ë | e | Capital E, dieresis or umlaut mark
204 | Ì | i | Capital I, grave accent
205 | Í | i | Capital I, acute accent
205 Í i Capital I, acute accent

A table of standard diacritical character mappings can be found here:
https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html

Nested replace() calls are quite hard to manage when the files contain many different languages.
We may also need to replace one character with several, e.g. œuf => oeuf.
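
To make the requirement concrete, here is a minimal sketch in Python (not EasyMorph) of a table-driven replacement that avoids nested replace() calls. The names CHAR_MAP and map_chars are hypothetical, the entries follow a few of the sample rows above, and the "œ" entries are an assumption added to cover the one-to-many case.

```python
import unicodedata

# Hypothetical mapping table: one source character -> replacement text.
# Entries follow the sample rows above; "œ"/"Œ" are added for the
# one-character-to-several case (œuf => oeuf).
CHAR_MAP = {
    "À": "a", "Á": "a", "Â": "a", "Ã": "a", "Ä": "a", "Å": "a",
    "Ç": "c",
    "È": "e", "É": "e", "Ê": "e", "Ë": "e",
    "Ì": "i", "Í": "i",
    "œ": "oe", "Œ": "oe",
}

# str.maketrans accepts a dict whose keys are single characters and
# whose values may be strings of any length.
TABLE = str.maketrans(CHAR_MAP)


def map_chars(text: str) -> str:
    """Replace every mapped character in one pass; others pass through."""
    # Compose "E" + combining accent into a single "É" first, so that
    # decomposed input still matches the single-character keys.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(TABLE)


print(map_chars("œuf"))
```

Adding a language then only means adding rows to the table, not more nesting.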

Regards


#2

At this point I can only think of splitting all the words into 15-20 single-character columns (enough to accommodate the longest word) using the “Split fixed width text” transformation, then using Lookup with a lookup table for every column (ugly, I know), then concatenating everything back.
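
The split / lookup / concatenate steps can be sketched outside EasyMorph roughly as follows. This is only an illustration in Python, not how the transformations are implemented; LOOKUP, split_lookup_concat and the default width of 20 are hypothetical.

```python
# Hypothetical lookup table (single character -> replacement text).
LOOKUP = {"é": "e", "ç": "c", "œ": "oe"}


def split_lookup_concat(value: str, width: int = 20) -> str:
    # Step 1: split into `width` single-character columns,
    # padding short values with empty strings.
    columns = [value[i] if i < len(value) else "" for i in range(width)]
    # Step 2: look up each column; unmapped characters are kept as-is.
    mapped = [LOOKUP.get(ch, ch) for ch in columns]
    # Step 3: concatenate the columns back into one value.
    return "".join(mapped)


print(split_lookup_concat("garçon"))
```

In this sketch any value longer than `width` is cut off, which is why the number of columns has to cover the longest word.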

Note taken on a special-purpose transformation for this.


#3

Perhaps the mapping sets could be stored in a shared repository, like the connectors:
Mapping 1: Scandinavian
Mapping 2: Eastern European
Mapping 3: …
This way all projects would use the same mappings.
I haven’t seen this feature in any data wrangling tool, but it’s a must-have for sentiment/survey analysis.
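
As a rough illustration of the idea (in Python, not an EasyMorph feature), named mapping sets could live in one shared definition that every project loads. The set names, the entries and the apply_mapping helper below are all hypothetical.

```python
import json

# Hypothetical shared repository content; in practice this could be a
# JSON file that every project reads. Names and entries are examples only.
REPOSITORY_JSON = """
{
  "scandinavian": {"å": "a", "ä": "a", "ö": "o", "ø": "o", "æ": "ae"},
  "eastern_european": {"ł": "l", "č": "c", "š": "s", "ž": "z", "ő": "o"}
}
"""

# Build one translation table per named mapping set.
MAPPING_SETS = {
    name: str.maketrans(mapping)
    for name, mapping in json.loads(REPOSITORY_JSON).items()
}


def apply_mapping(text: str, set_name: str) -> str:
    """Apply the named mapping set; raises KeyError for an unknown name."""
    return text.translate(MAPPING_SETS[set_name])


print(apply_mapping("smørrebrød", "scandinavian"))
```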

Regards