
Text Qualifier Assumption Bug


#1

Hi,

I have been using EasyMorph for the better part of two years, but I just ran into this issue. I am not sure if it is new, but it occurs in the latest version.

When importing data from a delimited text file, EasyMorph seems to make assumptions about text qualifiers. This causes parsing errors during data loads, so rows that are formatted correctly end up as fallout. It happens quite often with the data I work with, which is dirty and contains sporadic double quotes. I would suggest that the default be no assumption about text qualifiers, or that the qualifier be made an option (just like the delimiter in the import settings, or as in Excel's import of data from text). Please see the example below of how the bug occurs.

Header followed by data (pipe delimited, no text qualifier):
col 1|col 2|col 3|col 4|col 5
this|is|"the|example|data"

The above data will be interpreted as only 3 columns of data instead of 5 and, of course, produce a parsing error. I tried to attach an example project, but the forum will not allow it since I am new. Please see the example data and project settings below to recreate the issue.

Please let me know if there are any workarounds for this issue today or if any further information is needed.

Thank you for your help!
Nick

Settings:
Import from delimited text file
Encoding UTF-8
Separator Pipe
Advanced options -> check show errors

Data:
Column 1|Column 2|Column 3|Column 4|Column 5
this|is|an|"example|here"
This|is"|another|"example|here
"one|"more|example|right|here
Last|"example|right|here|five"
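
The column-count difference described in this post can be reproduced outside EasyMorph. The following is a minimal sketch that uses Python's csv module as a stand-in; it is an analogy and not EasyMorph's parser, and the helper name parse is hypothetical. It reads the first example line once with double quotes treated as a text qualifier and once with qualifiers disabled.

```python
import csv

# First example line from the post: pipe delimited, no intended text qualifier.
line = 'this|is|"the|example|data"'

def parse(text, quoting):
    # csv.reader accepts any iterable of strings, so a one-element list works.
    return next(csv.reader([text], delimiter="|", quotechar='"', quoting=quoting))

# Double quotes act as a text qualifier: the pipes between them are kept as data.
with_qualifier = parse(line, csv.QUOTE_MINIMAL)

# No text qualifier: every pipe splits a field and the quotes stay in the values.
without_qualifier = parse(line, csv.QUOTE_NONE)

print(len(with_qualifier), with_qualifier)
print(len(without_qualifier), without_qualifier)
```

With the qualifier honored, the line yields 3 fields; with it disabled, the line yields the expected 5.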


#2

Hi Nick,

The only workaround I can think of at the moment is to replace double quotes with some other symbol before loading, then replace them back after loading.

The attached example does exactly this: it uses a PowerShell command to replace double quotes (denoted as [char]34) with colons in your example data, saves the result into a temporary file (noquotes.txt), then loads it and replaces the colons back with double quotes using the "Table-wide replace" transformation (in loader.morph).

You will need to change the working directory in the "Run command" transformation to the folder that contains the unzipped files. Note also that PowerShell's replace command treats its pattern as a regular expression.
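
For readers without the attachment, the same mask, load, and restore idea can be sketched in Python. This is not the content of quotes.zip; the function name load_with_placeholder and the up-front check for an existing placeholder are assumptions added for illustration, and Python's csv module stands in for a parser that honors double quotes as text qualifiers.

```python
import csv
import io

# Sample data from the first post (pipe delimited, stray double quotes).
SAMPLE = "\n".join([
    'Column 1|Column 2|Column 3|Column 4|Column 5',
    'this|is|an|"example|here"',
    'This|is"|another|"example|here',
    '"one|"more|example|right|here',
    'Last|"example|right|here|five"',
]) + "\n"

def load_with_placeholder(text, placeholder=":"):
    # Assumption: the placeholder must not already occur in the data,
    # otherwise the restore step would turn real colons into quotes.
    if placeholder in text:
        raise ValueError("placeholder already occurs in the data")
    # Step 1: mask every double quote so the parser never sees a qualifier.
    masked = text.replace('"', placeholder)
    # Step 2: parse the masked text as ordinary pipe-delimited data.
    reader = csv.reader(io.StringIO(masked, newline=""), delimiter="|")
    # Step 3: put the double quotes back into every cell.
    return [[cell.replace(placeholder, '"') for cell in row] for row in reader]

rows = load_with_placeholder(SAMPLE)
for row in rows:
    print(len(row), row)
```

Every row comes back with 5 fields and the original double quotes intact. A colon is only safe as a placeholder when the data contains no colons, so a rarer character may be a better choice for real files.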

quotes.zip (1.8 KB)


#3

Hi Dmitry,

Thank you for the very quick workaround! I really appreciate your help and I just used it!

Would this behavior be expected or is this in fact a bug that will need to be fixed?

Thank you again,
Nick


#4

You're welcome, Nick.

This is not a bug. Double quotes take precedence over delimiters by design. This is exactly the reason they are needed in delimited data formats -- to encapsulate strings that contain characters that are used as field delimiters.
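
To illustrate why qualifiers take precedence, here is a small sketch in Python (again using the csv module as an analogy, not EasyMorph itself; the sample row is made up). A value that contains the delimiter is wrapped in double quotes on output, and only a parser that honors the qualifier can read it back as a single field.

```python
import csv
import io

# Hypothetical row whose second value contains the pipe delimiter.
row = ["widget", "red|blue", "42"]

# Writing: the csv module quotes only the field that needs it.
buf = io.StringIO(newline="")
csv.writer(buf, delimiter="|", lineterminator="\n").writerow(row)
encoded = buf.getvalue()

# Reading with the qualifier honored restores the original three fields.
decoded = next(csv.reader(io.StringIO(encoded, newline=""), delimiter="|"))

# Splitting on the delimiter alone breaks the quoted value apart.
naive = encoded.rstrip("\n").split("|")

print(repr(encoded))
print(decoded)
print(naive)
```

The same rule that protects "red|blue" here is what merges fields when a file contains stray, unpaired double quotes.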

I'd never encountered such cases previously. We will be adding a few advanced options to tweak parser behavior. Maybe we need to consider an option for disabling text qualifiers. At least, there is a relatively simple workaround for it.

