Include column in error message for easier Troubleshooting [DONE]

Steve · October 1, 2017, 3:38am

I have a large csv file with over 2 millions rows of data.
I am using the Export to DB (MySQL) command to load the CSV file into table.
My problem is that when the load runs into an error, I get a generic error message saying the SQL is invalid near ’ ’
Sometimes I can infer which row is the offending one and filter it out.
But at other times, the row fragment is so simple, I can’t determine which column the value came from.

Each row in my csv file contains a unique “transaction id.” If this were included in the error message, it would be so incredibly easy for me to trouble shoot and fix individual rows. Is there a way to more easily determine which row is failing to load?

p.s. I tried using the bulk insert but this was worse in that it merely errors out with “A fatal error has occurred.” It gives me nothing to work wiSo I won’t be using bulk insert at present.

dgudkov · October 1, 2017, 4:53pm

Great point. I believe it should be possible for us to at least indicate the range of rows where an error occured, since the export is done in batches of 10/10/1000 rows. We’ll see if we can do more that that.

Steve · October 2, 2017, 1:42pm

Hi Dmitry,

Great!
Two things –

If you don’t do more that, could you somehow allow the failed batch of rows to be exported to a file. Again, the reason is because of the size of the file, I can’t easily open it in Excel or a text editor. So if I can isolate the batch, it’ll make it easier for me to find the offending row(s).
Question – I know the Export to DB is Atomic mean that it all succeeds or all fails. Is there another way to process a file where this is not the case? I ask because in my humble opinion, it seems to go against the…”philosophy” of ETL, meaning you should be able to configure a process that loads good rows successfully and puts “bad” rows in a separate area to be reviewed. With the way it is right now, it seems like I’ll have to create a custom process for each file/datasource. Instead, I feel like I should be able to create a single process that will work for any file the adheres to a certain column layout. Hope this makes sense.
BTW: I really love this product. The ability to easily troubleshoot/isolate problem rows is the only thing stopping me for paying for it right now.

Thanks!

Steve

andrew.rybka · October 2, 2017, 1:53pm

Hi Steve,

Can you please post to the community some of those values which where making Export to MySQL to fail?

Thanks.

dgudkov · October 2, 2017, 4:52pm

To clarify on this request – this is to make sure the failure doesn’t happen because EasyMorph generates SQL incorrectly.

dgudkov · October 2, 2017, 5:15pm

Hi Steve,

It looks like we will be able to pinpoint the failing row. Will confirm it later.

I understand that you're coming from a perspective of an interactive workflow where there is a "revision step" to manually deal with failed rows. We can introduce a mode in which the Export to DB transformation would always be successful, but its resulting dataset would contain failed batches (or be empty if no errors). In this case additional actions with the failed batches can be taken, e.g. export into a text file, or some ad-hoc morphing to fix data quality issues.

Another possible solution is to try to detect failures even before exporting. We can make a transformation to check data type inconsistencies and filter rows that fail the checks. However such transformation wouldn't cover other possible reasons for an export to fail, such as constraint violations.

Thoughts?

Thank you! Appreciate that you reached to us and shared your suggestions. This is very helpful.

Steve · October 4, 2017, 2:55am

Of course, I completely understand.
Sorry for the delay. Busy week and then it took me a little while to realize how I could share some of the rows.
The source file is 8GB and I don’t have enough to isolate the single row causing the failure. And even if I could isolate the single row, Excel doesn’t appear interested in even trying to open the file (it literally does nothing - no error, just shows an hour glass for a sec, changes it mind then opens to a blank worksheet.) (I can almost swear I hear it saying, “ummmm…nope!”)

Anyway, here’s what I’ve done. I reversed the filter transformation to KEEP ONLY these rows where VendorName = “FRANKLIN COVEY COMPANY” (whereas before I was removing these rows. I then added an “Export to delimited text file” transformation step. The resulting file is here:
bad_importrows.csv (1.8 MB)

Disclaimer: it is possible that doing it this way changes the original “shape” of the row and the problem gets “transformed out” somehow. I didn’t see a transformation step that would just let me dump the original row into a textfile. If there’s a way to do this let me know and I’ll do so.

I’m also including the sql statement to create the table I’m trying to load the data into:
creeate_table.txt (3.5 KB)
I am using MySQL v 5.5.57 on Windows 7 Pro 64bit.

If there is a facility somewhere where I can upload/FTP an 8GB file for the community to attempt to load, let me know and I’ll kick it off before I go to bed (I’m in CST timezone.)

Let me know what else I can provide.

Thanks.

Steve · October 4, 2017, 3:05am

Thoughts:
Of the two possible solutions, I like the first.
My thinking is that no matter how diligent we (think we) are, no matter how much we scrub the data, put checks and filters in etc, there will always be that one scenario we didn’t think of. In your example, you mentioned creating a transformation that checks data type inconsistencies. It is possible that the data types match, however the length of the field is too short (e.g. VARCHAR(255) when the actual data is 256 chars in length.) I think you’ll be forever adding transformations in an attempt to allow users to catch everything.
Your first option will allow you and your users to maintain our sanity: “Here are the bad rows - now go do something with them, or to your database to accept them.” We users can then adjust the file to abide by EasyMorphs constraints.
Just my two cents worth.

Steve

andrew.rybka · October 4, 2017, 1:12pm

Steve, thank you for the file.

The source of the problem are backslashes at the end of several text values.

This is surely a bug. We will be able to fix it in the next release of EasyMorph.

For now you can remove those backslashes with Modify column transformation.

Here is an example project which removes trailing backslashes from the problematic column:
RemoveTrailingBackslashes.zip (1.8 MB)

I have been able to export result dataset from that project with no errors.

As for reporting failing batches and/or failing row - we are still considering our options but it looks like this feature will be implemented in one of the following releases.

dgudkov · October 5, 2017, 2:54am

@Steve,

The bug has been fixed. Please get the updated version 3.6.1 from our download page: EasyMorph | Download free ETL tool

dgudkov · October 24, 2017, 3:09am

@Steve,

You made a very good suggestion and we rushed to have it done a.s.a.p. We will be adding this capability in minor release 3.6.2. There will be a new mode in "Export to DB" that will keep only "bad" batches. It won't locate the exact row that failed, but you can make batches as small as 10 rows which should help locating the failing row. Further, the "bad" batches can be exported to a text file, or analyzed in EasyMorph.

Also, in the regular export mode (that halts on 1st failed batch) EasyMorph will try to pinpoint failed row by automatically attempting to insert batches of smaller size until it finds the failed row, which number will be shown in the error message.

See the screenshot below to get a better idea what it's going to look like.

Would that make working with failing exports easier? Is anything missing?

Thanks!

Steve · October 24, 2017, 8:43pm

Hello,

This looks very promising!
I especially like the functionality of having EasyMorph trying to pinpoint the failed row.
Also, let me make sure I understand something.
Let’s say I have a file with 100 rows, and I set the batch size to 10.
If 2 rows in two different batches fail, then EasyMorph will have two fail batches for a total of 20 rows.
And I can export these 20 rows into a text file.
Will I still be able to go on and do other transformations to the 80 rows that successfully loaded?
In other words, will I be able to have two paths? One path for the failed rows, and another for the successful rows?

Thanks for fast tracking this!
Steve

Steve · October 24, 2017, 8:45pm

BTW, I never said thank you for this bug fix either. I was able to run all my files through - 60GBs worth - with no further issues. I was SHOCKED.
Thank you so much!

dgudkov · October 24, 2017, 9:29pm

@Steve,

In this case you will have the original dataset (100 rows) and the failed batches (20 rows). You should be able to produce a dataset with the successfully exported 80 rows by excluding the "bad" records (20) from the full datset (100). For instance, using the Keep Mismatching transformation on a key field, or a combination of fields.

Any possible problems/inconveniences with this approach?

Thanks!

Steve · October 25, 2017, 1:03am

Hi, I think there is a possible inconvenience with this approach. It sounds like an extra step is needed in order to filter out the batches that failed and to use only the batches that succeeded. Remember, as a user I expect that successful rows will be “carried forward” unless I specifically take some action or add a transformation to do otherwise. Therefore, I don’t want to have to do another comparison to focus only on the successful rows. I would expect the successful batches to be available in the next transformation automatically.
Said another way, if a row or a batch has failed in a transformation, I don’t want it present in future steps as it may cause problems downstream. I want to catch them and fix them at the earliest possible place in my process.
Let me know if I have misunderstood something or if this isn’t clear.

dgudkov · October 25, 2017, 2:01am

I see your point. Yes, if the successfully exported rows are needed for further transformations that would be inconvenient, I agree.

What if instead of keeping only “bad” batches, all rows will be kept and a new column will be added that will flag rows that belong to failed batches? In this case all rows can further be split into two datasets (successful/failed rows) using two derived tables and filters on the flag. Or, if only failed rows (or only exported ones) needed, one filter would suffice.

Thank you for keeping the conversation rolling!

PS. This approach has another benefit that it plays well with conditionally derived tables that will appear in ver.3.7. Such table will be able to skip all its transformations (without execution) if a condition is not satisfied (for instance no rows flagged as “failed”). It will make it convenient to implement logic such as “if there are failed rows, dump them into a file and send an email notification, otherwise continue as usually”.

Steve · October 25, 2017, 2:40am

BINGO.
Do that.
It allows the user identify the two populations (successful vs failed rows), allows the user to select a different transformation path for each, its intuitive and it ties into future functionality. It’s a win all the way around.
Nice idea.

dgudkov · October 25, 2017, 3:05am

Thank you!