I have a large csv file with over 2 millions rows of data.
I am using the Export to DB (MySQL) command to load the CSV file into table.
My problem is that when the load runs into an error, I get a generic error message saying the SQL is invalid near ’ ’
Sometimes I can infer which row is the offending one and filter it out.
But at other times, the row fragment is so simple, I can’t determine which column the value came from.
Each row in my csv file contains a unique “transaction id.” If this were included in the error message, it would be so incredibly easy for me to trouble shoot and fix individual rows. Is there a way to more easily determine which row is failing to load?
p.s. I tried using the bulk insert but this was worse in that it merely errors out with “A fatal error has occurred.” It gives me nothing to work wiSo I won’t be using bulk insert at present.
Great point. I believe it should be possible for us to at least indicate the range of rows where an error occured, since the export is done in batches of 10/10/1000 rows. We’ll see if we can do more that that.
If you don’t do more that, could you somehow allow the failed batch of rows to be exported to a file. Again, the reason is because of the size of the file, I can’t easily open it in Excel or a text editor. So if I can isolate the batch, it’ll make it easier for me to find the offending row(s).
Question – I know the Export to DB is Atomic mean that it all succeeds or all fails. Is there another way to process a file where this is not the case? I ask because in my humble opinion, it seems to go against the…”philosophy” of ETL, meaning you should be able to configure a process that loads good rows successfully and puts “bad” rows in a separate area to be reviewed. With the way it is right now, it seems like I’ll have to create a custom process for each file/datasource. Instead, I feel like I should be able to create a single process that will work for any file the adheres to a certain column layout. Hope this makes sense.
BTW: I really love this product. The ability to easily troubleshoot/isolate problem rows is the only thing stopping me for paying for it right now.
It looks like we will be able to pinpoint the failing row. Will confirm it later.
I understand that you’re coming from a perspective of an interactive workflow where there is a “revision step” to manually deal with failed rows. We can introduce a mode in which the Export to DB transformation would always be successful, but its resulting dataset would contain failed batches (or be empty if no errors). In this case additional actions with the failed batches can be taken, e.g. export into a text file, or some ad-hoc morphing to fix data quality issues.
Another possible solution is to try to detect failures even before exporting. We can make a transformation to check data type inconsistencies and filter rows that fail the checks. However such transformation wouldn’t cover other possible reasons for an export to fail, such as constraint violations.
Thank you! Appreciate that you reached to us and shared your suggestions. This is very helpful.
Of course, I completely understand.
Sorry for the delay. Busy week and then it took me a little while to realize how I could share some of the rows.
The source file is 8GB and I don’t have enough to isolate the single row causing the failure. And even if I could isolate the single row, Excel doesn’t appear interested in even trying to open the file (it literally does nothing - no error, just shows an hour glass for a sec, changes it mind then opens to a blank worksheet.) (I can almost swear I hear it saying, “ummmm…nope!”)
Anyway, here’s what I’ve done. I reversed the filter transformation to KEEP ONLY these rows where VendorName = “FRANKLIN COVEY COMPANY” (whereas before I was removing these rows. I then added an “Export to delimited text file” transformation step. The resulting file is here: bad_importrows.csv (1.8 MB)
Disclaimer: it is possible that doing it this way changes the original “shape” of the row and the problem gets “transformed out” somehow. I didn’t see a transformation step that would just let me dump the original row into a textfile. If there’s a way to do this let me know and I’ll do so.
I’m also including the sql statement to create the table I’m trying to load the data into: creeate_table.txt (3.5 KB)
I am using MySQL v 5.5.57 on Windows 7 Pro 64bit.
If there is a facility somewhere where I can upload/FTP an 8GB file for the community to attempt to load, let me know and I’ll kick it off before I go to bed (I’m in CST timezone.)
Of the two possible solutions, I like the first.
My thinking is that no matter how diligent we (think we) are, no matter how much we scrub the data, put checks and filters in etc, there will always be that one scenario we didn’t think of. In your example, you mentioned creating a transformation that checks data type inconsistencies. It is possible that the data types match, however the length of the field is too short (e.g. VARCHAR(255) when the actual data is 256 chars in length.) I think you’ll be forever adding transformations in an attempt to allow users to catch everything.
Your first option will allow you and your users to maintain our sanity: “Here are the bad rows - now go do something with them, or to your database to accept them.” We users can then adjust the file to abide by EasyMorphs constraints.
Just my two cents worth.
You made a very good suggestion and we rushed to have it done a.s.a.p. We will be adding this capability in minor release 3.6.2. There will be a new mode in “Export to DB” that will keep only “bad” batches. It won’t locate the exact row that failed, but you can make batches as small as 10 rows which should help locating the failing row. Further, the “bad” batches can be exported to a text file, or analyzed in EasyMorph.
Also, in the regular export mode (that halts on 1st failed batch) EasyMorph will try to pinpoint failed row by automatically attempting to insert batches of smaller size until it finds the failed row, which number will be shown in the error message.
See the screenshot below to get a better idea what it’s going to look like.
Would that make working with failing exports easier? Is anything missing?
This looks very promising!
I especially like the functionality of having EasyMorph trying to pinpoint the failed row.
Also, let me make sure I understand something.
Let’s say I have a file with 100 rows, and I set the batch size to 10.
If 2 rows in two different batches fail, then EasyMorph will have two fail batches for a total of 20 rows.
And I can export these 20 rows into a text file.
Will I still be able to go on and do other transformations to the 80 rows that successfully loaded?
In other words, will I be able to have two paths? One path for the failed rows, and another for the successful rows?
In this case you will have the original dataset (100 rows) and the failed batches (20 rows). You should be able to produce a dataset with the successfully exported 80 rows by excluding the “bad” records (20) from the full datset (100). For instance, using the Keep Mismatching transformation on a key field, or a combination of fields.
Any possible problems/inconveniences with this approach?
Hi, I think there is a possible inconvenience with this approach. It sounds like an extra step is needed in order to filter out the batches that failed and to use only the batches that succeeded. Remember, as a user I expect that successful rows will be “carried forward” unless I specifically take some action or add a transformation to do otherwise. Therefore, I don’t want to have to do another comparison to focus only on the successful rows. I would expect the successful batches to be available in the next transformation automatically.
Said another way, if a row or a batch has failed in a transformation, I don’t want it present in future steps as it may cause problems downstream. I want to catch them and fix them at the earliest possible place in my process.
Let me know if I have misunderstood something or if this isn’t clear.
I see your point. Yes, if the successfully exported rows are needed for further transformations that would be inconvenient, I agree.
What if instead of keeping only “bad” batches, all rows will be kept and a new column will be added that will flag rows that belong to failed batches? In this case all rows can further be split into two datasets (successful/failed rows) using two derived tables and filters on the flag. Or, if only failed rows (or only exported ones) needed, one filter would suffice.
Thank you for keeping the conversation rolling!
PS. This approach has another benefit that it plays well with conditionally derived tables that will appear in ver.3.7. Such table will be able to skip all its transformations (without execution) if a condition is not satisfied (for instance no rows flagged as “failed”). It will make it convenient to implement logic such as “if there are failed rows, dump them into a file and send an email notification, otherwise continue as usually”.
It allows the user identify the two populations (successful vs failed rows), allows the user to select a different transformation path for each, its intuitive and it ties into future functionality. It’s a win all the way around.