Split Delimited File on 31 GB CSV Gets Stuck

Good Morning,

I am using the Split delimited file action as recommended to consume a 31 GB CSV.

However many rows I split by, the process always seems to get stuck at what I think is the last split. EasyMorph doesn’t lock up and allows me to cancel, but it continues to show the progress window indefinitely.

There is plenty of disk space, and CPU and memory usage aren’t high.

Is there any way for me to determine what is happening here?

Thanks!

While the process is running, the last file generated is 0 KB in size and stays like this indefinitely. It is also locked at the OS level and cannot be deleted. When I cancel the EasyMorph project, the file then gains some size.

However, this file is not well formed and appears incomplete when viewed in VS Code.

When I try to delete this file, I can’t because it is locked by EasyMorph. Closing EasyMorph releases the file lock and lets me delete it in Windows Explorer.

I hope that helps.

Thanks

Thank you for the details, that’s certainly helpful. We’re investigating it and will get back to you.

That’s great, thanks Dmitry. If you need any further details please let me know.

Hello Nathan,

Unfortunately, we weren’t able to replicate this issue.

Can you please test it with a smaller chunk size (like 100K rows)? Will it get stuck too?
If so - which file will be the last one created?

Is it possible for you to share that CSV file with us? If so - maybe you’ll be able to upload it to some filesharing service? Or, at least, upload the last created file (the one that is not well formed and incomplete)?

Hi @andrew.rybka ,

I tried with 100,000 and 500,000 rows but got the same issue.

I’m checking with the relevant departments about sharing the data with you; however, it’s nothing particularly special.

Hi @dgudkov , @andrew.rybka ,

Unfortunately we would need to heavily redact this data which would render it useless for debugging purposes.

I am exploring what other ways we could facilitate this with our security and governance team. If I find one, I will email you directly with details.

Thanks for looking into this.

Hi Nathan,

Thank you for the update.

I tested the action with a 33 GB file using different batch sizes and it worked without any issues.

So it seems that the hanging is caused either by the contents or structure of the file or by something on your computer. Can you please try to run the action with the same file on a different computer?

Hi @andrew.rybka, I will run this today on another PC and see if there is any way I can examine the data in a bit more detail. It’s proving hard given its ridiculous size.

Hi @andrew.rybka,

I have run this on another device and I experience the same behaviour.

I have managed to import the CSV into SQL Server using SSIS to have a look at the full structure.

This CSV has a variable number of columns, unfortunately. There is a field that isn’t text qualified but sometimes can contain multiple commas (,), resulting in a variable number of columns on some rows. I wouldn’t expect this to affect the file splitting, though; I’d expect that function to be quite dumb and just split on rows regardless of columns. Still, this could be where the issue is.
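
To show what I mean by “dumb”, here is roughly the behaviour I had in mind (just my own sketch of the expectation, not EasyMorph’s actual implementation; the file names are placeholders):

# A naive splitter: write out chunks of N lines and ignore how many
# comma-separated values each line happens to have.
def split_by_row_count(path, rows_per_chunk=100_000):
    with open(path, "r", encoding="utf-8") as src:
        header = src.readline()
        out, chunk_no, rows = None, 0, 0
        for line in src:
            if out is None:
                chunk_no += 1
                out = open(f"chunk_{chunk_no:04d}.csv", "w", encoding="utf-8")
                out.write(header)
            out.write(line)
            rows += 1
            if rows == rows_per_chunk:
                out.close()
                out, rows = None, 0
        if out is not None:
            out.close()

split_by_row_count("export.csv")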

Does that all make sense?

Hi Nathan,

Thank you for the details.

Does this file also contain double-quote characters inside values?

The Split delimited text action parses text-qualified and non-text-qualified values differently, so incorrectly placed double-quote characters may result in unexpected behavior.
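
To illustrate the general idea (this is just how a typical quote-aware CSV parser behaves, not EasyMorph’s internal code), an opening double quote that is never closed makes the parser treat everything that follows, including line breaks, as part of a single value:

import csv, io

data = 'a,b,c\n1,"unclosed,x\n2,3,4\n5,6,7\n'

# The quote after "1," opens a quoted field that is never closed, so the
# parser folds the remaining lines into that one field: only 2 rows come
# out instead of 4.
for row in csv.reader(io.StringIO(data)):
    print(row)

On a file of tens of gigabytes, a parser that keeps scanning for a closing quote that never comes can look like it is hanging.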

Hi @andrew.rybka this CSV does not have qualified text (with quotes (")).

I would say this is not a well-formed CSV, but it’s not malformed either.

There is an option in the Import delimited text file action called “Add new columns when rows have more values than headers” that deals with this kind of eventuality, so my assumption as a user would be that the Split delimited file action should also be able to handle it in the same way that the Import delimited text file action does.
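
By “deals with this kind of eventuality” I mean behaviour roughly like the following (my own sketch of the idea, not how EasyMorph implements that option; the generated column names are made up):

# Pad the header and every data row so that all rows end up with the same
# number of columns: extra columns get generated names, short rows get
# empty values.
def normalize_rows(rows):
    width = max(len(row) for row in rows)
    header = rows[0] + [f"Column {i}" for i in range(len(rows[0]) + 1, width + 1)]
    body = [row + [""] * (width - len(row)) for row in rows[1:]]
    return [header] + body

print(normalize_rows([["a", "b"], ["1", "2"], ["3", "4", "5"]]))
# [['a', 'b', 'Column 3'], ['1', '2', ''], ['3', '4', '5']]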

Hi Nathan,

The Split by fixed row count mode doesn't try to parse lines that don't contain double-quote characters into separate values. I tested this mode with a 30 GB+ file with a varying number of values per row, and it worked without any issues.

You wrote that

There is a field that isn't text qualified but sometimes can contain multiple commas (,)

I assume that that field contains a text sentence, a product/record description, or something like that.

If so, is it possible that some of those values contain double quote characters inside them? I understand that the file in question doesn't have qualified text, but maybe it contains only a few double quote characters scattered across the file?

Hi @andrew.rybka

I have some more answers for you...

I assume that that field contains a text sentence, a product/record description, or something like that.

You are correct, we have a field that includes commas (,) but is not text qualified by quotes ("). This field is a string of text that contains commas (,):

MR Left Knee, MR Knee Right,,

The full row:

2010-08-26 14:48:45.936,Appliance,-12115,HGSHGS-QWE123123123,HITCHMAN,NATHAN M H SIMON,QWE123123123,3985211,MR Left Knee, MR Knee Right,,

If so, is it possible that some of those values contain double quote characters inside them? I understand that the file in question doesn’t have qualified text, but maybe it contains only a few double quote characters scattered across the file?

I have used findstr in CMD to search for this occurrence and found the following:

2010-08-26 14:48:45.936,Appliance,-12115,HGSHGS-QWE123123123,HITCHMAN,NATHAN M H "SIMON",QWE123123123,3985211,CR XR CHEST,,

There is an odd field:

NATHAN M H "SIMON"

There is another row with the same field but without the quotes:

2010-08-26 14:48:45.936,Appliance,-12115,HGSHGS-QWE123123123,HITCHMAN,NATHAN M H SIMON,QWE123123123,3985211,CR XR CHEST,,

I have even found a row with one quote ("):

2010-08-26 14:48:45.936,Appliance,-12115,HGSHGS-QWE123123123,HITCHMAN,NATHAN M H SIMON,QWE123123123,3985211, "Import of Radiograph Torso,L,

After this further analysis of this messy CSV, I would now definitely class it as malformed!
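
In case it’s useful, a scripted alternative to the findstr search above is to flag every line that contains an odd (unbalanced) number of double quotes, like the row above (just a sketch; the file name is a placeholder):

# Report lines whose double-quote count is odd, i.e. lines with an
# unmatched quote.
def find_unbalanced_quotes(path):
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line_no, line in enumerate(f, start=1):
            if line.count('"') % 2 == 1:
                print(f"line {line_no}: {line.rstrip()}")

find_unbalanced_quotes("export.csv")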

Hi Nathan,

Thank you for the details. I was finally able to replicate the issue. It seems that the single, unmatched double quote is causing the hanging.

I hope that we will be able to fix it and/or add a workaround in a few days.

Hi @andrew.rybka ,

That’s great, thanks for looking into that, apologies that I wasn’t able to provide the relevant detail sooner.

This isn’t a showstopper for me by the way as I used SSIS in this case.

I’m actually looking to make an initial purchase this month prior to a full team wide and server deployment.

Thanks for your help.

Hi Nathan,

You are welcome.

Additional research showed that the changes required to fix this issue are too extensive to be published as a hotfix to the current release. We want to publish those changes with the next release, which is planned for the start of July, if that’s OK with you.

Hi @andrew.rybka ,

Absolutely no problem. This was forming part of my trial and proof of concept so there is no operational demand on me.

Thanks


Hi, I’m also interested in this fix, though there’s no urgency. I removed the double quotes from my file and it finally worked, so we have the same issue on our side.