Best Practice for loading a huge file

Hi,
In a transformation, I have to import a very big CSV file (more than 5M rows) that changes only once a month, and many other small files (100 lines max) that change frequently.
Each time, I have to wait for the big CSV file to be imported.
What would you recommend to avoid or limit this time-consuming task (except at the beginning of each month)? Should I export the CSV to a MySQL database? Another solution?
Best,
Michel

Hi Michel,

Exporting to a database won’t help because the performance would be similar. Load the CSV file in EasyMorph and export it into a Qlik QVD file. That’s a compressed data format, and EasyMorph reads it faster than regular text files.
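(The same "convert once, reload fast" idea can be sketched outside EasyMorph as well. Below is a minimal Python illustration, with Parquet standing in for the compressed format since QVD is Qlik-specific; the file names are placeholders, and it assumes pandas with a Parquet engine such as pyarrow is installed.)

```python
import os
import pandas as pd

CSV_PATH = "big_file.csv"        # hypothetical path to the 5M-row monthly file
CACHE_PATH = "big_file.parquet"  # compressed copy, rebuilt only when the CSV changes

def load_big_file() -> pd.DataFrame:
    """Parse the CSV only when the cache is missing or stale; otherwise read the cache."""
    if (not os.path.exists(CACHE_PATH)
            or os.path.getmtime(CACHE_PATH) < os.path.getmtime(CSV_PATH)):
        df = pd.read_csv(CSV_PATH)       # slow, full parse (roughly once a month)
        df.to_parquet(CACHE_PATH)        # compressed columnar copy for later runs
        return df
    return pd.read_parquet(CACHE_PATH)   # fast reload on every other run

df = load_big_file()
print(len(df), "rows loaded")
```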

Thank you, Dmitry, for this advice.
Is an “Import a .tde file (Tableau)” transformation on the roadmap?
Michel

No, it's not. Tableau doesn't provide an API or a specification for reading .tde files, so we can't do much here.

If one day they start providing a specification we will surely add it.

BTW, starting from version 4.0 it will be possible to save loaded data right in EasyMorph projects, so that when you open a project its starting transformations already contain the last loaded data. It will also be possible to export to and import from a native EasyMorph format, which will be very fast to read and write.

That’s what I thought, too!

I have two questions on this topic.

  1. Will EasyMorph be much faster at loading huge CSV files from version 4.0 onwards?
  2. When will version 4.0 become available?

Kind regards!

No definite plans to speed up loading CSV files so far.

Presumably, by the end of this year or in early 2019.

The native EasyMorph file format will be available sooner than 4.0. Probably in 3.9.1. Check out our download page for short-term release plans.

Hi Dmitry,

I had another question that came to my mind…
Is there a way to subset a CSV file (or other file) before it is loaded into EasyMorph, or is this technically not possible?

That could be useful to first load a subset of the data in memory to see what it looks like.

Thanks in advance!

@reynsnivea, the Import from delimited text action has a “Maximum number of lines to load” option in the Advanced Options dialog.
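(Outside EasyMorph, the same preview trick is just a row limit on whatever reader you use; a minimal Python sketch, with the path and row count as placeholders and pandas assumed:)

```python
import pandas as pd

# Load only the first 100 rows to inspect column names and sample values quickly.
preview = pd.read_csv("big_file.csv", nrows=100)
print(preview.head())
```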

Thanks, I will check this out!

Hi Dmitry,

I just want to come back to my question about loading large CSVs. In practice, I think many data analysts work with these large CSV files. It’s not that convenient if one first needs to convert the CSV to another file type, especially one that isn’t really open, like the QVD format.

Why not improve the import speed for CSV and other text files?

Kind regards!

EasyMorph operates with compressed data in memory (that’s how it can fit large amounts of data in RAM). So when it loads a file, it compresses it into the internal format, which takes some time.

The QVD format and the future native format are already compressed. Therefore, they can be loaded much faster than a non-compressed data file.

Using big CSV files as the main data store is typically not a good idea because they always need to be either converted into something or parsed, which hurts performance. The better approach is usually either to load them into a database or to convert them to a file format suitable for fast loading.
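(As an illustration of the "load it into a database once" route, not an EasyMorph feature, here is a rough Python sketch; the file, database, and table names are made up, and it assumes pandas plus the standard sqlite3 module. The point is that later runs query only the rows they need instead of re-parsing the whole CSV.)

```python
import sqlite3
import pandas as pd

# One-time, chunked import of the big CSV into a local SQLite database.
conn = sqlite3.connect("monthly_data.db")
for i, chunk in enumerate(pd.read_csv("big_file.csv", chunksize=500_000)):
    chunk.to_sql("monthly_data", conn,
                 if_exists="replace" if i == 0 else "append", index=False)

# Later runs fetch only what they need instead of re-parsing the CSV.
subset = pd.read_sql_query("SELECT * FROM monthly_data LIMIT 10", conn)
print(subset)
conn.close()
```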

Hi Dmitry,

Are there any other workarounds to load our CSVs faster or to convert them quickly to the native format?
Do we always have to read in the CSV first before converting, or is there a workaround?

When you go above 1 million records, it can take some time to load a CSV. Are there any improvements planned to speed this up? Loading via the native format only makes sense when the conversion is fast and when you have to load the dataset multiple times.

Also, the sample feature that limits the number of lines when loading a CSV would be better placed on the main screen of the transformation rather than in the popup, because one can easily forget that a limit on the number of records was set at some stage.

Kind regards!

No workarounds exist. A CSV has to be read entirely in order to be compressed and saved in the native format.

You can try splitting a big CSV into smaller files using the "Split delimited file" action. Then load multiple smaller CSVs using the "Load list of files" mode. But overall, the current performance of loading CSV files won't improve any time soon.
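(If you want to see the splitting idea spelled out in code, here is a rough Python sketch, not the EasyMorph action itself; the chunk size and file names are placeholders.)

```python
import csv

CHUNK_ROWS = 500_000  # rows per output file (placeholder value)

def write_part(part: int, header: list, rows: list) -> None:
    """Write one chunk to its own CSV, repeating the original header."""
    with open(f"big_file_part{part}.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

with open("big_file.csv", newline="", encoding="utf-8") as src:
    reader = csv.reader(src)
    header = next(reader)
    part, rows = 0, []
    for row in reader:
        rows.append(row)
        if len(rows) >= CHUNK_ROWS:
            write_part(part, header, rows)
            part, rows = part + 1, []
    if rows:  # flush the last, partially filled chunk
        write_part(part, header, rows)
```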

Hi Dmitry,

Would it be possible to invest some effort in optimizing the load from CSV? We work a lot with big CSV files.
It would be nice if the load time could be reduced…
These are CSVs that are processed maybe one or two times a year, so converting them to the native format isn’t really worthwhile.
Any new suggestions to deal with that ?

Kind regards

If you process them 1 or 2 times a year then I guess it’s not such a big problem from a practical perspective.

Good news! In version 4 loading CSV files should be much faster.

I tested loading a 100MB CSV file and in version 4 (beta .35) it was loaded 4 times faster than in v3.9.x.

Great news!!!