Best Practice for loading a huge file

Hi,
In a transformation, I have to import a very big CSV file (more than 5M rows) that changes only once a month, and many other small files (100 lines max) that change frequently.
Each time, I have to wait for the big CSV file to be imported.
What would you recommend to avoid or limit this time-consuming task (except at the beginning of each month)? Should I export the CSV to a MySQL database? Another solution?
Best,
Michel

Hi Michel,

Exporting to a database won’t help because the performance would be similar. Load the CSV file in EasyMorph and export it into a Qlik QVD file. That’s a compressed data format, and EasyMorph reads it faster than regular text files.
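(The same "convert once, reload fast" idea can be sketched outside EasyMorph as well. Below is a minimal Python illustration, with Parquet standing in for the compressed format since QVD is Qlik-specific; the file names are placeholders, and it assumes pandas with a Parquet engine such as pyarrow is installed.)

```python
import os
import pandas as pd

CSV_PATH = "big_file.csv"        # hypothetical path to the 5M-row monthly file
CACHE_PATH = "big_file.parquet"  # compressed copy, rebuilt only when the CSV changes

def load_big_file() -> pd.DataFrame:
    """Parse the CSV only when the cache is missing or stale; otherwise read the cache."""
    if (not os.path.exists(CACHE_PATH)
            or os.path.getmtime(CACHE_PATH) < os.path.getmtime(CSV_PATH)):
        df = pd.read_csv(CSV_PATH)       # slow, full parse (roughly once a month)
        df.to_parquet(CACHE_PATH)        # compressed columnar copy for later runs
        return df
    return pd.read_parquet(CACHE_PATH)   # fast reload on every other run

df = load_big_file()
print(len(df), "rows loaded")
```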

Thank you, Dmitry, for this advice.
Is an “Import a .tde file (Tableau)” transformation on the roadmap?
Michel

No, it's not. Tableau doesn't provide an API or a specification for reading .tde files, so we can't do much here.

If one day they start providing a specification we will surely add it.

BTW, starting from version 4.0 it will be possible to save loaded data right in EasyMorph projects, so that when you open a project its starting transformations already contain the last loaded data. It will also be possible to export to and import from a native EasyMorph format, which will be very fast to read and write.

That’s what I thought, too!

I have two questions on this topic.

  1. Will EasyMorph be much faster at loading huge CSV files from version 4.0 onwards?
  2. When will version 4.0 become available?

Kind regards!

No definite plans to speed up loading CSV files so far.

Presumably, by the end of this year or in early 2019.

The native EasyMorph file format will be available sooner than 4.0. Probably in 3.9.1. Check out our download page for short-term release plans.

Hi Dmitry,

I had another question that came to my mind…
Is there a way to subset a CSV file (or other file) before it is loaded into EasyMorph, or is this technically not possible?

That could be useful to first load a subset of the data in memory to see what it looks like.

Thanks in advance!

@reynsnivea, the Import from delimited text action has a “Maximum number of lines to load” option in the Advanced Options dialog.
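(Outside EasyMorph, the same preview trick is just a row limit on whatever reader you use; a minimal Python sketch, with the path and row count as placeholders and pandas assumed:)

```python
import pandas as pd

# Load only the first 100 rows to inspect column names and sample values quickly.
preview = pd.read_csv("big_file.csv", nrows=100)
print(preview.head())
```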

Thanks, I will check this out!

Hi Dmitry,

I just want to come back to my question about loading large CSVs. In practice, I think many data analysts work with these large CSV files. It’s not that convenient if one first needs to convert the CSV to another file type, especially one that isn’t really open, like the QVD format.

Why not improve the import speed for CSV and other text files?

Kind regards!

EasyMorph operates with compressed data in memory (that’s how it can fit large amounts of data in RAM). So when it loads a file, it compresses it into the internal format, which takes some time.

The QVD format and the future native format are already compressed. Therefore, they can be loaded much faster than a non-compressed data file.

Using big CSV files as the main data store is typically not a good idea because they always need to be either converted into something or parsed, which hurts performance. The better approach is usually either to load them into a database or to convert them to a file format suitable for fast loading.
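(As an illustration of the "load it into a database once" route, not an EasyMorph feature, here is a rough Python sketch; the file, database, and table names are made up, and it assumes pandas plus the standard sqlite3 module. The point is that later runs query only the rows they need instead of re-parsing the whole CSV.)

```python
import sqlite3
import pandas as pd

# One-time, chunked import of the big CSV into a local SQLite database.
conn = sqlite3.connect("monthly_data.db")
for i, chunk in enumerate(pd.read_csv("big_file.csv", chunksize=500_000)):
    chunk.to_sql("monthly_data", conn,
                 if_exists="replace" if i == 0 else "append", index=False)

# Later runs fetch only what they need instead of re-parsing the CSV.
subset = pd.read_sql_query("SELECT * FROM monthly_data LIMIT 10", conn)
print(subset)
conn.close()
```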

Hi Dmitry,

Are there any other workarounds to load our CSVs faster or to convert them quickly to the native format?
Do we always have to read in the CSV first before converting, or is there a workaround?

When you go above 1 million records, it can take some time to load a CSV. Are there any improvements planned to speed this up? Loading via the native format only makes sense when the conversion is fast and when you have to load the dataset multiple times.

Also, the sample feature that limits the number of lines when loading a CSV would be better placed on the main screen of the transformation rather than in the popup, because one can easily forget that a limit on the number of records was set at some stage.

Kind regards!

No workarounds exist. A CSV has to be read entirely in order to be compressed and saved in the native format.

You can try splitting a big CSV into smaller files using the "Split delimited file" action. Then load multiple smaller CSVs using the "Load list of files" mode. But overall, the current performance of loading CSV files won't improve any time soon.
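(If you want to see the splitting idea spelled out in code, here is a rough Python sketch, not the EasyMorph action itself; the chunk size and file names are placeholders.)

```python
import csv

CHUNK_ROWS = 500_000  # rows per output file (placeholder value)

def write_part(part: int, header: list, rows: list) -> None:
    """Write one chunk to its own CSV, repeating the original header."""
    with open(f"big_file_part{part}.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

with open("big_file.csv", newline="", encoding="utf-8") as src:
    reader = csv.reader(src)
    header = next(reader)
    part, rows = 0, []
    for row in reader:
        rows.append(row)
        if len(rows) >= CHUNK_ROWS:
            write_part(part, header, rows)
            part, rows = part + 1, []
    if rows:  # flush the last, partially filled chunk
        write_part(part, header, rows)
```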

Hi Dmitry,

Would it be possible to invest some effort in optimizing the load from CSV? We work a lot with big CSV files.
It would be nice if the load time could be reduced…
These are CSVs that are processed maybe one or two times a year, so converting them to the native format isn’t really worthwhile.
Any new suggestions to deal with that ?

Kind regards

If you process them 1 or 2 times a year then I guess it’s not such a big problem from a practical perspective.

Good news! In version 4 loading CSV files should be much faster.

I tested loading a 100MB CSV file and in version 4 (beta .35) it was loaded 4 times faster than in v3.9.x.

Great news!!!