
Required RAM for large dataset


#1

Hi,

The EasyMorph website says:

  • It is advised to have >16GB of RAM if your typical datasets exceed 1 billion data points (i.e. rows x columns).

I am working on a dataset of 1.7 million records and about 80 columns in a rather complex ETL workflow (many transformations, maybe more than 100).
RAM usage appears to be about 65% (on a 16 GB machine), and the workflow takes more than 5 minutes to run in Desktop. This is annoying when a recalculation is needed because I inserted a transformation at the beginning of the flow.
Do I need more RAM to speed this up? How can I make it more efficient?
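
As a rough check of these figures against the guideline quoted above, here is a minimal sketch of the arithmetic (the variable names are only illustrative, and the column count is the approximate value given in this post):

```python
# Figures taken from this post; "columns" is approximate ("about 80").
rows = 1_700_000
columns = 80

# Threshold from the quoted guideline: 1 billion data points (rows x columns).
guideline = 1_000_000_000

data_points = rows * columns
print(f"{data_points:,} data points")
print(f"{data_points / guideline:.1%} of the guideline threshold")
```

That works out to about 136 million data points, well under the 1 billion mark, so by that guideline alone the dataset size would not seem to call for more than 16 GB.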

Thanks,
Nikolaas


#2

You can temporarily limit the number of rows you work with by inserting the “Trim table” action at the beginning of the workflow.


#3

I tested reading the CSV (about 6.1 GB file size) on a machine with 32 GB of RAM, and it ran out of memory. Would 64 GB be enough?


#4

I can’t tell whether 64 GB is enough because it strongly depends on the kind of data in the CSV file – data cardinality (uniqueness), the share of text values, the lengths of text values, and some other factors that influence the compression rate and memory consumption.

Generally, it’s not recommended to load such big CSV files at once. Such big datasets can almost always be partitioned – e.g. by date, by customer, or by region. The “Split delimited file” action is intended for exactly such cases. It splits a big CSV file into many small ones, which are more convenient to process.
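
To illustrate the general idea outside EasyMorph, here is a minimal Python sketch that streams a large CSV and writes it out as several smaller files without loading the whole file into memory. It is not how the “Split delimited file” action is implemented. The function name `split_csv`, the `part_00000.csv` naming scheme, and the default chunk size are assumptions made for this example, and it splits by row count rather than by a business key such as date or region.

```python
import csv
from pathlib import Path


def split_csv(src, out_dir, rows_per_file=100_000):
    """Split a CSV into smaller files, repeating the header in each part.

    Hypothetical helper for illustration: rows are streamed one at a time,
    so memory use stays small regardless of the input size.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []

    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return paths

        out = None
        writer = None
        count = 0
        for row in reader:
            # Start a new part at the first row and whenever the current one is full.
            if out is None or count >= rows_per_file:
                if out is not None:
                    out.close()
                path = out_dir / f"part_{len(paths):05d}.csv"
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                paths.append(path)
                count = 0
            writer.writerow(row)
            count += 1

        if out is not None:
            out.close()

    return paths
```

Each resulting part can then be processed on its own, for example one file per workflow run, and the results combined afterwards.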