I have situations where I have to process very large sas7bdat files that cannot fit in memory on my local machine. For example, I’m looking at a 1.34 GB SAS file that is 60 columns x 8.6 million rows; loading it causes EM to use about 20 GB of memory. I know I’m only interested in a subset of the file (particular columns and rows), but I have to load the entire file before I can filter it.
Would it be possible to have an equivalent of the Split delimited file transformation for file formats other than text? This would probably be complicated by the structures of various file formats.
Alternatively, it would be useful to be able to select particular columns and apply filters before the contents of the entire file are loaded into memory. The filtering would be much slower, but it would be a worthwhile trade-off.
This is sort of a medium-data case where the source data is too large to process entirely in memory, but the actual data of interest is a much smaller subset.
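As a workaround in the meantime, this filter-while-streaming pattern can be done outside EM in Python: pandas’ `read_sas` accepts a `chunksize` argument and yields DataFrame chunks, so only one chunk is ever in memory. The sketch below uses an in-memory CSV as a stand-in for the large file (the file name, column names, and filter condition are all made up for illustration); for a real file you would swap the `read_csv` call for `pd.read_sas("big.sas7bdat", chunksize=100_000)` and iterate the same way.

```python
import io
import pandas as pd

# Hypothetical stand-in data; with a real SAS file you would instead use
# pd.read_sas("big.sas7bdat", chunksize=100_000), which yields the same
# kind of DataFrame chunks as read_csv with chunksize does here.
csv_data = io.StringIO(
    "id,region,sales\n"
    "1,east,100\n"
    "2,west,250\n"
    "3,east,300\n"
    "4,north,50\n"
)

wanted_cols = ["id", "sales"]  # the columns of interest
kept_chunks = []

# Stream the file chunk by chunk so memory use stays bounded by the
# chunk size, not the full file size.
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Filter rows first, then keep only the columns of interest.
    filtered = chunk[chunk["region"] == "east"][wanted_cols]
    kept_chunks.append(filtered)

# Only the (much smaller) subset is ever materialised in full.
result = pd.concat(kept_chunks, ignore_index=True)
print(result)
```

It is slower than a straight load, as noted above, but the peak memory footprint is the chunk size plus the filtered subset rather than the whole 20 GB.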