EasyMorph data format

Hi Dmitry,

Could you tell us more about the EasyMorph data format planned for release with version 3.7?

Regards

Hi Christophe,

The data format is a binary compressed columnar format native to EasyMorph. Key features:

  • High-speed export from / import to EasyMorph
  • Columns stored independently. Columns are of 2 types: vocabulary compressed, and constant (i.e. having a single value only). A compressed column has a vocabulary and a vector of indexes. Note that unlike QVD each column has its own vector.
  • Data types: text, 128-bit decimal, boolean, null. One column can contain mixed data types.
  • Open format specification, plus the source code of read/write drivers in major languages (C#, Java, Python) on GitHub.
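
To make the two column types concrete, here is a minimal Python sketch of vocabulary compression with a constant-column special case. It only illustrates the idea described above and is not the actual EasyMorph format or driver code: the names `ConstantColumn`, `VocabularyColumn`, `encode_column` and `decode_column` are hypothetical, and `decimal.Decimal` merely stands in for the 128-bit decimal type.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, List, Union


@dataclass
class ConstantColumn:
    """Hypothetical: a column where every row holds the same value."""
    value: Any
    length: int


@dataclass
class VocabularyColumn:
    """Hypothetical: unique values plus one index per row into them."""
    vocabulary: List[Any]
    indexes: List[int]


def encode_column(values: List[Any]) -> Union[ConstantColumn, VocabularyColumn]:
    vocabulary: List[Any] = []
    positions = {}  # (type, value) -> position in the vocabulary
    indexes: List[int] = []
    for v in values:
        # Key on the type too, so that True and Decimal(1) stay distinct
        # in a column with mixed data types.
        key = (type(v), v)
        if key not in positions:
            positions[key] = len(vocabulary)
            vocabulary.append(v)
        indexes.append(positions[key])
    if len(vocabulary) == 1:
        return ConstantColumn(vocabulary[0], len(values))
    return VocabularyColumn(vocabulary, indexes)


def decode_column(column: Union[ConstantColumn, VocabularyColumn]) -> List[Any]:
    if isinstance(column, ConstantColumn):
        return [column.value] * column.length
    return [column.vocabulary[i] for i in column.indexes]


# Each column is encoded independently and has its own index vector.
mixed = encode_column(["a", Decimal("1.5"), True, None, "a"])
same = encode_column(["x", "x", "x"])
```

In this sketch a repeated value costs only one small integer per row, which is where the compression comes from when a column has few distinct values.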

The binary data will be stored in a binary container file with hierarchical metadata headers. One binary container file will allow mixing isolated binary blobs of several types. For instance one file can store EasyMorph native datasets and PNG images. This would allow storing several tables in one file. It would also allow moving all binary data out of .morph files so that .morphs stay friendly to version control systems (e.g. git or svn).
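
As a rough illustration of the container idea, the Python sketch below packs several isolated blobs of different types into one byte string behind a metadata header. The layout is entirely an assumption made for the example (a 4-byte length prefix, a flat JSON header, and the hypothetical helpers `pack_container` and `read_blob`); the real container uses hierarchical headers and its specification is not shown here.

```python
import json
import struct
from typing import List, Tuple


def pack_container(blobs: List[Tuple[str, str, bytes]]) -> bytes:
    """Pack (name, blob_type, data) triples into one container.

    Assumed layout: 4-byte little-endian header length, JSON header,
    then the blob payloads back to back.
    """
    entries = []
    offset = 0
    for name, blob_type, data in blobs:
        entries.append(
            {"name": name, "type": blob_type, "offset": offset, "length": len(data)}
        )
        offset += len(data)
    header = json.dumps({"blobs": entries}).encode("utf-8")
    payload = b"".join(data for _, _, data in blobs)
    return struct.pack("<I", len(header)) + header + payload


def read_blob(container: bytes, name: str) -> Tuple[str, bytes]:
    """Return (blob_type, data) for one blob without touching the others."""
    (header_len,) = struct.unpack_from("<I", container, 0)
    header = json.loads(container[4:4 + header_len].decode("utf-8"))
    body_start = 4 + header_len
    for entry in header["blobs"]:
        if entry["name"] == name:
            start = body_start + entry["offset"]
            return entry["type"], container[start:start + entry["length"]]
    raise KeyError(name)


# Two tables and an image stored side by side in one container.
container = pack_container([
    ("sales", "dataset", b"\x01\x02\x03"),
    ("rates", "dataset", b"\x04\x05"),
    ("logo", "png", b"\x89PNG"),
])
```

Because the header records a type, offset and length for each blob, a reader can pick out one table and skip everything else in the file.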

The specification for the binary container files will also be open.

How does all that look to you? Your feedback would be appreciated.

PS. We’re slightly reshuffling our roadmap. The data format may be moved to 4.0 (Q1-2018).

Sounds great.
Currently I use the QVD file format a lot to store various datasets that I reuse (weather data, historical exchange rates, data from the International Monetary Fund…).
Migrating to EasyMorph could be great, but then an extractor for the metadata of QVD files would be useful.
It’s not that easy to extract the table name from a QVD file, for example.
A few questions:
Will these files be searchable?

It's a compressed data format, so it won't be searchable.

@dgudkov I’m curious about the new EM data format! Will there be an option for incremental loading?

Idea: based on a date column, a daily load would only update the changed rows and insert the new ones from the source table. Maybe there is already an easier approach available, or will this be part of the new data format?
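
To show what I mean, here is a minimal Python sketch of such an incremental load. It is not an EasyMorph feature: the function `incremental_update`, the key column and the row layout are all made up for the example, and it assumes every row has a unique key and a date column.

```python
from datetime import date
from typing import Any, Dict, List

Row = Dict[str, Any]


def incremental_update(
    stored: List[Row], source: List[Row], key: str, date_col: str, since: date
) -> List[Row]:
    """Apply to `stored` only the source rows dated on or after `since`.

    Rows whose key already exists are replaced; unseen keys are appended.
    """
    merged = {row[key]: row for row in stored}
    for row in source:
        if row[date_col] >= since:
            merged[row[key]] = row
    return list(merged.values())


stored = [
    {"id": 1, "day": date(2017, 1, 1), "rate": 10},
    {"id": 2, "day": date(2017, 1, 1), "rate": 20},
]
source = [
    {"id": 1, "day": date(2017, 1, 1), "rate": 10},  # unchanged, skipped
    {"id": 2, "day": date(2017, 1, 2), "rate": 21},  # changed, updated
    {"id": 3, "day": date(2017, 1, 2), "rate": 30},  # new, inserted
]
result = incremental_update(stored, source, "id", "day", date(2017, 1, 2))
```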

We're starting with a simple file store – i.e. a table can be exported entirely into a file in the native format, or imported from it.

Eventually, we will be adding some database-like capabilities – i.e. pre-filtering data on load (similar to WHERE clause in SQL), appending, and even joining tables. These operations will be performed on the Server side in cases when data is retrieved from EasyMorph Server. So basically, EasyMorph Server will become a lightweight and simple shared data store (with an ODBC driver).

The ability to update tables in a single step hasn’t been discussed yet, but it’s an interesting suggestion.

@dgudkov Thank you for the feedback. I support your step-by-step approach to the EM data format. Looking forward to the performance gains from the native data format and all the related upcoming improvements.

You can try import/export to the native data format in beta 3.9 which is already available on our download page.

So far, my impression is that, in terms of data compression, the native format is on par with Qlik QVDs and Tableau TDEs. The difference in file size is typically within a +/-10% range. The load speed is around 5-7 million values per second on my 3-year-old laptop.