Open discussion about EasyMorph v4: modules, column bar, lineage

dgudkov · April 9, 2019, 4:14pm

Version 4 will deliver many exciting changes to EasyMorph. Here I’m going to explain a few major features coming in v4 and would like to hear your opinion on them. This is to make sure that we don’t miss anything important and maybe hear some new bright ideas from the EasyMorph user community.

Here is what ver.4 is going to look like (conceptual drawing, actual look may be different):

As you can see, there are a lot of changes here. Let’s go through the most important ones:

Modules

Modules (previously known as blocks) eliminate the need to have a separate project for iterations. The ability to iterate an external project will remain, but it won’t be strictly necessary anymore. In the vast majority of cases iterations will be arranged using modules.

A module is very much what you used to think of as a project – it has tabs with tables and charts, actions in tables, parameters, and an optionally marked default result table. The only difference between a module and the old understanding of a project is that modules don’t have embedded connectors.

A project will consist of one or more modules. Embedded connectors are stored in the project (as they are currently) and are available to all modules in the project, as are shared connectors. You can switch between modules using the new right sidebar (can be seen in the screenshot above). The sidebar is collapsible, so it can be hidden when not needed. Switching between modules can also be done with the hotkeys Ctrl+PgUp and Ctrl+PgDown.

All actions that previously required an external project (Call, Iterate, Iterate table) will be able to use modules too (you can see it in the left sidebar in the screenshot above). Using a module will be the default choice in them. It will also be possible to specify a module name using a parameter.

Therefore, instead of one project iterating/calling another project, one module will iterate/call another module in the same project. Any module can call/iterate any module in the same project as long as it doesn’t create a cyclical dependency.
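
To illustrate the cyclical-dependency rule, here is a minimal Python sketch (not EasyMorph code, and not necessarily how EasyMorph checks it). It assumes a hypothetical project description: a dictionary that maps each module name to the modules it calls or iterates. The function name `find_cycle` and the module names are made up for the example.

```python
def find_cycle(calls):
    """Return a list of module names forming a call cycle, or None.

    `calls` is a hypothetical project description: it maps each module
    name to the list of modules it calls or iterates.
    """
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited, on current chain, done
    state = {name: WHITE for name in calls}
    path = []                         # the call chain being explored

    def visit(name):
        state[name] = GRAY
        path.append(name)
        for callee in calls.get(name, []):
            if state.get(callee, WHITE) == GRAY:
                # The callee is already on the current chain: a cycle.
                return path[path.index(callee):] + [callee]
            if state.get(callee, WHITE) == WHITE:
                found = visit(callee)
                if found:
                    return found
        path.pop()
        state[name] = BLACK
        return None

    for name in list(calls):
        if state[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None
```

For example, a project where "Main" calls "Load", "Load" calls "Clean", and "Clean" calls "Main" again would be reported as a cycle, while two modules that both call the same third module would not.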

Modules are independent from each other. It’s not possible to reference a table in another module in an action. E.g. merging or appending a table from another module won’t be possible. However, as with projects now, it’s always possible to pass a dataset from one module to another module either in-memory via the Input action, or through an external file (e.g. .dset).

A project always has a default module. It’s marked with a red flag, just like default result tables. When a project is executed by a Server/Launcher task, or by another project, the default module is the starting point.

Modules can be cloned, or copied/pasted to other projects.

Modules bring in many advantages:

No need to create another project(s) for iterations
No need to save iterated project to apply changes
Publishing to Server is simplified because modules eliminate the need to publish related sub-projects
Easier designing/debugging
Elements of prod/dev version control – instead of duplicating a project file, clone its module, make changes, then mark the clone as the default module and publish. If something doesn’t work right, you can roll back the changes by setting the original module as the default one. Basically, you can have multiple versions of a workflow in one project in different modules.
Easier project sharing as even complex workflows can be wrapped into just one project and shared

Column bar

Another big novelty in v4 is the addition of the column bar (can be seen in the screenshot above). The column bar is used for finding columns, changing column format and displaying various metadata about the currently selected column.

The column bar will make editing actions more convenient. Currently, if the user clicks a column while editing an action, the left sidebar switches from the action properties editor to the column format settings. This is inconvenient because it’s frequently necessary to explore a column while editing an action. However, since both column and action properties are currently shown in the same sidebar, it’s not possible to show them at the same time, which forces switching back and forth.

The column bar solves this problem and removes the inconvenience. With it, it will be possible to edit action settings and to search and explore columns and their values at the same time.

Another convenience is the instant indication of column data types, so that you can click a column and immediately see if it contains numbers, text, errors or empty values. Clicking a data type indicator will instantly filter column values to that particular type (by adding the “Filter by type” action). The sum of column values will be shown as well.
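
As a rough illustration of what such an indicator computes, here is a minimal Python sketch. It is not EasyMorph code: the four categories simply follow the ones named above, and the function names `classify`, `type_summary` and `filter_by_type` are made up for the example.

```python
from collections import Counter


def classify(value):
    """Assign a cell value to one of four illustrative categories."""
    if value is None or value == "":
        return "empty"
    if isinstance(value, Exception):
        return "error"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    return "text"


def type_summary(column):
    """Count values per category and sum the numeric ones."""
    counts = Counter(classify(v) for v in column)
    total = sum(v for v in column if classify(v) == "number")
    return dict(counts), total


def filter_by_type(column, wanted):
    """Keep only the values of one category (the click-to-filter idea)."""
    return [v for v in column if classify(v) == wanted]
```

Calling `type_summary` on a mixed column gives the per-type counts and the sum in one pass over the data, and `filter_by_type(column, "text")` plays the role of clicking the text indicator.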

It will be possible to change the formatting of multiple columns at once, as it is currently.

Finally, the column bar will show automatic suggestions. The suggestions will automatically detect common data quality issues such as trailing spaces, or text values that are actually numbers.
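
The two example checks could look roughly like this in Python. This is only a sketch of the idea under simple assumptions, not the actual detection logic: the function names and the suggestion wording are hypothetical, and "looks like a number" is approximated by whether Python's `float()` accepts the text.

```python
def _looks_numeric(text):
    """True if Python's float() can parse the text (a rough approximation)."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def suggest(column):
    """Return a list of illustrative data quality suggestions for a column."""
    suggestions = []
    texts = [v for v in column if isinstance(v, str)]
    if any(v != v.rstrip() for v in texts):
        suggestions.append("trim trailing spaces")
    if any(_looks_numeric(v) for v in texts):
        suggestions.append("convert text to numbers")
    return suggestions
```

Note that `float()` also accepts strings such as "nan" or "inf", so a real check would likely need stricter rules.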

Lineage

EasyMorph will track and display column lineage. Lineage is information about the column’s lifecycle – in which action it was created, in which actions it was modified, how it changed from action to action.

Notice in the screenshot above that some columns have green or blue headers instead of black ones. A green header indicates a newly created column. For instance, all import actions basically create new columns and therefore their resulting datasets will always have green headers.

A blue header indicates a changed column. For instance, columns changed using the “Modify column” action will have blue headers. Merged columns in the “Merge” action will also be shown blue (they are not new because they were created in another table). Renaming a column modifies it, therefore a renamed column will also be shown blue.

It will be possible to see the full lineage chain for each column – starting from the action that created it and all the actions that modified it.
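
One way to picture a lineage chain and the header colors described above is the following Python sketch. It is an illustration only, not EasyMorph's internal model: the `Column` class, the helper functions, and the action names "Import file" and "Rename column" are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    # Ordered (action, event) pairs; event is "created" or "modified".
    lineage: list = field(default_factory=list)


def create(name, action):
    """A column that first appears in the result of `action`."""
    return Column(name, [(action, "created")])


def modify(column, action, new_name=None):
    """The same column after `action` changed (or renamed) it."""
    return Column(new_name or column.name,
                  column.lineage + [(action, "modified")])


def header_color(column, action):
    """Header color of the column in the result of `action`."""
    for act, event in column.lineage:
        if act == action:
            return "green" if event == "created" else "blue"
    return "black"   # the action passed the column through untouched
```

Here a column created by an import, then renamed, then changed by "Modify column" would carry three lineage entries, showing green in the import result, blue in the two results that changed it, and black everywhere else.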

Lineage also changes how column widths and formats are managed. Changing column format/width will automatically change the format/width of this column in results of all actions along the lineage path. This should simplify column formatting and make it more intuitive.

Finally, column lineage opens the possibility to instantly see column metadata – e.g. the data origin (e.g. database table name, file name), the original name in the external system, annotations with explanations of encoded column values (e.g. 0 = male, 1 = female). It also makes it technically possible to automatically populate column metadata from external systems when importing data into EasyMorph.

How does all of it sound to you? Comments, suggestions, questions are welcome.

PS. Version 4 is planned for release in June this year.

adambeltz · April 10, 2019, 2:43am

Hi @dgudkov -
I’m really looking forward to being able to perform iterations within the same project. It will be less confusing to track going forward. I have a lot of NAME_A --> NAME_B --> NAME_C type things in folders.

I’ve often wondered about performance on iterations as well. So maybe off topic, but let’s say I have a table of 2,000 rows that are going to process inventory adjustments. Each row iterates today into another project that actually executes this change (command line CURL) and then updates the database with the new value. If I split this 2,000 into let’s say 5 groupings A,B,C,D, can I realistically have 4 queues of iterations hitting the second project that is processing the data with confidence that the integrity of the information remains intact? The goal is not only to process them accurately but to process them quickly.

dgudkov · April 10, 2019, 3:51am

Hi Adam,

Yes, you can derive 5 tables from the list of 2000 values, filter each one to a different group (A, B, C, D, E) and iterate the second project in each derived table. In this case it will be iterated in 5 parallel queues.

For EasyMorph, executing the same project in parallel queues is not a problem. Every queue would execute a clone of the iterated project, so they are independent from each other and don’t interfere. However, the iterated project should be designed in a way that guarantees safe parallel execution of multiple copies of the project at the same time. For instance, it shouldn’t save data into a file with the same hardcoded name (e.g. temp.csv), because two copies may start writing into temp.csv simultaneously, which may result in a file access conflict or corrupted data.

Another issue is whether the remote systems that are invoked by the iterated project can handle simultaneous requests. E.g. a database may have a limited pool of open connections, or a limited number of simultaneous sessions for the same user account.

Also, parallel execution doesn’t preserve order, so CURL queries with values from groups A,B,C,D,E may end up in an order that is very different from the initial one.

Finally, if the iterated project deals with a large amount of data, then running it in 5 queues would cause a 5-fold increase in memory consumption.

If the iterated project is suitable for simultaneous execution then running it in parallel queues may significantly reduce total run time.
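
Outside EasyMorph, the same pattern can be sketched in Python: split the rows into 5 groups, process each group in its own worker, and give every worker its own uniquely named temporary file instead of a shared temp.csv. This is only an analogy under stated assumptions: `process_group` is a hypothetical stand-in for the iterated project, and the row-writing line is a placeholder for the real work.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def process_group(group_name, rows):
    """Stand-in for one queue iterating the second project over its rows."""
    # A unique temp file per queue avoids the clash that a shared,
    # hardcoded name such as temp.csv could cause.
    fd, path = tempfile.mkstemp(prefix=f"group_{group_name}_", suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow([group_name, row])   # placeholder for real work
    return path


def run_in_parallel(rows, group_names=("A", "B", "C", "D", "E")):
    """Deal the rows into one group per queue and process them concurrently."""
    groups = {name: rows[i::len(group_names)]
              for i, name in enumerate(group_names)}
    with ThreadPoolExecutor(max_workers=len(group_names)) as pool:
        futures = {name: pool.submit(process_group, name, part)
                   for name, part in groups.items()}
        return {name: future.result() for name, future in futures.items()}
```

As in the description above, nothing here guarantees the order in which rows from different groups are processed, and each worker holds its own share of the data.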

Jochen_Marquardt · April 11, 2019, 6:20am

Awesome!
I really like the modules. It will make it much easier than switching between projects.
The column bar seems to be another great feature.
Great job! I really like it.

michel.baldellon · April 11, 2019, 2:01pm

Hi Dmitry,
Very good functionalities! Looking forward to seeing that…
I would have appreciated the propagation of the new name after a Rename transformation.

RJO · April 24, 2019, 12:43pm

We really like these new features. The small paragraph on the metadata import feature also has its effect. Adding sub-project files that are not reusable is quite painful.

We are still waiting for 2 major features that would make EasyMorph grow quickly: an auto group-by feature to make big data tables queryable, and data quality on columns other than numeric ones (which you can find in Tableau, for example).

Thank you for all your great work!

dgudkov · April 24, 2019, 7:19pm

Thank you for the feedback, Romain.

This is coming in v4.1, right after we release 4.0. Queries will be significantly improved and will support grouping and aggregation, and later, joins and unions.

dgudkov · April 27, 2019, 5:14pm

We’re still discussing internally what would be the right term to describe the parts of a project that are currently designated as “modules”. Currently the options are: module, routine, or unit.

The term should describe two main characteristics: it’s a sequence of actions, and it can be used as a building block to compose something bigger, a workflow.

What do you think it should be called? Please vote below:

Module
Routine
Unit

cvo · May 3, 2019, 11:11am

Hi Dmitry,
The most impressive thing about EasyMorph is its highly intuitive user interface. It’s so intuitive that there are no “options” or “settings”. But release after release, there are now a lot of features and it’s not that easy to have an overview of the possibilities. Maybe it’s time to introduce an Options tab (between “Report” and “About”?) where we could define some default formats (dates, numbers) and have an overview of all the possible transformations with a short description and the possibility to activate or deactivate them for the current project; e.g. I never export to 1010data or use regular expressions.

Another thing is that I now have hundreds of .morph files spread over directories of projects, customers… A “history” or “bookmark” repository tab would be great: creation date, modification date, last use date, source, export, parameters list and a few automated tags like the top 10 transformations used in the project, the connection used from the connection manager, the number of rows loaded and exported…

Regards

reynsnivea · May 3, 2019, 6:10pm

Hi Dmitry,

Thanks for sharing the new features in the coming major release! There are nice improvements.
In the list below I would like to have some smaller features implemented fast (e.g. zoom in/out, auto-arrange, etc.).

Some suggestions:

Make it possible to create a new action from a series of transformations and save that to a library of custom actions that is then shown as a tab “custom” in the transformations pane. This way, we would not have to call other projects to reuse the same transformation logic over and over again. Now, we sometimes create such a project that can be called via the Call action, but it would be better if we could integrate that into a custom action.
Please create a zoom in/out and overview button (helicopter view). I really would appreciate that…
Please add an auto-arrange feature to arrange the different tables in a clear overview.
Now it is not possible to create extra white space in the workspace when going to the left or right. It is only possible by creating, for example, a new table that will expand the window. Please make it possible that when we push the scroll bar to the right, white space is created.
When the same set of actions is applied to different columns, it would be nice to have some sort of auto-grouping that would bundle all those actions into 1 action. This can occur on wide datasets during profiling. After a while, one forgets that a similar action (e.g. replacement of some value) was already introduced for a particular column. If EasyMorph could detect that and bundle it into 1 action for all columns that need to undergo the same transformation, it could simplify the project.
There already exists a feature to look for where actions are used or where columns are used in an action and then go to that action but I think we cannot see from that window in which tab it is used. This could be added. Also being able to search on a part of the column name would be nice.
Please add recurrent file listing in the Amazon command and some features like downloading the X most recent files from Amazon instead of doing a file listing every time and an iteration to download only a portion of the files in a bucket.
I also thought I had read on the forum or website that when we save an EasyMorph project, the data would be kept in memory so that when we reopen the project later we do not have to wait to recalculate the whole project. I have used the DSET format and other file formats but EasyMorph always imports the datasets first.
Adding sampling features for all file formats at import. For a CSV we can choose to import only the first X rows. For other file formats this seems not to be possible at the moment. Also, random sampling would be nice so that we can develop ETL on a sample and then run it on a large dataset. This could save us some time. The problem is that some datasets are very large. If we could generate a reliable sample to detect all data quality issues with that, we could model on the sample and then execute that on the entire set.
Applying the same formula on multiple columns without having to pivot the entire dataset in order to achieve this.
Adding search boxes for columns in every action.
Removing the burden of always checking the data type when filtering with a condition. It should be as simple as saying column A > 10 and not like now: if( isnumber(column A) , if(column A > 10, , )) …

Kind regards !
Nikolaas

lcaroli · May 7, 2019, 9:58am

Hi Dmitry,
Thanks for sharing the new features in the coming release! There are nice improvements.

Some suggestions for the future:

Output: Excel, possibility to manage the formatting (font, date formatting, etc.)
Output: increase the capabilities of the reporting (character size, font, formatting, etc.)
Output: add a PDF export function

Thanks
Leonardo