Data catalog features


I read the page about the data catalog.
Very interesting!

  • Am I right that the data catalog will also act as a data portal through which users/machines can upload/download data files?
  • Is there a demo page or video where we can see what this will look like?
  • Will it be possible to show the descriptions of fields/tables from databases in the catalog? Currently we add field descriptions in our Postgres databases.
  • Can the data catalog be hosted on premises?
  • How will access rights be controlled? Will there be an option to control access using OpenID Connect? We have a central access and identity management system, so we would like to onboard the data catalog onto that system.


Technically, the Data Catalog is a feature of EasyMorph Server, which is installed on-premises. Depending on customers’ needs, it may later gain a public-facing, read-only internet portal provided through the EasyMorph Gateway, but not in early versions.

Not yet. But it would be great to arrange a call with EasyMorph customers and discuss UI sketches. We’ll try to arrange it in the next month or so.

The descriptions of fields will be prepopulated from the column metadata of the project’s result table, so it will be trivially easy to show the descriptions of database fields in the Data Catalog. It’s currently unclear whether database table descriptions can be used as well; probably not. But any description can always be entered and edited manually.

We’ll also provide a way to update descriptions programmatically with a special action, although that may come in later versions, not initially.
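On the database side, such an automated update could start from Postgres column comments, which are queryable from the system catalogs. A minimal sketch: the SQL is standard Postgres, but the shaping step, and the shape the future action will expect, are assumptions:

```python
# Sketch: pull column descriptions (added with COMMENT ON COLUMN) from
# Postgres and shape them into table/column/description records that a
# future "Data Catalog Command" action could consume. The driver call is
# left out, so the shaping step runs on example rows here.

COLUMN_COMMENTS_SQL = """
SELECT c.table_schema,
       c.table_name,
       c.column_name,
       pg_catalog.col_description(
           format('%I.%I', c.table_schema, c.table_name)::regclass::oid,
           c.ordinal_position) AS description
FROM information_schema.columns AS c
WHERE c.table_schema = 'public'
ORDER BY c.table_name, c.ordinal_position;
"""

def shape_descriptions(rows):
    """Turn (schema, table, column, description) rows into a
    {(table, column): description} map, skipping undocumented columns."""
    return {
        (table, column): description
        for _schema, table, column, description in rows
        if description  # col_description() returns NULL when there is no comment
    }

sample = [
    ("public", "customers", "id", "Surrogate key"),
    ("public", "customers", "email", None),  # no COMMENT ON COLUMN
]
print(shape_descriptions(sample))  # {('customers', 'id'): 'Surrogate key'}
```

The same query could of course be run by an EasyMorph workflow itself once the programmatic-update action exists.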

Access to the Data Catalog will be managed through Server spaces, just as it’s done currently. A catalog is basically part of a Server space, just like Tasks or Files. We’re open to the idea of adding other authentication providers (such as OpenID Connect or SAML) for the Server besides Active Directory, but that’s a separate issue.

Hi Dmitry,

Thanks for the info!

  • Could you also answer the question about the data portal functionality of the data catalog? We are looking for an application through which internal and external users can upload files to EasyMorph Server in a secure way, i.e. only having access to the folders they may see (using our authentication provider service). Or can we use the current upload manager in that way, and is there an easy way for machines to upload files to the folders EasyMorph Server uses in ETL flows? All info about simple solutions that can be put in place rapidly is welcome.

  • Do I understand correctly that we could update the descriptions in the catalog in an automated way by running some sort of query on our database to pull all descriptions from it and then loading them into the data catalog through EasyMorph?


No, the data catalog is not a cloud-based file collection/sharing service. It doesn’t operate on files; it provides datasets that are produced dynamically, on the fly.

If you need a simple way to collect files from internal and external users, why not just let them upload the files into a cloud storage folder in Google Drive, Amazon S3, or OneDrive (will be supported in v5.0.1) and then collect them with a scheduled task in EasyMorph Server?

I guess it’s better to discuss it in a separate topic.

Ok thanks. Unfortunately for privacy reasons we cannot use the cloud.

Well, then you may try having a separate EasyMorph Server installation in a DMZ for file uploading/downloading. EasyMorph Server uses the HTTP.SYS web server that IIS is based on, so it’s rather robust security-wise.
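For machines, uploading to such a DMZ installation boils down to an authenticated HTTP request. A rough stdlib-only sketch; the endpoint path and the auth header here are placeholders, not EasyMorph Server’s actual API:

```python
import urllib.request

def build_upload_request(base_url, space, remote_path, data, api_key):
    """Build (but don't send) a POST request that uploads raw file bytes
    to a hypothetical per-space upload endpoint. The URL scheme and the
    Bearer-token auth are placeholders for whatever the real server
    API defines."""
    url = f"{base_url}/api/spaces/{space}/files/{remote_path}"
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # placeholder auth scheme
            "Content-Type": "application/octet-stream",
        },
    )

req = build_upload_request(
    "https://dmz.example.com", "uploads", "incoming/report.csv",
    b"a,b\n1,2\n", "secret-token")
# urllib.request.urlopen(req) would actually send it
print(req.full_url, req.get_method())
```

The point is only that any scripting runtime on the client machine can push files this way, subject to whatever authentication the DMZ server enforces.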

A few UI sketches of the Data Catalog:

Data Catalog Start page in EasyMorph Desktop

What you can see here:

  • Hierarchical, folder-like categories in which data catalog items (entities) are stored.
  • A list of catalog items
  • In each item: its name, a short description, an indicator of whether it returns a table or a file, a certification indicator, and an icon (from a predefined collection of icons)
  • Favorite and recently accessed items
  • A search field for searching across categories by entity name, description, field, or field description

Item details

The “Item details” dialog is shown when the “More details” button is clicked in an item.

What you can see here:

  • Item name
  • Item short description
  • Item certification status
  • Long description
  • A linked project (on Server)
  • A list of related items
  • Field metadata. For each field: name, description, data type, data role (dimension, measure, attribute).
  • Sample data
  • A button to initiate retrieving data

Retrieving data

This is the same dialog as above, just with new content.

What you can see here:

  • Parameter values with which the linked project will be executed
  • The Start button

When running:

  • A log with statuses (similar to what you see in Launcher)

When finished

  • Button “Open” to open the table in the Analysis View (only if the result is a table). The Analysis View is pretty much the same as you can see in version 5.0. It has filtering and charts.
  • Button “Save as” to save into a file

Data Catalog Item editor

The editor will open by pressing a toolbar button in the Project menu. It is used for creating/editing entity metadata and linking it to the current workflow.

What you can see here:

  • Entity metadata: category, name, descriptions, result type (table or file)
  • Field metadata: name, description, data type, role
  • Button “Edit workflow” - returns to workflow editing
  • Button “Publish to Data Catalog” - publishes this metadata and workflow as a Data Catalog item available in the catalog start screen described above in the post.


  • A data catalog item can return either a table produced by an EasyMorph workflow or a file (e.g. a PDF file) retrieved by an EasyMorph workflow from somewhere else.
  • The “Import from Data Catalog” action will allow importing datasets and fetching files from the Data Catalog in EasyMorph workflows.
  • A special action (Data Catalog Command) will allow creating and updating data catalog items programmatically from EasyMorph workflows. It won’t be available in the 1st release though.
  • Accessing the Data Catalog will also be possible via the EasyMorph Server UI in a browser (only in the Enterprise Edition because it would require AD authentication).

Later this month we will arrange a call to discuss Data Catalog features. Anyone interested will be able to participate. The date and time will be announced later.


Wow! Your UI is really impressive!

There will be one thing missing, but it’s not specific to the data catalog: possible values for parameters. This is needed for tasks and will be needed for the data catalog: we would need a feature to specify dynamic values for each parameter. If it’s too difficult to specify a SQL query or a catalog item, maybe you could simply allow linking each parameter to a dataset on the server. That way we could handle feeding the dataset on our side with periodic tasks.

This whole data catalog feature is really powerful and will set it apart from competing tools.

One simple feature to add, if you can, is the selection of a subset of the output fields. Users will of course want only some of the fields, but it would also let you implement fine-tuned queries as in Business Objects or Denodo: you can skip joins that are not necessary to produce the requested output. The requested fields could be passed as the “input” of the data catalog project.
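The join-skipping idea can be sketched in a few lines: a join that contributes no requested field (and isn’t needed for filtering) can simply be dropped. Illustrative logic only, not how EasyMorph’s engine actually works:

```python
def prune_joins(requested_fields, joins):
    """Keep only the joins that contribute at least one requested field.
    Each join is (name, fields_it_provides). Purely illustrative: a real
    optimizer must also keep joins used in filters or joins that change
    row cardinality."""
    requested = set(requested_fields)
    return [name for name, provided in joins
            if requested & set(provided)]

joins = [
    ("orders",    ["order_id", "amount"]),
    ("customers", ["customer_name", "segment"]),
    ("regions",   ["region_name"]),
]
print(prune_joins(["order_id", "customer_name"], joins))
# -> ['orders', 'customers']  ("regions" contributes nothing requested)
```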

I’m thinking of a great feature enabled by your solution: row-level security! At the very least you can get the workspace inside the project and check it against an authorization table. That’s another great and expected feature worth mentioning explicitly in the advertising.

Hi Dmitry. Many thanks for the mockups! They clearly convey the idea, and show each one of the features you listed in the previous thread (here).

The one thing I’m missing in the mockups (if I’m not mistaken) is bi-directional support, or more exactly the input, manipulation, and final write-back of data into the data sources. Looking at the mockups, the management of dataset metadata, along with the metadata of its underlying fields, is clearly illustrated. In addition, however, I expected a kind of grid-like interface for managing the actual dataset data. Would the data catalog allow for row-level data management, or would the focus lie on the dataset and field levels?

Also, I’m curious what the ‘dependencies’ tab in mockup #2 would look like.

Anyway, count me in for the upcoming call :wink:

Excellent point! This will be possible to do with the “Multiple choice” parameter type and the “Select by lookup” action.

To copy all column headers into a list of values:

  1. Select all columns (click the first column and Shift+click the last column)
  2. Right-click the last header name and select “Copy headers”
  3. In the “Multiple choice” parameter, press “Paste values”.

We will be adding a new system function in version 5.1:


It returns the Windows identity of the currently logged-in user in EasyMorph Server (AD spaces only) or in EasyMorph Desktop. For non-AD spaces, for tasks triggered by the scheduler or via an API, and in the CLW, it returns nothing.

The function will enable row-level security.
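Assuming the function returns a Windows account name such as `CORP\jdoe`, row-level security inside a workflow reduces to filtering the dataset against an authorization table. A toy sketch of that logic; the table shapes are invented, and in EasyMorph this would be a filter/lookup step rather than code:

```python
def filter_rows(rows, auth_table, user):
    """Keep only the rows whose 'region' the current user is authorized
    to see. auth_table maps user -> set of allowed regions. Shapes are
    illustrative; an unknown user gets no rows at all."""
    allowed = auth_table.get(user, set())
    return [r for r in rows if r["region"] in allowed]

auth = {r"CORP\jdoe": {"EMEA"}, r"CORP\asmith": {"EMEA", "APAC"}}
data = [
    {"region": "EMEA", "amount": 10},
    {"region": "APAC", "amount": 20},
]
print(filter_rows(data, auth, r"CORP\jdoe"))  # only the EMEA row
```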


The bi-directional support will be implemented in later stages by adding an action for exporting a tabular dataset (or a file) into the Data Catalog. Deleting/updating rows in a Data Catalog item will also be possible by specifying a key field. Every write operation (insert, update, delete) will have a different starting module in the underlying project and will require the respective module to use the “Input” action.

A grid-like editor might come at some point; I won’t rule that out. However, any ideas about it are at a very early stage, so I can’t say much about it.

About the Data Catalog Command: can you please plan for the possibility to enrich the documentation of data catalog items and fields in the most automated way possible? At Natixis we use Zeena for documentation, and they will provide an API. So if I can retrieve this documentation to automatically document EasyMorph data catalog items, it would be awesome.

The Data Catalog Command action will do the following:

  • Create/update item’s short description, long description
  • Rename item
  • Create/Update/Delete/Rename item fields
  • Create/Update item field descriptions
  • Create a new item using the specified project as a template

So as long as you can pull metadata from any system (e.g. a database, or data governance application) you will be able to update Data Catalog item descriptions and metadata in a fully automated way.
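The glue between a documentation API and the future action could be a simple mapping step. The JSON shape and operation names below are hypothetical, since the action’s actual interface isn’t defined yet:

```python
# Sketch: turn field metadata pulled from an external documentation API
# into a list of update operations matching the capabilities listed
# above. The input/output shapes are invented for illustration.

def to_catalog_updates(item_name, api_fields):
    """api_fields: list of {"name": ..., "description": ...} dicts, as a
    documentation API might return them. Produces one 'update field
    description' operation per documented field; fields with an empty
    description are skipped."""
    return [
        {"op": "update_field_description",
         "item": item_name,
         "field": f["name"],
         "description": f["description"]}
        for f in api_fields if f.get("description")
    ]

api_payload = [
    {"name": "customer_id", "description": "Unique customer key"},
    {"name": "scratch_col", "description": ""},  # undocumented: skipped
]
print(to_catalog_updates("Customers", api_payload))
```

A workflow would then feed each operation to the Data Catalog Command action, however its parameters end up being named.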

In the future, the Data Catalog Command action will be extended to update information about item dependencies, so it will be possible to add and chain additional dependencies to the dependencies graph.

Nice. We would like a different mechanism than the “EasyMorph Server Command” action, which is not usable for us because it has a fixed drop-down list of tasks and cannot be parameterized. We would like a fully parameterized action with the possibility to specify an item by text or ID, possibly using a parameter there.

It would also be good if the EasyMorph Server Command action could take the task name from text or a parameter. I suppose that’s easy to do.

We keep working on the Data Catalog. Here is an updated, high-fidelity mockup of the Catalog’s UI.


I like the style!

I can see different statuses on the screen: datasets loading, loaded, or not yet run. What is also important is the parameters used to generate them. It would be good to find a way to show the parameters used to generate something. For example, Customers has the “Save” button, so I guess it has already run, but it would be good to know with which parameters it ran, maybe via a tooltip, a dedicated space on the screen, or simply an information button near “Save”. An information button would be nice. As I understand the screen, a colleague would not have to run the same dataset twice; he would see that someone has already run it? But he would have to check with which parameters it was run before saving it on his side.

I wonder what these “folders” on the right of the screen are. Can we create/delete folders dedicated to data catalog items? I think it can at least help to categorize things.

One nice option would be to change the color of the rectangles depending on the item; it would be a property to set on each item. Also, why not introduce tags (or an equivalent) to categorize items? It would be a bit richer than categorizing by folders.

If several datasets were run, why not add a general “Save” button to save them all with generic names?

Finally, have you thought of sharing a dataset link? I mean, there is an interesting dataset in a folder and you want to point someone to it. Why not provide a kind of link that leads directly to it?

These are only suggestions; the overall look seems great!

It will be possible to see parameter values. The “Bring up” button will open a dialog with more details about the running task.

Each user of an AD space will only see his/her own items running. If two people retrieve the same item it will be executed independently twice. Later, we will add caching to avoid re-computing datasets when possible.

Catalog items will be organized in a folder-like structure of directories. Each directory can have subdirectories as well as catalog items. So it’s very much like folders and files.

Let us release the first version of the Data Catalog. We will think about bells and whistles after that :slight_smile:

Yes, we discussed that with the team and we will be able to generate a shareable link (URL) for any item. A shareable link can be opened in a browser, or pasted into the search box in Desktop.

The Data Catalog will be a powerful application with many capabilities. In the initial release we will lay a foundation and introduce the most essential functionality, and we will keep adding more features in the following releases.

In the initial release, there will be 6 types of catalog items:

| Item type | Description | Possible actions (Desktop) |
| --- | --- | --- |
| Computed dataset | A dataset that is computed on the fly using a workflow. | Open in the Analysis View. Save to disk as a file. Import in a workflow. Send to a sandbox in a workflow. |
| Static dataset | A .dset file that is stored on the Server. | Open in the Analysis View. Save to disk as a file. Import in a workflow. Send to a sandbox in a workflow. |
| Computed file | A file (e.g. PDF) that is stored on the Server and whose path is computed on the fly using a workflow. | Save to a local folder on the user’s computer. |
| Static file | A file (e.g. PDF) that is stored on the Server. | Save to a local folder on the user’s computer. |
| Computed URL | A URL (e.g. for a Power BI report) that is computed on the fly using a workflow. | Open in a web browser. |
| Static URL | A constant URL (e.g. for a Power BI report) entered manually. | Open in a web browser. |


Any kind of facet-based filters to select the datasets? Or maybe a keyword extraction list? A search engine?


Hi Christophe,

Not sure I understand your question. Can you elaborate?