Data Catalog: instructions for beta-testers

dgudkov · April 18, 2022, 10:00am

The Data Catalog is a feature of EasyMorph Server. However, for the purpose of beta-testing, you don't have to install EasyMorph Server. Instead, we've made a public demo site that is available at https://demo.datacatalog.com.

Configure the test environment

1) Request a demo account

Send me a DM (direct message) here on this forum to obtain a demo account. The demo account is a Server space (with password protection). Note that in the production version, the Data Catalog will only be available for spaces with Active Directory authentication (for licensing purposes).

Test spaces are isolated from each other using Server workers that run under different Windows accounts. Beta-testers can't access each other's files.

2) Download EasyMorph Desktop v5.2 (BETA).

Here is the download link: Download EasyMorph Desktop v5.2 (BETA)

Trial key valid until May 31st: EasyMorph.zip (you will need it if you install the beta under a different Windows account)

To avoid overlapping with an existing EasyMorph installation, it is recommended to install the beta Desktop under a temporary Windows account on your machine, or on a machine without an EasyMorph installation.

Certain features of the beta version may not work correctly. Don't use it for production data or workflows.

3) Configure Server Link

Configure Server Link to point to demo.datacatalog.com, use HTTPS, port 443.
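
If Desktop can't connect, a quick check outside EasyMorph can confirm that the demo host is reachable on port 443. The snippet below is only a sketch in Python: the helper names server_link_url and can_reach are made up for this illustration and are not part of EasyMorph. Server Link itself is still configured in EasyMorph Desktop.

```python
import socket


def server_link_url(host: str, port: int = 443, use_https: bool = True) -> str:
    """Build the address that the Server Link settings describe (hypothetical helper)."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}"


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


url = server_link_url("demo.datacatalog.com")
```

Calling can_reach("demo.datacatalog.com") only tells you that the port is open from your machine. It does not validate your demo account or the space name.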

Also, configure your Connector Manager to use connectors from your test space:

Data Catalog overview

The purpose of the Data Catalog is to simplify access to various sources of business data and provide means for access audit and governance.

Data Catalog has a hierarchical folder-like structure of directories. Each directory contains catalog items and/or other directories.

Note that catalog directories are NOT subfolders of the Public folder (tab Files). The directories are a separate data structure that is stored in an internal database and not accessible externally.
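
To make the structure more concrete, here is a minimal Python sketch of a directory tree in which each directory holds items and/or other directories. The class names, the walk helper, and the sample names are assumptions made for illustration. The actual internal database layout is not described in this post.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogItem:
    name: str


@dataclass
class Directory:
    name: str
    items: list = field(default_factory=list)           # CatalogItem objects
    subdirectories: list = field(default_factory=list)  # nested Directory objects


def walk(directory, prefix=""):
    """Yield a slash-separated path for every item under `directory`."""
    path = f"{prefix}/{directory.name}" if prefix else directory.name
    for item in directory.items:
        yield f"{path}/{item.name}"
    for sub in directory.subdirectories:
        yield from walk(sub, path)


# Hypothetical sample catalog
root = Directory(
    "Sales",
    items=[CatalogItem("Monthly revenue")],
    subdirectories=[Directory("EMEA", items=[CatalogItem("Orders")])],
)
paths = list(walk(root))
```

Here paths lists the items of a directory first and then descends into its sub-directories.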

Catalog items can be of 3 types:

Datasets
Files
URLs

In turn, each of the item types can be static or computed (using an EasyMorph workflow). Therefore, in total there are 6 item types:

Static EasyMorph datasets - regular .dset files stored in the Public folder of EasyMorph Server.
Computed datasets - EasyMorph datasets computed dynamically on the fly using a published project stored in the Public folder of the Server. The result table of the project is the resulting dataset.
Static file - any file stored in the Public folder of EasyMorph Server. For instance, a PDF file.
Computed file - a file stored in the Public folder. The relative path to the file is computed dynamically on the fly using a published project stored in the Public folder. The first value of the first column is the resulting relative path to the file.
Static URL - any URL (e.g. a hyperlink to a PowerBI report)
Computed URL - a URL that is computed dynamically on the fly using a published project stored in the Public folder. The first value of the first column is the resulting URL.
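
The six types are simply every combination of three kinds and two modes. A small Python sketch (the enum names are illustrative, not EasyMorph identifiers) shows the enumeration:

```python
from enum import Enum
from itertools import product


class Kind(Enum):
    DATASET = "dataset"
    FILE = "file"
    URL = "URL"


class Mode(Enum):
    STATIC = "static"
    COMPUTED = "computed"


# Every kind can be either static or computed: 3 x 2 = 6 item types.
ITEM_TYPES = [f"{mode.value} {kind.value}" for kind, mode in product(Kind, Mode)]
```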

As you can see, computed URLs and file paths use the top (first) value of the first (leftmost) column.

If the result dataset has more columns or rows, they are ignored. The name of the first column doesn't matter.
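
The "first value of the first column" rule can be sketched in Python as follows. The result table is modeled as a plain list of rows, and the function name first_cell is hypothetical. How the Server reacts to an empty result is not stated in this post, so raising an error here is an assumption.

```python
from typing import Any, Sequence


def first_cell(rows: Sequence[Sequence[Any]]) -> Any:
    """Return the top value of the leftmost column; everything else is ignored."""
    if not rows or not rows[0]:
        # Assumption: an empty result is treated as an error.
        raise ValueError("the result table is empty")
    return rows[0][0]


# Hypothetical result table of a project behind a computed URL
result_table = [
    ["https://example.com/report", "ignored column"],
    ["ignored row", "also ignored"],
]
url = first_cell(result_table)
```

Column names are not part of the model at all, which matches the fact that the name of the first column doesn't matter.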

Since computed items (datasets, files, and URLs) use a project to calculate their result, project parameters can be specified before retrieving a computed item. This makes computed catalog items somewhat similar to Server tasks. However, they are not Server tasks. The Catalog is a separate feature of EasyMorph Server.

More item types will become available in later releases.

Working with the Data Catalog in Desktop

EasyMorph Desktop now consists of two integrated applications - Workflow Editor and Data Catalog. You can switch between them in the app bar, introduced in version 5.2 (see below). The Workflow Editor is what EasyMorph Desktop was prior to v5.2. Let's look closer at the Data Catalog part:

The app sidebar (1) is where you switch to the Catalog. It's been introduced in v5.2.

The "Recent computations" sidebar (2) shows the recent computed items. Note that static items aren't displayed here.

Finally, the biggest part of the screen is occupied by the catalog browser (3). Here you can browse directories and sub-directories, create new directories and catalog items, and retrieve items.

Let's look closer at a catalog item:

When you hover over it with the mouse, it reveals two buttons - "More details" and "Retrieve". The former displays a dialog with more information about the item (such as annotation or related items). The latter retrieves the item.

Retrieving items

What happens when you press "Retrieve" depends on the item type:

Retrieving a dataset will download it to your machine and open it in the Dataset Viewer (more on that below).

Retrieving a file will save it to the specified folder on your machine.

Retrieving a URL will open the URL in the default web-browser on your machine.
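
The three behaviors above amount to a dispatch on the item kind. The Python sketch below only describes the action for each kind and does not perform it. The function name and the kind strings are assumptions made for illustration.

```python
def retrieve_action(kind: str) -> str:
    """Describe what "Retrieve" does in Desktop for a given item kind."""
    actions = {
        "dataset": "download and open in the Dataset Viewer",
        "file": "save to the specified local folder",
        "url": "open in the default web browser",
    }
    try:
        return actions[kind.lower()]
    except KeyError:
        raise ValueError(f"unknown item kind: {kind!r}") from None


action = retrieve_action("URL")
```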

Parameters

When retrieving a computed item, you may be prompted to provide parameters first. Note that parameter annotations are displayed as well.

When a catalog item is created/edited, parameters to enter should be explicitly selected from a list of available project parameters. By default, computed items have no parameters.
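
In other words, a user is prompted only for the parameters that the item's author explicitly selected, and for nothing by default. A minimal Python sketch, assuming project parameters are given as a name-to-annotation mapping (the function and the sample parameters are hypothetical):

```python
def parameters_to_prompt(project_parameters, selected=()):
    """Return the parameters (with annotations) a user is asked for on retrieval."""
    unknown = [name for name in selected if name not in project_parameters]
    if unknown:
        # Only parameters that exist in the published project can be selected.
        raise KeyError(f"not project parameters: {unknown}")
    return {name: project_parameters[name] for name in selected}


# Hypothetical parameters of a published project: name -> annotation
project_parameters = {
    "Region": "Sales region to filter by",
    "Year": "Reporting year",
    "Debug": "Internal switch, not meant for catalog users",
}

default_prompt = parameters_to_prompt(project_parameters)
year_prompt = parameters_to_prompt(project_parameters, ["Year"])
```

With no selection the prompt is empty, which mirrors the default of computed items having no parameters.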

Dataset Viewer

The Dataset Viewer displays retrieved datasets, computed or static. In the Viewer you can:

View dataset
See dataset metadata (unique counts, etc.)
Find values
Send dataset to a sandbox (new or existing) in Workflow Editor
Save dataset in a supported file format (csv, xlsx, dset, etc.)
View catalog item details (description, related items, parameters, etc.)
Discard dataset from memory

The Dataset Viewer can keep several datasets open. Since all of them are stored in memory, make sure to discard large datasets when they are no longer needed to avoid running out of memory.

Adding catalog items

Adding catalog items is done from the start screen of the Catalog.

Note that if you are creating a computed item, you will need to create a corresponding EasyMorph project first, and publish it to the Server. Then, when creating a computed item, you can specify the published project in item settings.

Item fields

It is possible to describe fields of each item, be it a dataset or a file (e.g. a PDF file). These fields are searchable - you can find an item by a field. In the initial release, item fields should be created manually. In later releases, we will add tools for automated field creation/editing.
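
As an illustration of what "find an item by a field" could look like, here is a small Python sketch. The case-insensitive substring matching, the function name, and the sample items are assumptions. The post does not say how the Catalog search actually matches field names.

```python
def find_items_by_field(items, query):
    """Return the names of items that have at least one field matching the query."""
    needle = query.casefold()
    return [
        name
        for name, fields in items.items()
        if any(needle in field_name.casefold() for field_name in fields)
    ]


# Hypothetical catalog items with manually described fields
items = {
    "Customers.dset": ["customer_id", "country"],
    "Invoice template.pdf": ["invoice_number", "customer_id"],
    "Price list": ["sku", "price"],
}
matches = find_items_by_field(items, "Customer")
```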

Working with Data Catalog in the web UI

After logging in to EasyMorph Server, the Catalog is available in a new tab unsurprisingly named "Catalog".

From the web UI, it is possible to view item details and retrieve items (computed or not). Note that viewing datasets is not possible from the web UI. Retrieved datasets will be available for downloading and further viewing locally in EasyMorph Desktop.

Retrieving files and opening URLs works as expected.

Journal

Most operations with catalog items are recorded in the Server journal. Therefore, you can see who accessed which item and when.

Beta testers can't access the Server journal of the public demo site. If you would like to test journaling of catalog operations, request a Server installer of v5.2 from our support.

RJO · April 21, 2022, 4:36pm

Here is my first feedback. Overall, it’s working quite well and I understand the logic:

In Desktop, when I click on details on a catalog item, I have a Description tab but I should have a Fields tab and it’s not there. On Server, the tab is there. So field descriptions are not displayed in Desktop; you can only use them to search.
On Server, the recent computations look quite similar to each other. It would be good to add the values of the parameters used, or at least the run date. Same on Desktop. Otherwise we cannot differentiate results, even if I guess that they are sorted by last run date, descending. It’s a bit confusing.
On Server, when I click on Success in the last results, I get an error: Unable to get or render event details: Event not found
On Server, for a computed dataset, I see that I can only download the .dset file. That is really disappointing because I would expect to be able to download it at least as Excel, as in Desktop. I really appreciate this feature and the fact that you don’t need to store the .dset. The thing missing is the output: users would expect more possibilities, otherwise they will always have to use Desktop in this case.
Question: can we disable the last results in Desktop and Server? I mean, disable the storage of last computations? I see that anyone in the workspace can download someone else’s result. So if you intend to apply security based on login, for example, you can’t, because everyone has access to the others’ computations.
I had bugs with file computations. I understand that I have to store the result on the server in this case. I used this path: Dataset\states.csv. On Desktop, if I wanted to save in folder X, I understood that I had to create the Dataset folder under X MYSELF so that it could work. This will not be clear for a user. More than that, it’s not working: I think you are searching for the file to download in the local saving folder defined by the user, but you should rather search on the server, where I computed it. So it’s not finding the right file. On Server, when I wanted to download, I got a red “OK” message but no download.
For datasets in Desktop, when you click to open a dataset, it might take seconds or minutes before it opens if it’s very big. I would suggest displaying a loading animation or something similar; otherwise we don’t really know what is happening.

Edit: the “Morph it” button after dataset upload does nothing for me. I think it’s the only way to store running configurations including parameters. It would be good to have bookmarks, including on the server, as we already discussed.

RJO · April 22, 2022, 3:14pm

It seems I can now download the computed file to my laptop only if I first create the server folder tree where the file is supposed to be. On the server, I still cannot download the file; there is an “OK” message in red and that’s all.

Note that when you click on Retrieve on a computed file, you are directed to the wrong page: it redirects to the “default” workspace instead of my workspace.

Regular users will be a bit surprised by this behaviour. They expect to get their file immediately and instead, they must wait for the catalog task to finish and click on download, which is not so obvious. And as I said in the previous post, there should be at least the time of execution and also the requester. Another idea: filter by default to only see YOUR own executions, and add a filter to see all executions.

Edit: There is something you can implement on Server and Desktop. Imagine you do as described above: only your own last submissions are shown by default, not those of others (there would be a button for that). You can then also add a button to rerun a submission. That would act as a bookmark: a user would be able to rerun the same extract with the same parameters every day.

dgudkov · April 22, 2022, 5:41pm

Thank you very much for the feedback, Romain! A few comments:

The Fields tab will be available in one of the next updates of Desktop 5.2 BETA.

In the final release, it will be possible to download a dataset in different formats - CSV, XLSX, etc.

To simplify beta-testing, we don't require Active Directory authentication, but this in turn doesn't allow us to distinguish users. In the final release, the Catalog will only be available with AD authentication and users will only see their own computations and results.

A loading animation will appear in future updates of Desktop 5.2 BETA.

A few questions from my end:

Does the whole idea of the Catalog seem useful for your organization? The Catalog is basically a virtualized view of datasets (and other analytical assets) stored elsewhere or obtained on the fly. This is different from a traditional centralized data warehouse where data is "materialized".
Does the "URL" item type make sense? Do you see a use for it in your organization?
Is the process of creating and publishing catalog items sufficiently simple for non-technical people?

cvo · April 23, 2022, 5:11pm

Hi Dmitry,

From my perspective, I was expecting a kind of shared super data catalog (see attached project). In this example, from an open data connection, I select a set of datasets and then an iterated module would add each dataset to a catalog. A kind of “add to catalog” action. A set of dedicated actions to manage data catalog could be defined. (suppress, update if exist, freeze…)
To retrieve the datacatalog datasets in a project, I was expecting a 3rd tab in the connection manager, a catalog folder could be the equivalent of a connector name.
To sum up, something like data catalog/data virtualization/semantic layer combined and done the EM way.

As usual, you and your team did awesome work. I’ve said it many times, but EM is the most innovative software I’ve seen in the last 10 years.

Regards

Open Data CVO WIZ data cat 2.morph (11.6 KB)

RJO · April 25, 2022, 7:37am

That sounds great, thanks!

Yes, of course the Data Catalog is useful. I only regret that it costs more on the Server side. But yes, for users it’s very convenient because they can build new datasets on their own or use the ones we will create for them. Everything is centralized and searchable, that’s a big plus. The equivalent I have in mind is datalake + Atlas. What we have now, tasks and files, is not so easy to use and everyone can see the files. It’s more a batch thing than what you propose now, which is more suitable for end users in interactive mode.

Links are interesting for my team because we provide a lot of links including power bi reports, paginated reports and so on. This is a way to centralize everything, and users really need to stay in the same tool to avoid wasting time. I think you should keep it, and I’m sure it’s not hard to maintain.

The process of creating and publishing items is very simple. You can hardly make it easier than that.

dgudkov · April 25, 2022, 12:37pm

Thank you for the feedback, Christophe!

The Catalog will have tools for automation and an API, just not in the initial release. There will be a new action, “Catalog command”, that will allow users to automatically create/update/delete catalog items and their metadata such as fields. Therefore it will be possible to retrieve a list of database tables or files, and generate catalog items with an EasyMorph workflow.

dgudkov · May 10, 2022, 2:07pm

The beta-testing of the Data Catalog is now open. The topic has been moved to the main category.

Rykie · May 22, 2022, 1:36am

Hi Dmitry
I uninstalled EasyMorph on my computer, then installed the Beta version.
In Diagnostics: EasyMorph version: 5.1.2.20 (ef8c88, ‘Beta release’)

I configured the Server Link
I also configured your Connector Manager, but cannot see any connector names.
I can also not see the Data Catalog.
In the notes above you indicated that the Data Catalog is available from version 5.2.
Am I on the wrong version?

Thanks

dgudkov · May 22, 2022, 8:28am

Hi Rykie,

This is what you should see after configuring Server Link and switching to Data Catalog in EasyMorph Desktop.

There are no catalog items because you haven’t created any yet. Each beta tester has his/her own space which is initially empty. Create a few catalog items first. See “Adding catalog items” in my post above.

Rykie · May 23, 2022, 6:43am

Hi Dmitry

I do not see that.

This is what I see. The data catalog is not available.

Where am I going wrong?

dgudkov · May 23, 2022, 8:47am

Try reinstalling the beta using the link in the opening post. I’ve updated it to the most recent build.

Rykie · May 24, 2022, 12:25pm

Thanks, Dmitry.
It was user error (me).
I did not install the key.

dgudkov · May 24, 2022, 1:54pm

Yes, the free edition doesn’t have access to the Data Catalog.

Rykie · May 30, 2022, 6:38am

Hi Dmitry

My feedback from a user’s perspective:

The Data catalog concept is important
I do not totally understand the computed imports
I know that saving to your hard drive is great for users, but it can be challenging from a data perspective one source of the truth.
Adding fields - Drop and drag would have been great or importing the headers.
You cannot change the description or long description after publishing

I had some experience in working with Samenta with setting up a data catalog about 4 years ago.

We did it centrally and set up related links “that you could see in a diagramme”
I am trying to get my head around individuals loading their data catalogs. I am used to a central point (such as from a server side) that loads the “web of data” & metadata and then gives access to individuals to a dataset. I am not sure about the data governance if anyone can load data.

Another suggestion: From Server, can you pick up all data sources being used and start a catalog that way?

I hope my comments helped.

dgudkov · May 31, 2022, 8:15pm

Thank you for the feedback, Rykie. See my answers below:

I do not totally understand the computed imports

A computed item means that the result is produced on the fly using an EasyMorph project stored on Server in the Public folder. For instance, a computed dataset means that the resulting dataset isn't a ready .dset file, but is instead computed on the fly using an EasyMorph project.

Similarly, a computed hyperlink (URL) is computed by an EasyMorph project that should return a URL in the 1st row of the 1st column. Then the resulting URL is opened by EasyMorph.

I know that saving to your hard drive is great for users, but it can be challenging from a data perspective one source of the truth.

The assumption is that the Catalog serves as the single source of truth. If everyone obtains the same data from the Catalog, then everyone is on the same page.

Adding fields - Drop and drag would have been great or importing the headers.

Yes, it's fully manual in the initial release, but we will add ways to semi-automate or fully automate adding field information.

You cannot change the description or long description after publishing

Every item property, including the descriptions, can be changed. For that, click the black arrow in the item and select "Edit" in the item menu. Edit a description and then press Publish.

Perhaps, I should do a webinar and demo the Catalog and answer questions about it. I understand that the concept is new and it requires a bit of adjusting especially for long-time EasyMorph users.

Rykie · June 12, 2022, 2:37am

Thanks, Dmitry. Yes, a demo will be good.

reynsnivea · June 12, 2022, 7:16pm

@dgudkov

Hi Dmitry,

It would be very interesting to see a video explaining the capabilities of the data catalog (product showcase) and maybe another one explaining more in depth how to work with it.

For example: how does one distribute the catalog to other users who are outside your organisation ?
Can we grant access rights for each catalog item ?

Thanks !
Nikolaas

dgudkov · June 13, 2022, 6:18pm

We will do a webinar on Data Catalog on June 24th at 10am EST (3pm GMT). In the webinar, I will demonstrate the Catalog and answer questions. The registration link is below.

>> Click to register <<

reynsnivea · June 14, 2022, 4:54pm

Hi Dmitry,

Great initiative !
Unfortunately, I have another item in my agenda at that point.
Will there be a recording of this webinar ?

Thanks !
Nikolaas