Data catalog features

cvo · December 23, 2021, 10:51am

It will be quite easy to generate a huge number of datasets.
We will need to be able to locate them easily.
Here is an example from the Opendatasoft portal.

dgudkov · December 23, 2021, 12:13pm

Yes, of course, it will be possible to search the catalog. Items can be found by:

Type
Name
Description
Field name
Field description
Author
Certified / not certified status

RJO · December 28, 2021, 3:25pm

There may be an issue with using a “multiple choice” parameter + “select by lookup” action to request only some of the output fields rather than all of them: generally, if you want a subset of columns with correct figures and no duplicates, you need to aggregate. The problem in this case is that you don’t know what to aggregate, because the fields are chosen dynamically.

Imagine you have A and B as dimensions and C as measure
A / B / C
moto / NY / 1
cycle / NY / 2
And you want only B and C at the end, without duplicates, so NY / 3

Do you think there would be a programmatic way to enable the selection of output fields AND ensure the aggregation follows those fields, assuming that you differentiate dimension parameters from measure parameters so that you know which columns to group by and which to sum (always sum in our case)? We would need a new “aggregate by lookup” action I guess XD
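To make the question concrete, here is a minimal Python sketch of the requested behaviour, using the A / B / C example above. It is not an EasyMorph feature or action; the function name `aggregate_selected` and its parameters are hypothetical, and it assumes the selected fields are already split into dimensions (group by) and measures (always summed).

```python
from collections import defaultdict

def aggregate_selected(rows, dimensions, measures):
    """Group rows by the chosen dimensions and sum the chosen measures.

    Hypothetical helper: `dimensions` and `measures` stand for the
    dynamically selected output fields.
    """
    totals = defaultdict(lambda: [0] * len(measures))
    for row in rows:
        # The grouping key is built only from the selected dimensions.
        key = tuple(row[d] for d in dimensions)
        for i, m in enumerate(measures):
            totals[key][i] += row[m]
    # Rebuild one output row per distinct key.
    return [
        {**dict(zip(dimensions, key)), **dict(zip(measures, sums))}
        for key, sums in totals.items()
    ]

rows = [
    {"A": "moto", "B": "NY", "C": 1},
    {"A": "cycle", "B": "NY", "C": 2},
]
result = aggregate_selected(rows, dimensions=["B"], measures=["C"])
print(result)  # [{'B': 'NY', 'C': 3}]
```

Dropping A from the selection collapses the two rows into NY / 3, while selecting both A and B leaves them unchanged.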

cvo · December 29, 2021, 4:44pm

Greetings,

In the data catalog UI, I’d like to have a table metadata panel, maybe beside the “more details” and “retrieve” buttons.

Regards

dgudkov · December 30, 2021, 1:45pm

Selecting only particular output fields should theoretically be possible but only as long as these fields are defined in the catalog item metadata. It won't be possible to know ahead of time the output field names if they are calculated dynamically.

"Aggregate by lookup" can be done by unpivoting, filtering, aggregating with grouping, and then pivoting back. Something like this: Dynamic Aggregation - #4 by dgudkov

RJO · December 30, 2021, 2:50pm

Well, in our case we only query Oracle databases. So I think with the data catalog we will be able to build the parts of the SQL query and then pass them to one module which will contain a query like:

SELECT {SELECT} FROM {FROM} WHERE {WHERE} GROUP BY {GROUP BY}

Did a test and it’s working.

It’s even possible to control which tables are left-joined depending on what is requested, and to fine-tune the final SQL. It’s like replicating the Business Objects SQL query engine.
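As an illustration of how the parts of such a query could be assembled from the requested fields, here is a rough Python sketch. It is not the actual module described above: `build_query`, the table name `SALES`, and the column whitelist are assumptions made up for the example. Column names are checked against a whitelist because they end up in the SQL text; the optional `where` string is assumed to come from a trusted source.

```python
def build_query(dimensions, measures, table, allowed_columns, where=None):
    """Assemble the SELECT / FROM / WHERE / GROUP BY parts of a query.

    Hypothetical helper: dimensions are grouped by, measures are summed.
    """
    requested = list(dimensions) + list(measures)
    unknown = [c for c in requested if c not in allowed_columns]
    if unknown:
        # Refuse anything that is not a known column name.
        raise ValueError(f"Unknown columns: {unknown}")
    if not measures:
        raise ValueError("At least one measure is required")

    select_parts = list(dimensions) + [f"SUM({m}) AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if dimensions:
        # Group by exactly the dimensions that were selected.
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql

query = build_query(["B"], ["C"], "SALES", allowed_columns={"A", "B", "C"})
print(query)  # SELECT B, SUM(C) AS C FROM SALES GROUP BY B
```

Because the GROUP BY list is derived from the same selection as the SELECT list, the two cannot drift apart when users change the requested fields.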

David · December 30, 2021, 3:39pm

Looks great! Some small questions and comments:

What do the icons in the top-left corner of each box/dataset indicate? On your mockup, we see a Sandbox icon, a database icon and a calendar icon. I guess these reflect either one of the six item types or a classification of the dataset (e.g. whether the data contains a temporal dimension, whether it is more fact-like or dimension-like, or its certification status).
Some catalog items will probably be hierarchical in nature. For instance, on your mockup, there’s the ‘Customers that recently placed an order’. I would expect this dataset to be a subset of a parent ‘All Customers’ dataset. What are your thoughts on allowing for such hierarchy and making it visible ?
On the left-hand pane, it says ‘Retrive’ instead of ‘Retrieve’. Just a typo, but I couldn’t help but notice.
What prompts would an end-user get when retrieving a dataset? E.g. would he/she be able to indicate where to store the dataset, whether to overwrite or rename the previous version, what format, etc?

In the long run, I’m curious how the authorization will work. It would be great for data admins to set a certain visibility level on the datasets (e.g. visible to all / to some) and, if a dataset is visible to an end user, also an interaction level. By that I mean: an end user can see and retrieve certain datasets (as in the mockup). But next to those, there are also some datasets he/she can see but not directly retrieve, only request. This request would then push a notification to the data admin, who can then review the request and push the dataset to the user.

But looks great already!

Best wishes to all

dgudkov · January 3, 2022, 11:46pm

Hi David,

These icons don't represent one of the 6 item types. Instead, they are just icons from a pre-designed library of icons. We don't have them ready yet, so I used action icons instead. In reality, the icons will be generic and represent money, people, orders, etc. You can think of them as "emojis".

Such a hierarchy can be represented using a directory hierarchy. Alternatively, items will have related items. Such relations are less strict than a hierarchy but can also help navigate similar items.

Finally, for strict hierarchies, you can specify a subset of data using a "fixed list" parameter.

For file results, the user will be able to choose a destination folder and the file name.
For dataset results, the user will be able to choose file format (e.g. csv or xlsx) and the file name.

Authorization is something that will evolve over time. I don't think we will be able to get it right from the beginning, so we will start with our standard authorization model where every user of a space has the same rights. Later we will introduce different roles (such as "data steward") but users under the same role will still have the same type of access to all items in the space's catalog. Further evolution of the authorization scheme will be defined based on real-life use scenarios.

RJO · January 7, 2022, 5:27pm

Will there be dedicated log events for data catalog requests, like there are events for tasks? That would be nice!

dgudkov · January 7, 2022, 7:40pm

Yes, data catalog events, even on Desktops, will be logged in the Server journal.

ArendP · March 15, 2022, 2:16pm

Hi, any news on the development of the Data Catalog? I read it’s planned for Q1. Any update on the timeline? I am really curious about this.

dgudkov · March 15, 2022, 9:00pm

Hi Arend,

Unfortunately, we have to postpone the release of the Data Catalog until Q2. Russia’s war against Ukraine affected our release plans because our software development team is located mostly in Ukraine and it takes time to adjust to the new reality and re-organize work (and life). After a two-week pause, our wonderful team has already resumed working on the Data Catalog, but it’s still not clear how quickly we can proceed. The Catalog is 80% ready at this point, but it’s hard to say how much time it will take to complete the remaining 20%.

PS. All other EasyMorph people - sales & licensing support, marketing, accounting, etc. are located in North America and are not affected by the war from an operational standpoint.

RJO · March 22, 2022, 1:57pm

Hi, this is very bad news for us: since it’s still planned for March 2022 on the official website, we planned one big project according to that calendar and announced availability to our users for April. Of course I feel very sorry about what is currently happening.

Can you please update the calendar accordingly? I mean replace March 2022 with a reliable estimated date? It’s a big problem that dates are not updated on the official website. Big companies need time to adapt, and visibility is key.

dgudkov · March 22, 2022, 2:50pm

What could we do, Romain? Who knew this would happen?

The timeline will be updated and an announcement about new deadlines will be sent out later this week.

RJO · March 22, 2022, 3:09pm

The answer is in your post. The right thing to do is to update the timeline as best you can, and hope for a quick end, as we all do.

dgudkov · March 30, 2022, 2:38pm

We already discussed it last year, Romain, and I would like to emphasize it again: the roadmap should not be perceived as our obligation or commitment. We reserve the right to change it at any moment as we see it and without notice. Please do not make promises to your users based on the roadmap.

dgudkov · March 30, 2022, 3:00pm

UPDATE

The Data Catalog is 90% ready and is approaching the closed beta-testing phase that is expected to start in mid-April. The initial release is currently planned for mid-May.

RJO · April 1, 2022, 9:38am

I think one of the next big challenges for the data catalog will be how to query it. I mean, parameters are very limited, even more so if you consider that they are not dynamic (not directly dependent on database values as in other BI tools).

On our side, I think we will offer a text parameter so that users can type their own SQL filter based on fields, with documentation, and we will use this in a dynamic WHERE condition. But it would be great to have a web query interface, even a basic one, so that users can build a more advanced filter (or / and) on top of a data catalog item.

One other thing would be to allow hierarchy parameters: for example, you allow a parameter showing continent > country > city, and users can select cities at continent level (all cities from a continent), at country level (all cities from a country) or at city level.
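A hierarchy parameter of this kind could resolve a mixed selection down to the leaf level before filtering. The following Python sketch shows the idea only; the `HIERARCHY` data and the `expand_selection` function are made up for illustration, and it assumes that names are unique across levels.

```python
# Hypothetical continent > country > city hierarchy.
HIERARCHY = {
    "Europe": {"France": ["Paris", "Lyon"], "Italy": ["Rome"]},
    "Asia": {"Japan": ["Tokyo"]},
}

def expand_selection(selection, hierarchy=HIERARCHY):
    """Return the sorted cities covered by a selection made at any level."""
    selected = set(selection)
    cities = set()
    for continent, countries in hierarchy.items():
        for country, country_cities in countries.items():
            for city in country_cities:
                # A city is included if it, its country, or its
                # continent was selected.
                if selected & {continent, country, city}:
                    cities.add(city)
    return sorted(cities)

print(expand_selection(["Italy", "Paris", "Asia"]))  # ['Paris', 'Rome', 'Tokyo']
```

The resulting city list could then be used like an ordinary multiple-choice value, for example in a filter condition.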

dgudkov · April 1, 2022, 12:35pm

Instead of dynamic parameters, we’re considering a more general-purpose solution - user input. The discussion is here: Possible feature discussion: User input

The Data Catalog may have additional item types besides datasets, files, and hyperlinks. Future versions may add new item types such as database queries. Currently, it’s just an idea. We would like to get started with something and then see how to develop it further. In any case, the Catalog is very promising in terms of possibilities for future development.

cvo · April 1, 2022, 12:54pm

Hi Dmitry,
Data entry forms and the data catalog together would be a very powerful feature.
I would also like some kind of tags or categories linked to columns.
This way I could link 2 or more datasets on the fly through their common dimensions or attributes.

Regards