Ignore Missing data in Merge another Table Action

htims05 · June 26, 2024, 12:41pm

I have a project where the source data is changing - basically different input files but the data structure is the same so I'm trying to create a repeatable project by just changing input files.

So in my merge steps when this data changes the merge fails because the previous attributes are no longer in the source data. Is there a way to have the project ignore these missing data? I've seen some examples here on the forum but they are quite basic - my source data is one of 10,000+ files, and within those the fields that could be missing from one file to the next is 100 or more and I don't know what will be missing from one file to the next (for example the fields in file 1 could be attributes of a refrigerator, and the fields in file 2 could be attributes of an automobile - obviously those fields will be very different and large in quantity).

So it seems in the examples I've seen you need to look at the data to determine what is optional or to know what the name of the column is to use columnexists() function.

Below is a screenshot of what I'm trying to get it to ignore - all the missing stuff.

dgudkov · June 26, 2024, 1:09pm

You can try the "Natural merge" action - it will merge all existing columns without explicitly specifying them.

htims05 · June 26, 2024, 1:31pm

Thanks that works - I suppose there's no easy/systematic way to have other actions ignore missing as well - like if I have a "remove/keep" columns set to remove a handful of column names....but now those columns are not there....there's no way to get around that error without modifying the way the project works in some fashion? I have a lot of steps in this project with remove/keep, calculated fields based on others, etc that would quit working when I change the source data.

Is it a fair assumption that I would either need to account for every situation where the data is no longer there or simply each time the source data is changed, manually make the changes - like disabling the checkboxes for the remove columns that no longer exist.

dgudkov · June 26, 2024, 1:40pm

If you're working with columns that contain variable attributes, usually the best way is to unpivot the table first. In this case, attributes will become rows instead of columns and it will still be easy to operate with them - merge, filter, etc while having a stable table structure and consistent column names.

In other words, variable attributes should never be columns - it's a road to a lot of pain

Once the transformation is done, pivot the result table back, if necessary.

htims05 · June 26, 2024, 1:56pm

Understood and yes I'm doing that in a lot of places in my workflow within the project. The issue still becomes if the source data changes and now doesn't have AttributeXYZ....then when I'm 30 steps later in my workflow after it's been pivoted back...if I have a 'remove columns" step to remove AttributeXYZ, it fails because it's no longer there. I'm thinking due to the nature of the variable source data, that my project likely can't be built in a way that still runs without any edits after changing the source data - sounds obvious but from an ETL perspective that runs automatically, that's what I'm trying to do.

dgudkov · June 26, 2024, 2:09pm

Technically, EasyMorph has all the necessary means to deal with instable table structures.

For instance, to ensure that a column exists, you can use the "Do not create column if it already exists" option (as below). So you create a column with a dummy value only if it doesn't exist already - that ensures that the column always exists, with real values or with a dummy value.

Also, you can use conditional branching and check whether a column exists using the columnexists() function.

Another common technique is to use the "Rename column by lookup" action to give columns canonical names using a lookup table.

In the next release, we will add an option in the "Keep/remove columns" to not produce an error if the removed column is already missing.

htims05 · June 26, 2024, 2:53pm

My issue is if I don't know that it doesn't exist then I can't use the columnexists() function - meaning I'd have to know all the possible attribute names up front to do this coding - which I don't know. I can't do this dynamically on the fly.

Here's a basic example of what I'm trying to convey here and I don't think I'm doing a good job.

Assume I don't know the columns until the file is loaded, file is loaded dynamically by listing files in a directory and picking out the right file to load

Source File upon first time the EasyMorph job is run:
Columns: SKU, Description, Brand, Counter Top Width, Counter Top Depth
Action: Remove columns "Brand", "Counter Top Depth"

Source File upon the second time the EasyMorph Job is run (again remember I don't know whats in it until it's loaded:
Columns: SKU, Description, Brand, Counter Top Width, Weight, Shipping method, Color
Action: Remove columns: "Brand", "Counter Top Depth"

The action of remove columns the second time the job is run fails because "Counter Top Depth" is missing in the new source file.

Since I didn't know it even existed before the first time the job ran, I can't use a columnsexist() function beforehand. Other methods I believe have the same problem, even if I unpivoted and then repivoted the second time I run the job with a new source data for attributes which are different and unknown to me before hand the Remove Column action will still fail because that column doesn't exist. This will happen for any function where you explicitly select the column for it to run on - like keep/remove columns, Unpivot, etc.

Unless I'm missing something on how I can devise an EasyMorph project to be reusable with differing source data without prior knowledge of the source data columns - I think a good feature for all these types of actions would be to "ignore if missing" option is available within the action itself....much like you say will be in the next release for Keep/Remove Columns - just expand that to do it for any action what you explicitly select the columns.

Thanks for the conversation here - it's been helpful nonetheless.

Jochen_Marquardt · June 27, 2024, 3:39am

As I understood, you would always like to keep
SKU, Description and Counter Top Width.
So, why don’t you just use the action keep the desired columns instead of removing unknown columns?

I just made a little example with two possible solutions:
example.zip (3.3 KB)

Solution 1 is as mentioned above: You know, that you always want these 3 columns SKU, Description and Counter Top Width and ignore all others
Solution 2: In this case, there could be more than these three fields (here, the field brand will be remained, too). So, you keep all columns which are filled in all files.

Hope this helps.

Regards,

Jochen