Transformation: Remove empty columns

joebo · January 17, 2018, 12:27pm

Blank columns can be common in input data or when using a transformation to load files with slightly different column names. It would be useful to be able to automatically remove all columns where all its values are empty (after filtering for example)

dgudkov · January 22, 2018, 4:20am

I’m thinking how this transformation can be generalized. Actually, this is conditional removal of columns. Conditions can be something like:

Column contains only empty values
Column contains empty values or empty text
Column contains empty values or empty text or whitespace
Column doesn’t contain errors (so that columns with errors remain)

The last one can be useful as well because it would allow detecting columns with error values. It would play well with conditionally derived tables and even tell exactly which columns contain errors. With addition of a transformation that filters rows with error values it would make possible to detect and pinpoint error values in any column/row.

Any other ideas what kind of conditions can be used for conditional column removal/selection?

joebo · January 22, 2018, 1:53pm

These are all good. Another option would be to specify a formula as the condition to remove. If the formula is true for all rows then the column gets removed. The formula would need to be able to reference the column so maybe something like [Input]

This may be more advanced than what’s desired. I do like the simplicity of what you propose.

I would suggest the following from your list:

Column contains empty values or empty text or whitespace
Column doesn’t contain errors (so that columns with errors remain)

Also, I later realized it’s possible to implement this without a new transformation using Iterate Table. Unfortunately, with Iterate Table the results are appended as rows and and pivot would be necessary. It would also be nice to have an “Append columns to the right” option for Iterate Table. Here’s a sample project… It’s not very efficient though on large data. It’s been running for 10 minutes on a 2,780,000 x 84 table. I didn’t make any attempt to optimize as it was just a POC. I suspect the slowdown in this is the pivot.

Remove Blanks.zip (3.9 KB)

joebo · January 22, 2018, 2:25pm

Another generalization may be “Column with only one value”… This would work for empty, white space, or if the column only has a value like “X” or 0. We get data like that fairly regularly which is just noise

jdavidhobbs · April 12, 2018, 6:48pm

Agree this would be useful. I have run into a situation where it would be great to remove empty columns rather than having to select all those that should be removed (aside from the extra effort, the next time I import I could have other blank columns).

jdavidhobbs · April 12, 2018, 8:53pm

Related, would be useful to selectively keep rows based on a condition. I am facing a situation now where I only want to keep the columns with “count” in their name. Again, I don’t necessarily know all the columns for any particular run of the script.

dgudkov · April 13, 2018, 3:24pm

This can already be done using the "Metadata" transformation to obtain a list column names, filtering it, and then applying the "Select by lookup" transformation to keep only columns which names are on the list.

dgudkov · April 13, 2018, 4:15pm

I’ve been thinking about this problem and come to conclusion that EasyMorph needs to have a simple transformation for removing empty/constant columns, and a new type of iteration - by columns, where the iterated subproject receives one column at a time and the parent project appends returned columns (if any) into one table, horizontally. This combination would cover simple cases as well as advanced scenarios. Iteration by columns would also make it possible to process matrix tables without unpivoting/re-pivoting them. Can be a rather versatile tool.

dgudkov · December 24, 2018, 3:24am

The “Remove empty columns” action is available starting from v3.8.1. See Release Notes v3.8.1.