Parsing Exceptionally Complex Column

So I’ve been handed a sample of data to see if EasyMorph will handle it. If it will, then it may result in a licence being bought by my client.

Here is a small sample of the 2 million or so rows in the file

My task, is to boil it all down, into individual rows, so for example

image

Will become

Finance Department,a,b,c,d,e
Aberdeen City Council,a,b,c,d,e
Town House,a,b,c,d,e

Only the first column is to be translated, the rest of the columns are just to be filled down

As a general rule of thumb, any semicolons can be split without consideration, I’ve already worked out how to do that part

Taking this line as an example:

Splitting on the semicolon gives me the following:

Next potentially comes splitting on the ‘/’ symbol, but this is where this now gets complicated.

We can ONLY split on the ‘/’ IF it’s NOT the last one on the line and not followed by 1 or more digits and possibly other chars after the digits, or in other words

name1/name2/name3

Should result in

name1
name2
name3

but

name1/name2/name3/55

should result in

name1
name2
name3/55

and

name1/55

should remain unchanged

The next to consider are number ranges.

1-10

on a line on it’s own should result in

1
2
3
4
5
6
7
8
9
10

Where as multiple groups EG:

1-5,2-9,8-12

Should result in

1
2
3
4
5
2
3
4
5
6
7
8
9
8
9
10
11
12

Meanwhile, if there is a prefix and trailing ‘/’ then they shall be split and recombined, so for example:

Flat 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 /16

Should result in:

16 : Flat 1
16 : Flat 2
16 : Flat 3
16 : Flat 4
16 : Flat 5
16 : Flat 6
16 : Flat 7
16 : Flat 8
16 : Flat 9
16 : Flat 10
16 : Flat 11
16 : Flat 12
16 : Flat 13
16 : Flat 14
16 : Flat 15

Do you think that EasyMorph is up to the challenge for this file? The aim at the end of the day is to split the compact address range rows back up, into individual entities, so that my client can then move the X&Y co-ordinates against each line, to more closely line up with specific buildings on a map, rather than everything just being within a radius of the postcode centre.

Hello,

It’s not the solution. Look at this morph. I use UID and split by “/” to have all data (id of line, splitted text and initial test). After, use fonction and transform to calc or remove your data.
Remember that you can derive table to do calculations.

Test.morph (3,4 Ko)

1 Like

Hi Florent, I will try playing with that idea, thanks. I’m very new to using easymorph, so I don’t know how to do fonction (Function?) I also do not yet know how to derive tables.

Thanks for the starting pointer though.

You can derive from ribbon.

image

After transformation, you can merge data.

“Fonction” is mistake from me. It’s “function” like you can use in “calculate new columns” or “modify columns”.

1 Like

Thank you @Florent I will try that.

The full sample project is at the bottom of this post.

This can be handled using the keepbefore() and keepafter() functions. The functions allow specifying the index of the separator. A negative index counts from the end. In this case

keepafter("name1/name2/name3/55", "/", -1) returns "55"

Then we can check if the last segment is numeric using the isnumeric() function:

isnumeric(keepafter("name1/name2/name3/55", "/", -1)) returns TRUE.

Now we can identify if the last segment is numeric or not. If yes, then we can replace the last slash with a special separator (e.g. ':::'), split the text into lines, then replace the special separator back with a slash.

The "Repeat rows" action can handle it with a bit of arithmetics.

That is doable too using the "Split delimited text into rows" and "Split delimited text into columns" actions.

Here is the full example project:
parse-text.morph (8.5 KB)

1 Like

@dgudkov Thank you, will check that out today, and combine it with the tips @Florent gave me.