Regex capture groups

jdavidhobbs · March 23, 2018, 4:07pm

I would like to be able to calculate a new field based on a regex capture group from another field.

For instance, let’s say I have a field named “HTML” that contains HTML with a variety of tags, and I’d like to pull out the title from the title tag. I’d like a transformation step where I enter something like “<TITLE>(.*)</TITLE>” (yes, the regex could be improved) and it fills a new field, “TITLE”, with the contents of the capture group. So if row one had <TITLE>My Webpage</TITLE>, the TITLE field would be “My Webpage”; if the next row had <TITLE>Another page</TITLE>, the TITLE field would be “Another page”, and so on.
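For comparison, here is a minimal Python sketch of the behaviour being requested. It is not EasyMorph's implementation: the function name `extract_title` and the sample rows are hypothetical, and the pattern uses a non-greedy `(.*?)` plus case-insensitive matching as one possible improvement on the regex above.

```python
import re

# Non-greedy variant of the pattern from the post. IGNORECASE is an added
# assumption so that <title> and <TITLE> are both accepted.
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.IGNORECASE)

def extract_title(html):
    """Return the text of capture group 1, or None when nothing matches."""
    m = TITLE_RE.search(html)
    return m.group(1) if m else None

# Hypothetical rows standing in for the "HTML" field.
rows = [
    "<HTML><TITLE>My Webpage</TITLE><BODY>...</BODY></HTML>",
    "<HTML><TITLE>Another page</TITLE></HTML>",
    "<HTML><BODY>no title here</BODY></HTML>",
]
titles = [extract_title(r) for r in rows]
print(titles)  # ['My Webpage', 'Another page', None]
```

Only `group(1)` is returned, so the surrounding tags that belong to the full match never reach the new field.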

dgudkov · March 23, 2018, 5:21pm

Maybe I’m missing something, but it sounds like the “Regular expression” transformation does exactly this. Is there a case where it doesn’t do what is needed?

jdavidhobbs · March 23, 2018, 6:03pm

How would I pull out just what’s in between the open and close title tag?

I can see how to pull in the ENTIRE match, but I only want the CAPTURE GROUP. But it doesn’t seem that’s how the regular expression transformation works:

Instead of this I just want the title field to have “lkajsdf”, “lakjsdafalskjdf”, etc.

dgudkov · March 24, 2018, 1:27am

Oh, now I see what you are talking about. Point taken.

dgudkov · March 29, 2018, 3:36pm

This will be added in 3.8, with three new capture modes: Matches only, Groups only, and Matches and groups.
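The thread does not spell out what each mode returns, so the Python sketch below is only one plausible reading of the three names, not a description of the actual feature. The function `capture` and its mode strings are hypothetical.

```python
import re

def capture(pattern, text, mode):
    """Illustrative guess at the three modes; returns a list of strings."""
    m = re.search(pattern, text)
    if m is None:
        return []
    if mode == "matches":
        return [m.group(0)]                # the entire match
    if mode == "groups":
        return list(m.groups())            # capture groups only
    if mode == "matches_and_groups":
        return [m.group(0), *m.groups()]   # entire match, then each group
    raise ValueError(f"unknown mode: {mode}")

text = "<TITLE>My Webpage</TITLE>"
pattern = r"<TITLE>(.*?)</TITLE>"
print(capture(pattern, text, "matches"))             # ['<TITLE>My Webpage</TITLE>']
print(capture(pattern, text, "groups"))              # ['My Webpage']
print(capture(pattern, text, "matches_and_groups"))  # ['<TITLE>My Webpage</TITLE>', 'My Webpage']
```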

jdavidhobbs · April 2, 2018, 10:46am

Great!

SoBeGuy · May 1, 2021, 4:47pm

For the Regular Expression transformation, could you please add additional regex matching options like:

Dot matches line breaks
^$ matches line breaks
Exact/Free spacing
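To show what these three options change, here is a short sketch using the closest equivalents in Python's `re` module (`re.DOTALL`, `re.MULTILINE`, `re.VERBOSE`). The sample text and variable names are made up for illustration, and other regex engines may name or scope these options differently.

```python
import re

text = "first line\nsecond line"

# Dot matches line breaks: with re.DOTALL, "." can cross the newline.
without_dotall = re.search(r"first.*second", text)
with_dotall = re.search(r"first.*second", text, re.DOTALL)
print(without_dotall is None, with_dotall is not None)  # True True

# ^$ match at line breaks: with re.MULTILINE, "^" anchors at every line start.
single = re.findall(r"^\w+", text)
multi = re.findall(r"^\w+", text, re.MULTILINE)
print(single, multi)  # ['first'] ['first', 'second']

# Free spacing: with re.VERBOSE, unescaped whitespace is ignored and
# comments are allowed, so a literal space must be escaped or put in a class.
name_re = re.compile(r"""
    (\S+)   # first name
    [ ]     # a literal space
    (\S+)   # last name
""", re.VERBOSE)
name_groups = name_re.search("John Doe").groups()
print(name_groups)  # ('John', 'Doe')
```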

Also, there should be an option to perform a regex replace instead of a regex match. When the replace option is enabled, we could enter the replacement text, including back references. This would allow for much more powerful regex transformations, performed in a single step.

For example, suppose you have an input column that contains the following line:

Name: John Doe

I’d like to create the following regex:
(?<=^Name: )([^\s]+) ([^\s]+)

This would extract the first and last name into capture groups. Then, you could enter the following replacement text:

$2, $1

So the resulting column would contain Doe, John.
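As a point of comparison, here is how this example behaves in Python's `re` module. This is a sketch, not EasyMorph's engine, and Python writes backreferences as `\2` rather than `$2`. One detail worth noting: a plain substitution replaces only the matched text, and a lookbehind is not part of the match, so the `Name: ` prefix survives. Getting exactly `Doe, John` means either expanding the replacement template against the match alone or consuming the prefix in the pattern.

```python
import re

line = "Name: John Doe"
pattern = re.compile(r"(?<=^Name: )([^\s]+) ([^\s]+)")

# Substitution replaces only the matched text; the lookbehind prefix is kept.
replaced = pattern.sub(r"\2, \1", line)
print(replaced)  # Name: Doe, John

# Expanding the template against the match alone drops everything else.
m = pattern.search(line)
expanded = m.expand(r"\2, \1") if m else None
print(expanded)  # Doe, John

# Alternatively, consume the prefix so that the substitution removes it.
consumed = re.sub(r"^Name: (\S+) (\S+)", r"\2, \1", line)
print(consumed)  # Doe, John
```

Which of these a replace option in the transformation should do is a design choice; the second form matches the result described above.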

I know EasyMorph includes a regexreplace function, but it would be convenient to also have this capability built into the Regular Expression transformation.