I would like to be able to calculate a new field based on a regex capture group from another field.
For instance, let’s say I have a field named “HTML” that contains HTML with a variety of tags, and I’d like to pull out the title from the title tag. I’d like a transformation step where I enter something like “<TITLE>(.*)</TITLE>” (yes, the regex could be improved) and it fills in a new field “TITLE” with the capture group. So if row one had <TITLE>My Webpage</TITLE>, the TITLE field would be “My Webpage”; if the next row had <TITLE>Another page</TITLE>, the TITLE field would be “Another page”; and so on.
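To make the request concrete, here is a minimal sketch of the desired behavior in Python's `re` module (not EasyMorph itself). The sample rows and the `extract_title` helper are made up for illustration, and the pattern uses a non-greedy `(.*?)` as one possible improvement over `(.*)`.

```python
import re

# Hypothetical rows standing in for the "HTML" field.
rows = [
    "<HTML><TITLE>My Webpage</TITLE><BODY>...</BODY></HTML>",
    "<HTML><TITLE>Another page</TITLE></HTML>",
    "<HTML><BODY>No title here</BODY></HTML>",
]

# Non-greedy group so the match stops at the first closing tag.
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return only capture group 1, or None when the tag is absent."""
    m = TITLE_RE.search(html)
    return m.group(1) if m else None

titles = [extract_title(r) for r in rows]
print(titles)  # ['My Webpage', 'Another page', None]
```

The key point is `m.group(1)`: it returns the capture group alone, whereas `m.group(0)` would return the entire match including the tags.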
Maybe I’m missing something, but it sounds like the “Regular expression” transformation does exactly this. Is there a case where it doesn’t do what is needed?
How would I pull out just what’s in between the open and close title tag?
I can see how to pull in the ENTIRE match, but I only want the CAPTURE GROUP. But it doesn’t seem that’s how the regular expression transformation works:
Instead of this I just want the title field to have “lkajsdf”, “lakjsdafalskjdf”, etc.
Oh, now I see what you are talking about. Point taken.
Will be added in 3.8 with three new capture modes: Matches only, Groups only, Matches and groups.
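As a rough illustration of what the three modes could mean, here is a sketch in Python. It is an assumption about the semantics, not EasyMorph's implementation; the `capture` function and its mode names are hypothetical.

```python
import re

def capture(pattern, text, mode):
    """Collect results per match according to a hypothetical capture mode."""
    results = []
    for m in re.finditer(pattern, text):
        if mode == "matches":        # entire match only
            results.append(m.group(0))
        elif mode == "groups":       # capture groups only
            results.extend(m.groups())
        elif mode == "both":         # entire match followed by its groups
            results.append(m.group(0))
            results.extend(m.groups())
        else:
            raise ValueError(f"unknown mode: {mode}")
    return results

text = "<TITLE>lkajsdf</TITLE>"
pattern = r"<TITLE>(.*?)</TITLE>"

matches_only = capture(pattern, text, "matches")
groups_only = capture(pattern, text, "groups")
both = capture(pattern, text, "both")

print(matches_only)  # ['<TITLE>lkajsdf</TITLE>']
print(groups_only)   # ['lkajsdf']
print(both)          # ['<TITLE>lkajsdf</TITLE>', 'lkajsdf']
```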
For the Regular Expression transformation, could you please add additional regex matching options like:
- Dot matches line breaks
- ^$ matches line breaks
- Exact/Free spacing
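For reference, the three options above correspond to flags that most regex engines expose. The sketch below uses Python's `re` flags as an analogy; how EasyMorph would name or expose them is not known here.

```python
import re

text = "first line\nsecond line"

# "Dot matches line breaks" corresponds to re.DOTALL.
dot_default = re.search(r"first.*second", text)            # no match: . stops at \n
dot_all = re.search(r"first.*second", text, re.DOTALL)     # matches across the line break

# "^$ matches line breaks" corresponds to re.MULTILINE.
starts_default = re.findall(r"^\w+", text)                 # ['first']
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE) # ['first', 'second']

# "Free spacing" corresponds to re.VERBOSE: whitespace and # comments
# inside the pattern are ignored, so the regex can be laid out readably.
name_re = re.compile(
    r"""
    ^Name:\s   # literal label followed by one whitespace character
    (\S+)      # group 1: first name
    \s
    (\S+)      # group 2: last name
    """,
    re.VERBOSE,
)
name_groups = name_re.match("Name: John Doe").groups()     # ('John', 'Doe')

print(dot_default, bool(dot_all), starts_default, starts_multiline, name_groups)
```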
Also, there should be an option to perform a regex replace instead of a regex match. When the replace option is enabled, we could enter the replacement text, including back references. This would allow for much more powerful regex transformations, performed in a single step.
For example, suppose you have an input column that contains the following line:
Name: John Doe
I’d like to create the following regex:
(?<=^Name: )([^\s]+) ([^\s]+)
This would extract the first and last name into capture groups. Then, you could enter the following replacement text:
$2, $1
So the resulting column would contain Doe, John.
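Here is a sketch of that replace in Python's `re` module, as an analogy only. Python writes back references as `\1` and `\2` rather than `$1` and `$2`. One caveat worth noting: the lookbehind `(?<=^Name: )` is not part of the match, so a replace over the whole string keeps the `Name: ` prefix. Getting just `Doe, John` requires expanding the template against the match alone; which of the two a built-in replace mode would return is an open question.

```python
import re

# The pattern from the post, unchanged.
NAME_RE = re.compile(r"(?<=^Name: )([^\s]+) ([^\s]+)")
line = "Name: John Doe"

# Whole-string replace: only the matched text "John Doe" is substituted,
# so the "Name: " prefix (matched by the lookbehind) is left in place.
replaced = NAME_RE.sub(r"\2, \1", line)
print(replaced)  # Name: Doe, John

# Expanding the template against the match alone yields only the new text.
m = NAME_RE.search(line)
expanded = m.expand(r"\2, \1") if m else None
print(expanded)  # Doe, John
```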
I know EasyMorph includes a regexreplace function, but it would be convenient to also have this capability built into the Regular Expression transformation.