Terry's ORA Tips

Using Regular Expressions In ORA

This page created 10 Feb 2024

 

This article describes some more advanced methods for constructing Templates in Online Repository Assistant (ORA) that extend the techniques described in my articles on Transforms and Intermediate Template Methods. Other articles in my ORA Section cover other topics about using the software.

Topics Included in this Article
The Basic Concepts: What Regular Expressions are about
Creating Patterns: The basics of constructing Regular Expressions
Managing Multiple Matches: When a pattern finds more than one match
Managing Letter Case: Selecting upper case, lower case, or either
Capture Groups: Capturing part of a match
Testing Your RegEx Patterns: Testing your work
Additional RegEx Resources: Finding help

ORA offers the opportunity to use Regular Expressions, often called RegEx, in three contexts: the :replace Transform, the :extract and :extractIndex Transforms, and the Value Test Variable. RegEx is a very powerful tool and is used in a wide variety of contexts for data processing. Part of its power is derived from an extremely wide variety of tools, but the apparent complexity of those options can make it seem daunting to new users.

But because it can be so useful in constructing Templates in ORA, I believe it is worthwhile for users interested in creating Templates to learn the basics of RegEx. Learning only a few of the basic RegEx techniques can produce very useful results. This article is an attempt to provide an understanding of the basic techniques and to identify some useful aids for users interested in trying them.

This article is far from a complete overview of RegEx. There are many resources that offer information about RegEx, but most are overwhelmingly complex, covering many topics not needed by the ORA user who is beginning to use this tool. This article focuses only on those aspects of RegEx that I have found most useful in creating Templates in ORA.

The Basic Concepts

RegEx tools examine a string of characters – letters, numbers, punctuation marks, and spaces – to determine whether any part of that string matches a specified "pattern." If so, it is considered a "match."

The "pattern" can be as simple as a single character, for example an underscore character. It can also be a complex pattern of letters and numbers or punctuation marks. It can be limited to the first or last occurrence of that pattern in the string. Or it can be limited by the boundaries of an individual "word" within the string.

Some examples of patterns and their matches:

RegEx Pattern   String               Matches
_               T625_375             _
[0-9]+          T625_375             625 and 375
[a-z][0-9]+     T625_375             T625
\([a-z]+\)      Mary (Jones) Smith   (Jones)
\(([a-z]+)\)    Mary (Jones) Smith   Jones

The last example shows how only part of the match is "captured." In this case the name with its surrounding parentheses was matched, but only the letters within parentheses were captured.
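For readers who want to try these patterns outside ORA, here is a minimal sketch in Python. It is only an illustration: ORA uses the ECMAScript (JavaScript) flavor of RegEx, and Python's re module is assumed here to behave the same way for patterns this simple. The variable names are hypothetical.

```python
import re

# ORA's RegEx functions are case insensitive by default, so this sketch
# passes re.IGNORECASE to imitate that behavior (an assumption of the sketch).
FLAGS = re.IGNORECASE

# findall returns every part of the string that matches the pattern.
underscore = re.findall(r"_", "T625_375", FLAGS)
digit_runs = re.findall(r"[0-9]+", "T625_375", FLAGS)
letter_then_digits = re.findall(r"[a-z][0-9]+", "T625_375", FLAGS)

# Escaped parentheses match literal parentheses in the string.
with_parens = re.search(r"\([a-z]+\)", "Mary (Jones) Smith", FLAGS).group(0)

# The inner, unescaped parentheses define a capture group; group(1)
# returns only the text matched inside it.
captured = re.search(r"\(([a-z]+)\)", "Mary (Jones) Smith", FLAGS).group(1)

print(underscore)          # ['_']
print(digit_runs)          # ['625', '375']
print(letter_then_digits)  # ['T625']
print(with_parens)         # (Jones)
print(captured)            # Jones
```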

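ORA can make use of matches in three ways: to replace the matched text with other text (the :replace Transform), to extract the matched text (the :extract and :extractIndex Transforms), and to test whether a match exists (the Value Test Variable).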

Some examples of uses of these tools in ORA:

Transform or Variable    String          Result
[Film:replace:_:-]       T625_375        T625-375
[Age:extract:([0-9]+)]   age 23          23
[?:Age/yrs/]             8 mos 34 days   false
[?:Age/mos/]             8 mos 34 days   true
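The three uses shown in the table (replace, extract, and test) can be imitated in Python as a rough sketch. This is not how ORA is implemented; it only mirrors the results in the table, using Python's re module as a stand-in for ORA's RegEx engine. All variable names are hypothetical.

```python
import re

FLAGS = re.IGNORECASE  # assumed stand-in for ORA's case-insensitive default

# Like [Film:replace:_:-]: replace the matched text with other text.
replaced = re.sub(r"_", "-", "T625_375", flags=FLAGS)

# Like [Age:extract:([0-9]+)]: pull out the captured part of the match.
extracted = re.search(r"([0-9]+)", "age 23", FLAGS).group(1)

# Like [?:Age/yrs/] and [?:Age/mos/]: test whether the pattern matches at all.
has_yrs = re.search(r"yrs", "8 mos 34 days", FLAGS) is not None
has_mos = re.search(r"mos", "8 mos 34 days", FLAGS) is not None

print(replaced)   # T625-375
print(extracted)  # 23
print(has_yrs)    # False
print(has_mos)    # True
```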

Creating Patterns

RegEx patterns most used in ORA are constructed of three main components: elements that specify characters, elements that specify repetition or quantity, and elements that specify relationship to boundaries.

Specifying Characters

There are a large number of tools for specifying characters in patterns in RegEx. Those that I have found useful in ORA are these:

Characters that have special meaning to RegEx must be "escaped" – preceded with a backslash – when used as literal text. For example a period must be written as " \. " when it is to be interpreted as a period. Special characters I have most often encountered when working with ORA are the period, square brackets, and parentheses.

Character Class   Meaning
[0-9]   any number
[a-z]   any letter
[aeiou]   any of the specified letters
[a-z ]   any letter or space
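A short Python sketch may help show the difference escaping makes, and that each Character Class matches exactly one character. The sample strings are made up for illustration, and Python's re module is assumed to treat these simple patterns the same way ORA's RegEx flavor does.

```python
import re

FLAGS = re.IGNORECASE  # assumed stand-in for ORA's case-insensitive default

# An unescaped period is a special character meaning "any character",
# so it matches the very first character of the string.
unescaped = re.search(r".", "Wm. Smith", FLAGS).group(0)

# An escaped period matches only a literal period.
escaped = re.search(r"\.", "Wm. Smith", FLAGS).group(0)

# Each Character Class matches a single character from its set.
first_digit = re.search(r"[0-9]", "T625_375", FLAGS).group(0)
first_letter = re.search(r"[a-z]", "625 Elm", FLAGS).group(0)
first_vowel = re.search(r"[aeiou]", "Smith", FLAGS).group(0)
letter_or_space = re.search(r"[a-z ]", "25 mo", FLAGS).group(0)

print(repr(unescaped))        # 'W'
print(repr(escaped))          # '.'
print(repr(first_digit))      # '6'
print(repr(first_letter))     # 'E'
print(repr(first_vowel))      # 'i'
print(repr(letter_or_space))  # ' '
```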
 
Specifying Repetition

The characters defined by the methods described above will match a single character in the target string (or in the case of literal text, a single instance of the literal string). Thus the Character Class [a-z] will match any one letter in the range between a and z. When the pattern is to match a series of characters a number of "repetitions" must be specified. There are a number of ways to specify repetition in RegEx, with the following being those I have found most useful in working with ORA:

{1}   matches exactly one time
{4}   matches exactly four times
{1,4}   matches between one and four times
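The effect of these repetition counts can be checked with a small Python sketch. The sample strings are hypothetical, and Python's re module is used only as a stand-in for ORA's RegEx engine. The " + " term used in many examples in this article ("one or more") is included for comparison.

```python
import re

sample = "film 12345"

# {1} matches exactly one digit, {4} exactly four.
exactly_one = re.search(r"[0-9]{1}", sample).group(0)
exactly_four = re.search(r"[0-9]{4}", sample).group(0)

# {1,4} matches between one and four digits, taking as many as it can.
one_to_four = re.search(r"[0-9]{1,4}", sample).group(0)

# + matches one or more digits, with no upper limit.
one_or_more = re.search(r"[0-9]+", sample).group(0)

# When there are too few digits, {4} finds no match at all.
too_short = re.search(r"[0-9]{4}", "film 123")

print(exactly_one)   # 1
print(exactly_four)  # 1234
print(one_to_four)   # 1234
print(one_or_more)   # 12345
print(too_short)     # None
```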
 
Specifying Location

There are a large number of tools available in RegEx to specify where the pattern must be found within the target string. Those that I have found most useful in using RegEx in ORA are these:

Examples

Some examples of how the three types of elements described above can be combined to define a desired pattern may be helpful:

Pattern       Meaning                                                   Sample String        Match
[0-9]+        one or more numbers                                       film M345            345
[a-z][0-9]+   one letter followed by one or more numbers                film M345            M345
\b[a-z]\b     one letter at beginning of word followed by end of word   John D Jones         D
^0*           any number of zeros at start of string                    000489               000
\b0*          any number of zeros at start of word                      cert 00689           00
\(.*\)        any number of characters enclosed with parentheses        Jane (Jones) Smith   (Jones)
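Several rows of this table can be reproduced with the Python sketch below, which also shows why the " ^ " location term matters when stripping leading zeros. The value "000409" is a made-up sample, and Python's re module is assumed to handle these patterns the same way ORA's RegEx flavor does.

```python
import re

FLAGS = re.IGNORECASE  # assumed stand-in for ORA's case-insensitive default

# \b on both sides: a single letter that is a whole word by itself.
single_initial = re.search(r"\b[a-z]\b", "John D Jones", FLAGS).group(0)

# ^ ties the pattern to the start of the string.
leading_zeros = re.search(r"^0*", "000489", FLAGS).group(0)

# With ^ only the leading zeros are removed; without it, every zero is.
anchored = re.sub(r"^0*", "", "000409")
not_anchored = re.sub(r"0*", "", "000409")

# Escaped parentheses around .* match the text and its parentheses.
in_parens = re.search(r"\(.*\)", "Jane (Jones) Smith", FLAGS).group(0)

print(single_initial)  # D
print(leading_zeros)   # 000
print(anchored)        # 409
print(not_anchored)    # 49
print(in_parens)       # (Jones)
```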

We next move on to some other aspects of RegEx that are relevant to using it in ORA.

Managing Multiple Matches

A RegEx pattern may match more than one segment of a test string. The three ORA functions have different ways to manage which match or matches are to be used.

Transform                  Variable Value   Output
[Age:replace:[0-9]+:x]     13 yrs 25 mo     x yrs x mo
[Age:replace:[0-9]+:x: ]   13 yrs 25 mo     x yrs 25 mo
[Age:replace:[0-9]+:x:i]   13 yrs 25 mo     x yrs 25 mo

Transform                       Variable Value   Output
[Age:extract:([0-9]+)]          13 yrs 25 mo     13
[Age:extractIndex:([0-9]+):1]   13 yrs 25 mo     13
[Age:extractIndex:([0-9]+):2]   13 yrs 25 mo     25
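The behavior in these two tables can be approximated in Python as follows. This is a sketch under assumptions, not ORA's own code: the count argument of re.sub stands in for the presence or absence of the " g " Flag, and a list position stands in for the number given to :extractIndex.

```python
import re

value = "13 yrs 25 mo"

# Like [Age:replace:[0-9]+:x]: every match is replaced.
replace_all = re.sub(r"[0-9]+", "x", value)

# Like [Age:replace:[0-9]+:x: ]: only the first match is replaced.
replace_first = re.sub(r"[0-9]+", "x", value, count=1)

# findall lists every match in order, which makes the numbering used
# by :extractIndex easy to see (ORA counts from 1, Python from 0).
matches = re.findall(r"([0-9]+)", value)
first_match = matches[0]   # like :extract, or :extractIndex with 1
second_match = matches[1]  # like :extractIndex with 2

print(replace_all)    # x yrs x mo
print(replace_first)  # x yrs 25 mo
print(matches)        # ['13', '25']
print(first_match)    # 13
print(second_match)   # 25
```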

Managing Letter Case

The case of alphabetic letters is sometimes significant in designing ORA Templates, and at other times not. The ORA functions that use RegEx are by default case insensitive. But each can be configured to be case sensitive.

The :replace and :extract Transforms use Flags in their optional final parameter to control how letter case is evaluated. If that final parameter is omitted, the " i " Flag, meaning case insensitive, is assumed. To make these Transforms case sensitive, create the final (optional) parameter by adding a final colon. Then enter either a blank to indicate no Flags, or the " g " Flag if that is desired. The examples below use the :replace Transform to replace a single letter with nothing (the second parameter is left empty), demonstrating the effects of omitting or specifying the " i " Flag:

Transform                 Variable Value   Output
[Film:replace:[a-z]:]     T354-15          354-15
[Film:replace:[a-z]:: ]   T354-15          T354-15
[Film:replace:[A-Z]:: ]   T354-15          354-15
[Film:replace:[a-z]::i]   T354-15          354-15

Note that the Character Sets " [a-z] " and " [A-Z] " have the same meaning when operating in the case insensitive mode, but different meanings in the case sensitive mode.

The Value Test Variable is case insensitive in its normal form – [?:Age/mo/] – but becomes case sensitive if a space, indicating the omitted " i " flag, is added at the end – [?:Age/mo/ ]. These examples demonstrate how this might be used:

Variable       Variable Value   Result
[?:Age/mo/]    3 yr 5 mo        true
[?:Age/MO/]    3 yr 5 mo        true
[?:Age/MO/ ]   3 yr 5 mo        false
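The letter case examples in this section can be imitated with two small Python helper functions. Both functions and their parameters are hypothetical, written only for this sketch. The default of " g " plus " i " when no Flags are given is an assumption drawn from the tables in this article, not from ORA's own code.

```python
import re


def replace_with_flags(value: str, pattern: str, replacement: str,
                       flags: str = "gi") -> str:
    """Imitate :replace; flags is the text of the optional final parameter."""
    re_flags = re.IGNORECASE if "i" in flags else 0
    count = 0 if "g" in flags else 1  # count=0 means "replace every match"
    return re.sub(pattern, replacement, value, count=count, flags=re_flags)


def value_test(value: str, pattern: str, case_sensitive: bool = False) -> bool:
    """Imitate the Value Test Variable; a trailing space means case sensitive."""
    re_flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, value, re_flags) is not None


film = "T354-15"
print(replace_with_flags(film, r"[a-z]", ""))       # 354-15
print(replace_with_flags(film, r"[a-z]", "", ""))   # T354-15
print(replace_with_flags(film, r"[A-Z]", "", ""))   # 354-15
print(replace_with_flags(film, r"[a-z]", "", "i"))  # 354-15

age = "3 yr 5 mo"
print(value_test(age, r"mo"))                       # True
print(value_test(age, r"MO"))                       # True
print(value_test(age, r"MO", case_sensitive=True))  # False
```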

Capture Groups

Capture Groups are used in the :extract and :extractIndex Transforms to specify that only a part of a larger pattern is to be "captured," which for these Transforms means extracted. They are useful when one wants to match a specific pattern only when it is surrounded by other pattern elements. The Capture Group is specified by enclosing the part of the pattern to be "captured" with parentheses. The following examples illustrate how a capture group can select a part of a larger pattern to extract:


In the first example, the parentheses around the term for "one or more numbers" – [0-9]+ – indicate that only the numbers are to be extracted. The literal text " mo " as part of the pattern indicates that the numbers are to be extracted only if they are followed by the text " mo ".

In the second example the literal left and right parentheses (escaped to indicate that they do not themselves define a Capture Group) show that the intended match must be enclosed in parentheses. The actual target match is defined by the term for "any number of characters" – .* – inside the escaped parentheses, and that term is enclosed in unescaped parentheses to define the Capture Group.

It is considered good practice to always specify a Capture Group. If none is specified the :extract Transforms will extract all the text that matches the pattern.
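The two kinds of Capture Group described above can be tried in Python as a rough sketch. The patterns and sample strings here are reconstructed from the descriptions in the text and should be treated as assumptions rather than the exact examples. Python's group(1) stands in for what :extract returns when a Capture Group is present, and group(0) for the full match.

```python
import re

FLAGS = re.IGNORECASE  # assumed stand-in for ORA's case-insensitive default

# Digits are captured only when followed by the literal text " mo".
months = re.search(r"([0-9]+) mo", "13 yrs 25 mo", FLAGS)
captured_months = months.group(1)  # only the Capture Group
whole_match = months.group(0)      # all the text that matched the pattern

# Escaped parentheses are literal; the unescaped pair inside them
# defines the Capture Group.
maiden = re.search(r"\((.*)\)", "Jane (Jones) Smith", FLAGS)
captured_name = maiden.group(1)

print(captured_months)  # 25
print(whole_match)      # 25 mo
print(captured_name)    # Jones
```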

Capture Groups are also useful with the :replace Transform. With that Transform one or more parts of the pattern match can be "saved" and then inserted into the replacement text. The "saved" parts are indicated in the replacement text parameter with a dollar sign and number, so "$1" indicates the first capture group is to be output, "$2" the second, etc. The example below, in which Capture Groups are used to insert a comma into a string of numbers, illustrates how this feature can be used:

Transform                                       Variable Value   Output
[Amount:replace:([0-9]{1,3})([0-9]{3}):$1,$2]   14283            14,283
[Amount:replace:([0-9]{1,3})([0-9]{3}):$1,$2]   489              489

In the first Capture Group, the term " [0-9]{1,3} " indicates that one to three digits are to be found. In the second Capture Group the term " [0-9]{3} " indicates that a group of three digits must be found. If both these matches are made, the output is to be the digits found by the first Capture Group, indicated by the " $1 " term, a literal comma, then the digits found by the second Capture Group, indicated by the " $2 " term.

In the second instance, there was no match found because there were not enough digits in the value, so no replacement was made. (This example does not work correctly with numbers with more than six digits.)
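The same replacement can be tried in Python as a sketch. One difference to note is that Python writes the saved groups as \1 and \2 in the replacement text, where ORA uses $1 and $2. The function name add_comma and the seven-digit sample value are hypothetical; the last call shows the limitation with more than six digits mentioned above.

```python
import re

PATTERN = r"([0-9]{1,3})([0-9]{3})"


def add_comma(value: str) -> str:
    # \1 and \2 insert the text saved by the first and second Capture Groups.
    return re.sub(PATTERN, r"\1,\2", value)


print(add_comma("14283"))    # 14,283
print(add_comma("489"))      # 489 (too few digits, so no replacement)
print(add_comma("1234567"))  # 123,4567 (incorrect for seven digits)
```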

Testing Your RegEx Patterns

When creating Templates using RegEx one way to test the result is to use the Test button on the OraSettings screen, as one would for any Template under development. However that feature offers few clues about the cause of the problem when the RegEx in the Template does not produce the intended result.

Many users find the website Regular Expressions 101 to be a valuable tool for testing.

The first step in using this website is to set the "Flavor," in the upper left corner of the page, to "ECMAScript (JavaScript)" as shown in the screenshot on the right. There are a number of variations, or "flavors," of RegEx with minor differences in some details, and this sets the website to use the variation used by ORA.

The next step is to enter a sample of the text that the RegEx is to test, and the pattern to be tested, in the center of the web page, as shown in the screenshot below, which shows a test of the RegEx used in the last example above.

The pattern is entered in the "Regular Expression" field. At the right end of that field are the Flags the website will use in evaluating the pattern. Note that the defaults for this site are different from the defaults used by ORA. The website uses " g " and " m " by default. To obtain accurate results, these should be changed to " g " and " i " as shown in the screenshot, unless different flags are to be used in ORA as described above for controlling letter case. The Flags are changed in the website by clicking on one of the Flags displayed and choosing the desired ones from the drop-down list that appears.

The test string is entered in the "Test String" field below. The test string can easily be edited to test the pattern with different values that might be found in various records.

The two fields in the right pane of the website provide important information about the pattern entered and the matches, if any, that are found.

The upper field, "Explanation," provides an explanation of the meaning of each term entered in the pattern. Reviewing this information can be very helpful in understanding the issue when a pattern does not produce the expected result.

The field below, "Match Information," lists all the matches found in the test string. When a Capture Group is defined, as it is in the illustrated case, it also lists all the capture groups found. If multiple matches are listed, be aware of how the ORA functions deal with them, as described above.

 

Additional RegEx Resources

A wide variety of websites provide tutorials and detailed explanations of RegEx, and can be found with a Google search. Two that are often recommended are RegexOne and RexEgg. However, as mentioned above I often find them overwhelming, making it difficult to find what I need.

The Quick Reference section, in the lower right corner of the Regular Expressions 101 site, mentioned above for testing, is also helpful. It provides good brief descriptions of almost all of the RegEx tokens, and if you click a particular token it gives you a more complete description and provides an example.

When I have difficulty getting a pattern to do what I want I generally post a question to the ORA list, where a number of helpful users, and ORA's author, John Cardinal, often respond. Another approach I find useful is to make a Google search with a statement of my question and the term "RegEx."

In addition, a number of my Template Examples make use of RegEx, and include explanations of how the RegEx terms work. They include:


Copyright 2000- by Terry Reigel