Terry's ORA Tips

Using Regular Expressions In ORA

This page created 10 Feb 2024

This article describes some more advanced methods for constructing Templates in Online Repository Assistant (ORA) that extend the techniques described in my articles on Transforms and Intermediate Template Methods. Other articles in my ORA Section cover other topics about using the software.

Topics Included in this Article
The Basic Concepts	What Regular Expressions are about
Creating Patterns	The basics of constructing Regular Expressions
Managing Multiple Matches	When a pattern finds more than one match
Managing Letter Case	Selecting upper case, lower case, or either
Capture Groups	Capturing part of a match
Testing Your RegEx Patterns	Testing your work
Additional RegEx Resources	Finding help

ORA offers the opportunity to use Regular Expressions, often called RegEx, in three contexts: the :replace Transform, the :extract and :extractIndex Transforms, and the Value Test Variable. RegEx is a very powerful tool and is used in a wide variety of contexts for data processing. Part of its power is derived from an extremely wide variety of tools, but the apparent complexity of those options can make it seem daunting to new users.

But because it can be so useful in constructing Templates in ORA, I believe it is worthwhile for users interested in creating Templates to learn the basics of RegEx. Learning only a few of the basic RegEx techniques can produce very useful results. This article is an attempt to provide an understanding of the basic techniques and to identify some useful aids for users interested in trying them.

This article is far from a complete overview of RegEx. There are many resources that offer information about RegEx, but most are overwhelmingly complex, covering many topics not needed by the ORA user who is beginning to use this tool. This article focuses only on those aspects of RegEx that I have found most useful in creating Templates in ORA.

The Basic Concepts

RegEx tools examine a string of characters – letters, numbers, punctuation marks, and spaces – to determine whether any part of that string matches a specified "pattern." If so, it is considered a "match."

The "pattern" can be as simple as a single character, for example an underscore character. It can also be a complex pattern of letters and numbers or punctuation marks. It can be limited to the first or last occurrence of that pattern in the string. Or it can be limited by the boundaries of an individual "word" within the string.

Some examples of patterns and their matches:

RegEx Pattern	String	Matches
_	T625_375	_
[0-9]+	T625_375	625 and 375
[a-z][0-9]+	T625_375	T625
\([a-z]+\)	Mary (Jones) Smith	(Jones)
\(([a-z]+)\)	Mary (Jones) Smith	Jones

The last example shows how only part of the match is "captured." In this case the name with its surrounding parentheses was matched, but only the letters within parentheses were captured.

ORA can make use of matches in three ways:

The :replace Transform can replace the matched characters with something else, or with nothing, thus removing them, leaving the rest of the string unchanged.
The :extract Transforms do the opposite, extracting the matched string or only the captured part of it, dropping all the rest of the string.
The Value Test Variable will return "true" if a match is made, and "false" if there is none. That result can then be used in a Conditional structure to control the output of other text.

Some examples of uses of these tools in ORA:

Transform or Variable	String	Result
[Film:replace:_:-]	T625_375	T625-375
[Age:extract:([0-9]+)]	age 23	23
[?:Age/yrs/]	8 mos 34 days	false
[?:Age/mos/]	8 mos 34 days	true
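
For readers who want to see these three uses outside ORA, the following is a rough sketch in TypeScript, which uses the same RegEx flavor (ECMAScript). The function names are hypothetical and this is not ORA's own code; it only imitates the behavior shown in the table above, assuming the case insensitive and global defaults described later in this article. Returning an empty string when nothing matches is also an assumption.

```typescript
// Hypothetical helpers that imitate the three ORA uses of RegEx.
// They are illustrations only, not ORA's implementation.

// Like :replace with default Flags: replace every match, ignoring letter case.
function sketchReplace(value: string, pattern: string, replacement: string): string {
  return value.replace(new RegExp(pattern, "gi"), replacement);
}

// Like :extract: return the first Capture Group of the first match,
// or the whole match when the pattern has no Capture Group.
// Assumption: an empty string is returned when there is no match.
function sketchExtract(value: string, pattern: string): string {
  const m = new RegExp(pattern, "i").exec(value);
  if (m === null) return "";
  return m[1] !== undefined ? m[1] : m[0];
}

// Like the Value Test Variable: true when the pattern matches anywhere.
function sketchValueTest(value: string, pattern: string): boolean {
  return new RegExp(pattern, "i").test(value);
}

const filmResult = sketchReplace("T625_375", "_", "-");   // "T625-375"
const ageResult = sketchExtract("age 23", "([0-9]+)");    // "23"
const hasYears = sketchValueTest("8 mos 34 days", "yrs"); // false
const hasMonths = sketchValueTest("8 mos 34 days", "mos"); // true
```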

Creating Patterns

RegEx patterns most used in ORA are constructed of three main components: elements that specify characters, elements that specify repetition or quantity, and elements that specify relationship to boundaries.

Specifying Characters

There are a large number of tools for specifying characters in patterns in RegEx. Those that I have found useful in ORA are these:

Literal Text – matches the specific character or string of characters. When the match is to include an exact match of a specific character or string of characters, simply make those characters part of the pattern. Common examples include hyphens, quotation marks, and various types of brackets, often with a preceding or following space. But strings of letters, for example "year" or "volume" can also be useful parts of the pattern in some cases.

Characters that have special meaning to RegEx must be "escaped" – preceded with a backslash – when used as literal text. For example a period must be written as " \. " when it is to be interpreted as a period. Special characters I have most often encountered when working with ORA are the period, square brackets, and parentheses.

. (period) – represents any character (except a line break). It is useful when the string to be matched may contain a variety of characters, including letters, numbers, spaces, and punctuation marks.
Character Classes – define a set of characters, any one of which can match the target string. A Character Class is defined by enclosing the desired characters, or a range of characters, in square brackets. Some of the Character Classes I have found useful in ORA include:

Character Class		Meaning
[0-9]		any number
[a-z]		any letter
[aeiou]		any of the specified letters
[a-z ]		any letter or space
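
The character elements above can be tried out with a small TypeScript sketch. The helper name is hypothetical, and the sample strings are my own. Note that inside a TypeScript string the backslash itself must be doubled, which is not needed when typing a pattern into an ORA Template.

```typescript
// Hypothetical helper: list every match of a pattern, ignoring letter case.
function listCharMatches(value: string, pattern: string): string[] {
  const found = value.match(new RegExp(pattern, "gi"));
  return found === null ? [] : found.slice();
}

// An escaped period matches only a literal period.
const literalPeriod = listCharMatches("Vol. 12", "\\.");   // ["."]

// An unescaped period matches any single character.
const anyCharacter = listCharMatches("a.b", ".");          // ["a", ".", "b"]

// Character Classes match one character at a time.
const digits = listCharMatches("T625_375", "[0-9]");       // ["6","2","5","3","7","5"]
const vowels = listCharMatches("Jones", "[aeiou]");        // ["o", "e"]
const lettersOrSpace = listCharMatches("J D", "[a-z ]");   // ["J", " ", "D"]
```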

Specifying Repetition

The characters defined by the methods described above will match a single character in the target string (or in the case of literal text, a single instance of the literal string). Thus the Character Class [a-z] will match any one letter in the range between a and z. When the pattern is to match a series of characters a number of "repetitions" must be specified. There are a number of ways to specify repetition in RegEx, with the following being those I have found most useful in working with ORA:

+ (plus sign) – means the specified characters must be found one or more times for a match.
* (asterisk) – means the specified characters can be found one or more times, or not at all, as part of a match.
? (question mark) – means the specified character may not be found, or can be found once. This is most useful when trying to match a pattern when the matched string may or may not be followed by a space or period.
Specific Number – is defined by enclosing the number of repetitions, or range of repetitions, in curly brackets. Examples of this method:

{1}		matches exactly one time
{4}		matches exactly four times
{1,4}		matches between one and four times
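
A short TypeScript sketch of the repetition elements may help. The helper name and the sample strings are hypothetical, chosen only to show how each element changes what is matched.

```typescript
// Hypothetical helper: list every match of a pattern, ignoring letter case.
function listRepeatMatches(value: string, pattern: string): string[] {
  const found = value.match(new RegExp(pattern, "gi"));
  return found === null ? [] : found.slice();
}

// + : one or more digits in a row
const plusRuns = listRepeatMatches("T625_375", "[0-9]+");        // ["625", "375"]

// * : a "T" followed by any number of digits, including none
const starRuns = listRepeatMatches("T T625", "T[0-9]*");         // ["T", "T625"]

// ? : "mo" with an optional trailing "s"
const optionalS = listRepeatMatches("8 mo 3 mos", "mos?");       // ["mo", "mos"]

// {4} : exactly four digits
const fourDigits = listRepeatMatches("born 1885 age 23", "[0-9]{4}"); // ["1885"]

// {1,4} : between one and four digits, taking as many as possible first
const oneToFour = listRepeatMatches("12345", "[0-9]{1,4}");      // ["1234", "5"]
```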

Specifying Location

There are a large number of tools available in RegEx to specify where the pattern must be found within the target string. Those that I have found most useful in using RegEx in ORA are these:

^ (caret symbol) – means the pattern must be found at the beginning of a target string.
$ (dollar sign) – means the pattern must be found at the end of the target string.
\b – means the pattern must be found at a "word boundary." A "word" may consist of letters, numbers, or underscores, and is separated from other "words" in a string by spaces or other characters, or may be at the beginning or end of the string. The " \b " code can mean either the beginning or end of a "word" depending on whether it is placed before or after the pattern characters to be matched.

Examples

Some examples of how the three types of elements described above can be combined to define a desired pattern may be helpful:

Pattern	Meaning	Sample String	Match
[0-9]+	one or more numbers	film M345	345
[a-z][0-9]+	one letter followed by one or more numbers	film M345	M345
\b[a-z]\b	one letter at beginning of word followed by end of word	John D Jones	D
^0*	any number of zeros at start of string	000489	000
\b0*	any number of zeros at start of word	cert 00689	00
\(.*\)	any number of characters enclosed with parentheses	Jane (Jones) Smith	(Jones)
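
Several of the combined patterns can be checked with the TypeScript sketch below. The helper name is hypothetical. Because a term such as " 0* " can also match zero characters, the word boundary case is shown as a replacement of every match rather than as a single extracted match; that choice is mine, not something taken from ORA.

```typescript
// Hypothetical helper: return the first match of a pattern, or "" when none.
function firstMatchOf(value: string, pattern: string): string {
  const m = new RegExp(pattern, "i").exec(value);
  return m === null ? "" : m[0];
}

const numbersOnly = firstMatchOf("film M345", "[0-9]+");             // "345"
const letterAndNumbers = firstMatchOf("film M345", "[a-z][0-9]+");   // "M345"
const singleLetterWord = firstMatchOf("John D Jones", "\\b[a-z]\\b"); // "D"
const leadingZeros = firstMatchOf("000489", "^0*");                  // "000"
const inParentheses = firstMatchOf("Jane (Jones) Smith", "\\(.*\\)"); // "(Jones)"

// ^ and $ tie a pattern to the start or end of the string.
const firstLetter = firstMatchOf("T625_375", "^[a-z]");              // "T"
const trailingDigits = firstMatchOf("T625_375", "[0-9]+$");          // "375"

// Removing zeros at the start of a word: empty matches at other word
// boundaries are replaced with nothing, so only the zeros disappear.
const zerosRemoved = "cert 00689".replace(new RegExp("\\b0*", "gi"), ""); // "cert 689"
```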

We next move on to some other aspects of RegEx that are relevant to using it in ORA.

Managing Multiple Matches

A RegEx pattern may match more than one segment of a test string. The three ORA functions have different ways to manage which match or matches are to be used.

The :replace Transform has an optional third parameter for Flags. The two Flags recognized are " i " for "case insensitive" (described in the next section) and " g " for "global search." If the Flags parameter is omitted, the :replace Transform assumes the " g " Flag. This means that the replace function acts on every match found in the search string. If only the first match found is to be replaced, the " g " Flag must be disabled. That is done by adding a final colon to create the optional Flag parameter, then entering either a blank to indicate no Flags, or the " i " Flag if that is desired. The following examples illustrate how this works:

Transform	Variable Value	Output
[Age:replace:[0-9]+:x]	13 yrs 25 mo	x yrs x mo
[Age:replace:[0-9]+:x: ]	13 yrs 25 mo	x yrs 25 mo
[Age:replace:[0-9]+:x:i]	13 yrs 25 mo	x yrs 25 mo
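
The effect of the Flags parameter can be imitated in TypeScript as sketched below. The function name is hypothetical, and treating an omitted parameter as " gi " is an assumption based on the defaults described in this article, not a statement about how ORA is written.

```typescript
// Hypothetical imitation of the :replace Flags behavior described above.
// Assumption: omitting the Flags parameter behaves like "gi";
// a blank Flags parameter means no Flags at all.
function replaceWithFlags(
  value: string,
  pattern: string,
  replacement: string,
  flags?: string
): string {
  const effectiveFlags = flags === undefined ? "gi" : flags.trim();
  return value.replace(new RegExp(pattern, effectiveFlags), replacement);
}

const everyMatch = replaceWithFlags("13 yrs 25 mo", "[0-9]+", "x");          // "x yrs x mo"
const firstOnlyBlank = replaceWithFlags("13 yrs 25 mo", "[0-9]+", "x", " "); // "x yrs 25 mo"
const firstOnlyI = replaceWithFlags("13 yrs 25 mo", "[0-9]+", "x", "i");     // "x yrs 25 mo"
```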

The :extract Transform always extracts only the first match found by a pattern. A variation, the :extractIndex Transform, allows the user to specify which match is to be extracted. With the :extractIndex version, a second parameter specifies which match is to be used. The examples below illustrate how the two variations work.

Transform	Variable Value	Output
[Age:extract:([0-9]+)]	13 yrs 25 mo	13
[Age:extractIndex:([0-9]+):1]	13 yrs 25 mo	13
[Age:extractIndex:([0-9]+):2]	13 yrs 25 mo	25
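
Choosing the first, second, or later match can be sketched in TypeScript as follows. The function name is hypothetical, and returning an empty string when the requested match does not exist is my assumption; this article does not say what ORA outputs in that case.

```typescript
// Hypothetical imitation of :extractIndex: return the Capture Group
// (or whole match) of the match at the given 1-based position.
// Assumption: an empty string is returned when there is no such match.
function extractNth(value: string, pattern: string, index: number): string {
  const re = new RegExp(pattern, "gi");
  let count = 0;
  let m: RegExpExecArray | null = re.exec(value);
  while (m !== null) {
    count += 1;
    if (count === index) {
      return m[1] !== undefined ? m[1] : m[0];
    }
    // Step past an empty match so the loop always moves forward.
    if (m[0] === "") re.lastIndex += 1;
    m = re.exec(value);
  }
  return "";
}

const firstNumber = extractNth("13 yrs 25 mo", "([0-9]+)", 1);  // "13"
const secondNumber = extractNth("13 yrs 25 mo", "([0-9]+)", 2); // "25"
const thirdNumber = extractNth("13 yrs 25 mo", "([0-9]+)", 3);  // "" (no third match)
```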

Multiple matches are not relevant to the Value Test Variable. It returns "true" if one or more matches are found, and "false" when no matches are found.

Managing Letter Case

The case of alphabetic letters is sometimes significant in designing ORA Templates, and at other times not. The ORA functions that use RegEx are by default case insensitive. But each can be configured to be case sensitive.

The :replace and :extract Transforms use Flags in their optional final parameter to control how letter case is evaluated. If that final parameter is omitted, the " i " Flag, meaning case insensitive, is assumed. To make these Transforms case sensitive, create the final (optional) parameter by adding a final colon. Then enter either a blank to indicate no Flags, or the " g " Flag if that is desired. The examples below use the :replace Transform to replace a single letter with nothing (the second parameter is left empty), demonstrating the effects of omitting or specifying the " i " Flag:

Transform	Variable Value	Output
[Film:replace:[a-z]:]	T354-15	354-15
[Film:replace:[a-z]:: ]	T354-15	T354-15
[Film:replace:[A-Z]:: ]	T354-15	354-15
[Film:replace:[a-z]::i]	T354-15	354-15

Note that the Character Classes " [a-z] " and " [A-Z] " have the same meaning when operating in the case insensitive mode, but different meanings in the case sensitive mode.

The Value Test Variable is case insensitive in its normal form – [?:Age/mo/] – but becomes case sensitive if a space, indicating the omitted " i " flag, is added at the end – [?:Age/mo/ ]. These examples demonstrate how this might be used:

Variable	Variable Value	Result
[?:Age/mo/]	3 yr 5 mo	true
[?:Age/MO/]	3 yr 5 mo	true
[?:Age/MO/ ]	3 yr 5 mo	false
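
The difference the " i " Flag makes can be seen directly in a short TypeScript sketch. The variable names are hypothetical; the values are the ones used in the tables in this section.

```typescript
// Value tests: with the "i" Flag letter case is ignored; without it, case must match.
const lowerIgnoringCase = new RegExp("mo", "i").test("3 yr 5 mo"); // true
const upperIgnoringCase = new RegExp("MO", "i").test("3 yr 5 mo"); // true
const upperCaseSensitive = new RegExp("MO", "").test("3 yr 5 mo"); // false

// Replacements: [a-z] and [A-Z] are the same with "i", different without it.
const ignoringCase = "T354-15".replace(new RegExp("[a-z]", "gi"), ""); // "354-15"
const lowerSensitive = "T354-15".replace(new RegExp("[a-z]", ""), ""); // "T354-15"
const upperSensitive = "T354-15".replace(new RegExp("[A-Z]", ""), ""); // "354-15"
```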

Capture Groups

Capture Groups are used in the :extract and :extractIndex Transforms to specify that a part of a larger pattern is to be "captured," in the case of these Transforms, to be extracted. This is useful when one wants to match a specific pattern only when it is surrounded by other pattern elements. The Capture Group is specified by enclosing the part of the pattern to be "captured" with parentheses. The following examples illustrate how a capture group can select a part of a larger pattern to extract:

In the first example, the parentheses around the term for "one or more numbers" – [0-9]+ – indicate that only the numbers are to be extracted. The literal text " mo " as part of the pattern indicates that the numbers are to be extracted only if they are followed by the text " mo ".

In the second example the literal characters left and right parentheses (with escape marks to indicate they are not themselves meaning a capture group) show that the intended match must be enclosed in parentheses. The actual target match is defined by the term for "any number of characters" – .* – inside the escaped parentheses, and that term is enclosed in unescaped parentheses to define the Capture Group.

It is considered good practice to always specify a Capture Group. If none is specified the :extract Transforms will extract all the text that matches the pattern.
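
The two kinds of Capture Group described above can be sketched in TypeScript. The function name, the exact patterns, and the sample values are my own assumptions chosen to fit the descriptions; they are not taken from ORA or from a specific Template.

```typescript
// Hypothetical imitation of :extract: return the first Capture Group of the
// first match, or the whole match when no Capture Group is defined.
function extractCaptured(value: string, pattern: string): string {
  const m = new RegExp(pattern, "i").exec(value);
  if (m === null) return "";
  return m[1] !== undefined ? m[1] : m[0];
}

// Numbers are captured only when followed by the literal text " mo".
const monthsOnly = extractCaptured("13 yrs 25 mo", "([0-9]+) mo");        // "25"

// Escaped parentheses are literal; the unescaped pair is the Capture Group.
const maidenName = extractCaptured("Jane (Jones) Smith", "\\((.*)\\)");   // "Jones"

// With no Capture Group, everything that matched is returned.
const wholeMatch = extractCaptured("13 yrs 25 mo", "[0-9]+ mo");          // "25 mo"
```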

Capture Groups are also useful with the :replace Transform. With that Transform one or more parts of the pattern match can be "saved" and then inserted into the replacement text. The "saved" parts are indicated in the replacement text parameter with a dollar sign and number, so "$1" indicates the first Capture Group is to be output, "$2" the second, etc. The example below, in which Capture Groups are used to insert a comma into a string of numbers, illustrates how this feature can be used:

Transform	Variable Value	Output
[Amount:replace:([0-9]{1,3})([0-9]{3}):$1,$2]	14283	14,283
[Amount:replace:([0-9]{1,3})([0-9]{3}):$1,$2]	489	489

In the first Capture Group, the term " [0-9]{1,3} " indicates that one to three digits are to be found. In the second Capture Group the term " [0-9]{3} " indicates that a group of three digits must be found. If both these matches are made, the output is to be the digits found by the first Capture Group, indicated by the " $1 " term, a literal comma, then the digits found by the second Capture Group, indicated by the " $2 " term.

In the second instance, there was no match found because there were not enough digits in the value, so no replacement was made. (This example does not work correctly with numbers with more than six digits.)
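
The comma example, including its limitation with longer numbers, can be reproduced with the TypeScript sketch below. The function name is hypothetical, and the " gi " Flags are assumed to mirror the :replace defaults described earlier.

```typescript
// Hypothetical imitation of the comma-inserting :replace example.
function insertComma(amount: string): string {
  const pattern = new RegExp("([0-9]{1,3})([0-9]{3})", "gi");
  // "$1" and "$2" stand for the text saved by the first and second Capture Groups.
  return amount.replace(pattern, "$1,$2");
}

const fiveDigits = insertComma("14283");    // "14,283"
const threeDigits = insertComma("489");     // "489" (no match, so unchanged)
const sevenDigits = insertComma("1234567"); // "123,4567" (the limitation noted above)
```

With seven digits the pattern matches the first six, inserts one comma, and then finds too few digits left to match again, which is why the result is not correct.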

Testing Your RegEx Patterns

When creating Templates using RegEx, one way to test the result is to use the Test button on the OraSettings screen, as one would for any Template under development. However, that feature offers few clues about the cause of the problem when the RegEx in the Template does not produce the intended result.

Many users find the website Regular Expressions 101 to be a valuable tool for testing.

The first step in using this website is to set the "Flavor," in the upper left corner of the page, to "ECMAScript (JavaScript)" as shown in the screenshot on the right. There are a number of variations, or "flavors," of RegEx with minor differences in some details, and this sets the website to use the variation used by ORA.

The next step is to enter the pattern to be tested, and a sample of the text it is to be tested against, in the center of the web page, as shown in the screenshot below, which shows a test of the RegEx used in the last example above.

The pattern is entered in the "Regular Expression" field. At the right end of that field are the Flags the website will use in evaluating the pattern. Note that the defaults for this site are different from the defaults used by ORA. The website uses " g " and " m " by default. To obtain accurate results, these should be changed to " g " and " i " as shown in the screenshot, unless different Flags are to be used in ORA as described above for controlling letter case. The Flags are changed in the website by clicking on one of the Flags displayed and choosing the desired ones from the drop-down list that appears.

The test string is entered in the "Test String" field below. The test string can easily be edited to test the pattern with different values that might be found in various records.

The two fields in the right pane of the website provide important information about the pattern entered and the matches, if any, that are found.

The upper field, "Explanation," provides an explanation of the meaning of each term entered in the pattern. Reviewing this information can be very helpful in understanding the issue when a pattern does not produce the expected result.

The field below, "Match Information," lists all the matches found in the test string. When a Capture Group is defined, as it is in the illustrated case, it also lists all the capture groups found. If multiple matches are listed, be aware of how the ORA functions deal with them, as described above.

Additional RegEx Resources

A wide variety of websites provide tutorials and detailed explanations of RegEx, and can be found with a Google search. Two that are often recommended are RegexOne and RexEgg. However, as mentioned above I often find them overwhelming, making it difficult to find what I need.

The Quick Reference section, in the lower right corner of the Regular Expressions 101 site, mentioned above for testing, is also helpful. It provides good brief descriptions of almost all of the RegEx tokens, and if you click a particular token it gives you a more complete description and provides an example.

When I have difficulty getting a pattern to do what I want I generally post a question to the ORA list, where a number of helpful users, and ORA's author, John Cardinal, often respond. Another approach I find useful is to make a Google search with a statement of my question and the term "RegEx."

In addition, a number of my Template Examples make use of RegEx, and include explanations of how the RegEx terms work. They include:

Using "an" before words beginning with vowels in text
Removing country name in place fields
Standardizing street names in place fields
Testing for sheet number in Ancestry.com census records
Testing for birth month in Ancestry.com 1870 US census records
Testing for birth month in Ancestry.com 1950 US census records
Inserting commas in dollar amounts in Ancestry.com census records
Fixing missing "abt" date modifier in Ancestry.com census records
Removing extraneous text from names in Newspapers.com site
Formatting age values on FindAGrave site.
