A Java Package for Regular Expressions

Djun Kim, Mathematics, The University of British Columbia

Introduction

The following problem is commonly encountered by programmers and computer users:

given a block of text, find a substring which matches a given pattern.

For example we might wish to specify all files in a directory which end with the four characters ".htm". Many operating systems allow us to specify this set of filenames via the expression "*.htm".

Or, we may be interested in finding all occurences in a text file of "search", "searching", "searched", "searches" but not, for example, "research". Programmers generally use text editors which allow such a search using an expression such as "[^a-z]search.*". Such search specifiers are examples of regular expressions.

Problems involving regular expression searches involve finding all strings of a particular, ``easily described''' format, such as dates, filenames, variable names, formatting commands for typesetting programs. However, not everything which we might wish to search for can be specified as a regular expression. For example, we might reasonably want to pick out all examples of mathematical expressions in a file containing the text of an article. Or we might want to find all examples of verbs. Each of these examples attempts to specify string-sets whose structure is too complicated to describe compactly.

What is a Regular Expression?

Let us define precisely what a regular expression is. First, we need to specify A, the alphabet from which the strings we will be searching through are composed.

The closure of A, denote by A*, is the set of all strings (that is, n-tuples) of characters from A. The 0-tuple, or null string, will be denoted by ¥.

The reader is warned at this point that one of the potential obstructions to understanding regular expressions arises in confusing an object with the symbols used to describe it.

A language over A is a subset of A*.

Given strings x of length n and y of length m from some language L we can append y to x to form a string of length n+m, denoted by xy, and called the concatenation of x and y. For example, if x = fox and y = trot then xy = foxtrot.

The notation for contatenation is meant to suggest that it is a kind of product. Analogy suggests that we define exponentiation as the iterated concatenation of a string with itself:

x^k = xx ··· x

where there are k factors of x in the right hand side.

Let L and M be given languages over a fixed alphabet. We can construct new languages from L and M by various operations. For example, let L U M denote the union of the sets L and M; thus a string is in L U M if it is in either L or M.

We can define the concatenation LM of L and M to be the language

LM = { xy | x is in L and y is in M }

Similarly L^k = LL ··· L where there are k factors of L in the right hand side. By convention, L⁰ denotes the set containing ¥, the null string.

The Kleene closure of a language L, denoted by L*, is the union over all i = 0, 1, ... of the languages Lⁱ; Similiarly, L⁺, the positive closure of L, is the union over all i = 1, 2, ... of the languages Lⁱ. Note L⁺ does not contain L⁰.

We shall use parentheses to group operations on sets: for example,

(L U M)*

denotes the Kleene closure of the union of L and M.

If r is a regular expression over the alphabet A, the symbol L(r) denotes the language over A defined by r. Languages defined by regular expressions are called regular languages.

The regular expressions over A form a language R over (A U M) where M is the following set of metacharacters (assumed to be disjoint from A):

M = { (, ), *, |, ¥ }

To help the reader keep track of which are metacharacters, the elements of M will henceforth be typeset in boldface.

As an aside, it turns out that R is not a regular language over (A U M)!

We now define the language R of regular expressions to be the smallest language over A U M satisfying the following four conditions:

¥ is a regular expression. The language L(¥) is just the set containing the empty string, {¥}.
Every element a of A is a regular expression, which will also be denoted by a. The language L(a) is {a}.
If r is a regular expression, then so is (r). The language L((r)) defined by (r) is just L(r).
If r and s are regular expressions, then so are r | s, rs, and r *. The corresponding languages are defined by

L( r | s ) = L( r ) U L( s ),

L( rs ) = L( r ) L( s ),

L( r* ) = (L( r ))*

The first two rules specify basic elements; the third rule allows us to parenthesize regular expressions, and the fourth provides us with an inductive construction.

For convenience of notation, we'll introduce a few conventions. First, we'll establish an operator precedence which will allow us to dispense with unnecessary parentheses: * will bind more tightly than (the left associative operation of) concatentation, which in turn has higher precedence than the left-associative operation of alternation (``|''). For example, the expression (a|b*) | ((c*)(b)) would be written equivalently as a|b* | c*b

Return to the Living Mathematics Project homepage.