A Java Package for Regular Expressions
Djun Kim, Mathematics, The University of British Columbia

Introduction

The following problem is commonly encountered by programmers and computer users:

given a block of text, find a substring which matches a given pattern.

For example we might wish to specify all files in a directory which end with the four characters ".htm". Many operating systems allow us to specify this set of filenames via the expression "*.htm".

Or, we may be interested in finding all occurences in a text file of "search", "searching", "searched", "searches" but not, for example, "research". Programmers generally use text editors which allow such a search using an expression such as "[^a-z]search.*". Such search specifiers are examples of regular expressions.

Problems involving regular expression searches involve finding all strings of a particular, ``easily described''' format, such as dates, filenames, variable names, formatting commands for typesetting programs. However, not everything which we might wish to search for can be specified as a regular expression. For example, we might reasonably want to pick out all examples of mathematical expressions in a file containing the text of an article. Or we might want to find all examples of verbs. Each of these examples attempts to specify string-sets whose structure is too complicated to describe compactly.


What is a Regular Expression?

Let us define precisely what a regular expression is. First, we need to specify A, the alphabet from which the strings we will be searching through are composed.

The closure of A, denote by A*, is the set of all strings (that is, n-tuples) of characters from A. The  0-tuple, or null string, will be denoted by ¥.

The reader is warned at this point that one of the potential obstructions to understanding regular expressions arises in confusing an object with the symbols used to describe it.

A language over A is a subset of A*.

Given strings x of length n and y of length m from some language  L  we can append y to x to form a string of length n+m, denoted by xy, and called the concatenation of x and y. For example, if x = fox and y = trot then xy = foxtrot.

The notation for contatenation is meant to suggest that it is a kind of product. Analogy suggests that we define exponentiation as the iterated concatenation of a string with itself:

xk = xx ··· x
where there are  k  factors of  x  in the right hand side.

Let  L  and  M  be given languages over a fixed alphabet. We can construct new languages from  L  and  M  by various operations. For example, let  L U M  denote the union of the sets  L  and  M; thus a string is in  L U M  if it is in either  L  or  M.

We can define the concatenation  LM  of  L  and  M  to be the language

 LM = { xy | x is in L and y is in M  }

Similarly  Lk = LL ··· L  where there are  k  factors of  L  in the right hand side. By convention,   L0  denotes the set containing  ¥, the null string.

The Kleene closure of a language  L, denoted by  L*, is the union over all  i = 0, 1, ...  of the languages   Li; Similiarly,  L+, the positive closure of  L, is the union over all  i = 1, 2, ...  of the languages   Li. Note  L+  does not contain   L0.

We shall use parentheses to group operations on sets: for example,

 (L U M)* 
denotes the Kleene closure of the union of  L  and  M.

If  r  is a regular expression over the alphabet  A, the symbol  L(r)  denotes the language over  A  defined by  r. Languages defined by regular expressions are called regular languages.

The regular expressions over  A  form a language  R  over  (A U M)  where  M  is the following set of metacharacters (assumed to be disjoint from  A):

M   = { (, ), *, |, ¥ }
To help the reader keep track of which are metacharacters, the elements of  M  will henceforth be typeset in boldface.

As an aside, it turns out that  R  is not a regular language over  (A U M)!

We now define the language  R  of regular expressions to be the smallest language over  A U M  satisfying the following four conditions:

  1.  ¥  is a regular expression. The language  L(¥)  is just the set containing the empty string,  {¥}.

  2. Every element  a  of  A  is a regular expression, which will also be denoted by  a. The language  L(a)  is  {a}.

  3. If  r  is a regular expression, then so is  (r). The language  L((r))  defined by  (r)  is just  L(r).

  4. If  r  and  s  are regular expressions, then so are  r | s,  rs, and  r *. The corresponding languages are defined by

    Lr | s  )  =  Lr )  U  Ls ),


    Lrs )  =  Lr ) Ls ),


    Lr* )  =  (Lr ))*

The first two rules specify basic elements; the third rule allows us to parenthesize regular expressions, and the fourth provides us with an inductive construction.

For convenience of notation, we'll introduce a few conventions. First, we'll establish an operator precedence which will allow us to dispense with unnecessary parentheses: * will bind more tightly than (the left associative operation of) concatentation, which in turn has higher precedence than the left-associative operation of alternation (``|''). For example, the expression  (a|b*)  | ((c*)(b))   would be written equivalently as  a|b*  c*b

Return to the Living Mathematics Project homepage.