The Theory Behind It All

Regular expressions are a concept borrowed from automata theory. They provide a way to describe a "language" of strings.

The term "language", when used in the sense borrowed from automata theory, can be a bit confusing. A language in automata theory is simply some (possibly infinite) set of strings. Each string (which may be empty) is a sequence of characters drawn from a fixed, finite set. In our case, this set will be all the possible ASCII characters [1].

When we write a regular expression, we are writing a description of some set of possible strings. For the regular expression to be useful, the set of strings it defines should mean something to us.

Regular expressions give us great power to do pattern matching on text documents. We can use the regular expression syntax to write a succinct description of the entire, possibly infinite, class of strings that fit our specification. In addition, anyone else who understands the description language of regular expressions can easily read our description and determine what set of strings we want to match. In this sense, regular expressions are a common notation for describing sets of strings.

When we discuss regular expressions, we discuss "matching". If a regular expression "matches" a given string, then that string is in the class we described with the regular expression. If it does not match, then the string is not in the desired class.
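To make the idea of "matching" concrete, here is a minimal sketch in Perl. The pattern and the helper name in_language are our own illustrative choices, not anything fixed by the theory. The pattern describes the language of strings made up of one or more decimal digits and nothing else, and the helper reports whether a given string belongs to that language.

```perl
use strict;
use warnings;

# Hypothetical example pattern: one or more decimal digits, nothing else.
# \A anchors the match at the start of the string, \z at the very end.
my $pattern = qr/\A[0-9]+\z/;

# Hypothetical helper: returns 1 if the string is in the language
# described by $pattern, and 0 if it is not.
sub in_language {
    my ($string) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# Try a few candidate strings, including the empty string.
foreach my $candidate ("42", "007", "", "4a2") {
    my $verdict = in_language($candidate) ? "matches" : "does not match";
    printf "%-8s %s\n", "'$candidate'", $verdict;
}
```

Note that the language described here is infinite, since there is no limit on how many digits a string may contain, yet the description itself is only a few characters long.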


  1. Newer versions of Perl support Unicode, in which case the character set is no longer limited to ASCII characters.