Regular Expressions in Prolog


Every so often, there's a posting in comp.lang.prolog asking for Prolog predicates for manipulating regular expressions. Unfortunately, there doesn't seem to be anything out there. [I was wrong about that: see the Other Approaches section at the end.]

The other week, I faced this regular expression problem, and decided to code something. I wanted to reuse the regular expression features offered by the POSIX regex library. Its C interface are the four functions regcomp(), regexec(), regerror(), and regfree(), available via regex.h.

I found a few 'bugs' in the version of regex in my Linux (Debian Linux 1.1), and so used an alternative implementation, the rx library, available at ftp://ftp.gnu.org/pub/gnu/rx/rx-1.5.tar.gz.

My coding solution is to have a simple C interface to the POSIX rx library, and have my Prolog predicates call that program.


The C Interface Program

The program is regexpr.c. It takes a regular expression and a string, and returns its results in a convenient Prolog-like term format. A few examples:

$ regexpr "a(b)" "abc"
ok(2).
reg(0, 2, "ab").
reg(1, 2, "b").

$ regexpr "([^ .]*)\.c" " this is file.c"
ok(2).
reg(9, 15, "file.c").
reg(9, 13, "file").

$ regexpr "([[:alpha:]]*)[[:space:]]*([[:alpha:]]*)" "Andrew Davison"
ok(3).
reg(0, 14, "Andrew Davison").
reg(0, 6, "Andrew").
reg(7, 14, "Davison").

$ regexpr "dog" "woofwoof"
err(exe, 1, "No match").

The first and second arguments of each reg/3 term are the starting and finishing positions of the substring in the string input. Counting starts at 0, and the finishing position is one after the end of the substring.

Also note that rx always returns the substring matched by the entire regex even though the regex is not bracketed. This is not the case for the regex libarary.


The Prolog Predicates

The Prolog code in re.prolog consists of 8 regular expression predicates and 4 pretty-printing predicates for writing strings as sequences of characters.

The basic predicate is regexpr/4, which uses BinProlog's popen/3 to invoke regexpr.c and read in its Prolog-like results. It takes a regex and string as input and returns a result term and a list of matches. Some calls to regexpr/4:


?- regexpr("\+", "one+two+three", Result, PMats).
Result = ok(1)
PMats = [reg(3, 4,"+")]         % but with the string in ASCII

?- regexpr("([[:alpha:]]*)[[:space:]]*([[:alpha:]]*)","foo Bar",R,P), wp(P), nl.
[reg(0, 7, "foo Bar"), reg(0, 3, "foo"), reg(4, 7, "Bar")]
R=ok(3),
P=[reg(0,7,[102,111,111,32,66,97,114]),reg(0,3,[102,111,111]),
   reg(4,7,[66,97,114])]

The wp/1 predicate call in the second example prints the strings in the reg/3 terms as a sequence of characters.


Substitutions

There are two substitutions predicates: regsub/4 and regsuball/4. The first does a single substitution, the second repeated substitutions. regsub/4 takes a regex, a string, and a substitution template as inputs, and returns a new string. regsuball/4 takes the same types of arguments.

When the regex matches against a substring in the string input, it is replaced by the substitution template in the new output string. The template can contain slashed integers (e.g. \1) which denote bracketed subexpressions in the regex.

Some examples:


?- regsub("([^\.]*)\.c", "file.c", "gcc -c \0 -o \1.o", Ns), ws(Ns), nl.
"gcc -c file.c -o file.o"
Ns=[103,99,99,32,45,99,32,102,105,108,101,46,99,32,45,111,32,102,
    105,108,101,46,111]

?- regsub("([[:alpha:]]*)[[:space:]]*([[:alpha:]]*)","foo bar", "\2, \1", S), 
   ws(S), nl.
"bar, foo"
S=[66,97,114,32,102,111,111]

?- regsuball("\+", "one+two+three", " ", Ns).
Ns = "one two three"                    % but with the string in ASCII

Note the use of \0, \1 and \2 labels in the substitution templates. \0 matches against the entire regex, \1 refers to the string covered by the first bracketed subexpression inside the regex, \2 to the second pair of brackets.


Other Approaches

Thanks to Gertjan van Noord and Tom Howland for much of this information.

Tom Howland has developed regular expression predicates, available at ftp://ftp.rahul.net/pub/tom/ftp/regex.tar.gz.

Gertjan van Noord has implemented a solution using the SICStus Prolog foreign-language interface, available from http://www.let.rug.nl/~vannoord/prolog-rx.

Gertjan is also the author of the FSA Utilities Toolbox: a collection of utilities for manipulating regular expressions, finite-state automata and finite-state transducers, written in SICStus Prolog. Operations include determinization, minimization, and visualization. There is support to add your own regular expression operators.

Elex is a lex-like tool, a lexical analyzer generator with multiple output languages. There is a version which produces Prolog lexers at http://www.let.rug.nl/~vannoord/Elex/. The original Elex homepage is at http://www.ozemail.com.au/~mpp/elex/elex.html. Elex isn't implemented in Prolog (it uses C++ and Perl), but it can be made to produce Prolog output.

Multiplex by Peter Reintjes (for Quintus Prolog) is another lexer which produces Prolog output (and is implemented in Prolog): http://frege.als.com/nalp/people/Reintjes_Peter/Reintjes_Peter.html.


Comments

I welcome comments: my e-mail address is ad@coe.psu.ac.th.


Andrew Davison
Last updated: September 14th, 1998