PCRE Library

The regex library in MiniD's standard library is based on Tango's Tagged NFA regex engine. While this style of regex is much faster than Perl-Compatible Regexes (PCREs) at matching, it's also much slower at compiling and far less powerful. For more advanced text-processing tasks, PCRE is really the de facto standard.

libpcre is a library which implements PCREs and is available on virtually all platforms. This addon library is a binding to libpcre for MiniD.

Prerequisites

You will need libpcre installed, of course. This library loads it dynamically at runtime, so you just need the dynamic library ("libpcre.so" on Linux, "libpcre.dll" or "pcre.dll" on Windows, "libpcre.dylib" on OSX).

This library expects you to have libpcre version 7.4 or higher, built with UTF-8 support. Support for Unicode Properties is not necessary, just UTF-8. This library will check that the version of libpcre that you have conforms to these requirements.

Like any shared library, libpcre can be on the standard shared library search path or in the same directory as the host program that loads it.

For some reason the maintainers of the GnuWin32 project are lazy (?) and have not rebuilt libpcre for Windows since version 7.0. As far as I know, there is no newer precompiled binary available for Windows anywhere. So, to save you the trouble, I've gone through the (irritating) process of building libpcre 7.8 with UTF-8 support on Windows myself; you can download it from the repository here. (Note: you must have the VC++ 2008 redistributable installed for this DLL to work. It is a very small download and a fast install.)

Initialization

To initialize the library within your host app, just do this:

import minid.addons.pcre;

...

// after opening a VM and loading the standard libraries into it
PcreLib.init(t);

That's it. Now any MiniD code loaded by your host program will be able to import the "pcre" module and use the Regex class contained therein.

Library Reference

This library is exposed through the "pcre" module in MiniD. It has a single member, the Regex class. The remainder of the reference is for that class.

class Regex

Wraps a compiled regular expression object. This exposes a similar interface to that of the standard library Regexp class.

this(pattern: string, attrs: string = "")
Regex constructor. The pattern parameter is a string representing the regular expression to be compiled. attrs is a string containing attributes with which to compile this regexp. attrs can contain any of the following characters, in any order:
  • 'i' - Case-insensitive. Literal characters and character classes will match letters of either case.
  • 's' - The dot pattern will match all characters including newlines (whereas it normally does not match them).
  • 'm' - Multiline. Normally, the ^ and $ patterns will match the beginning and end of the string; with this modifier, they will match the beginning and end of each line in the subject string.

Throws an exception if the pattern could not be compiled.
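
As a minimal, untested sketch of the constructor in use (it assumes the host program has already called PcreLib.init so that the "pcre" module can be imported; the pattern and variable names are made up for illustration):

import pcre

// No attributes: ^ and $ anchor only at the start and end of the subject.
local plain = pcre.Regex$ @"^hello\s+(\w+)$"

// 'i' and 'm' combined: case-insensitive, and ^ and $ anchor at every line.
// The attribute characters may be given in any order.
local re = pcre.Regex(@"^hello\s+(\w+)$", "im")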

numGroups()
Returns the number of matched subgroups, counting the entire match as group 0. This returns 0 if test returned false. Otherwise, it returns a number > 0.
groupNames()
Gets an array of strings of named groups. Named groups are created with the "(?P<name>pattern)" regex syntax. So, if you compiled something like @"(?P<lname>\w+), (?P<fname>\w+)", this function would return an array containing the strings "lname" and "fname" (though not in any particular order).
test([subject: string])
This is the workhorse of the regex engine. This gets the next match of the regex in the subject string, returning false when there are no more matches and true otherwise. When called with a parameter, it is set as the new subject string and then tested. That is, something like "re.test("foo")" is the same as "re.search("foo").test()".

When this function returns true, it updates all the matches, which can then be retrieved with the match function. pre and post will also be updated to refer to the correct portions of the subject string.

match([idx: int|string])
Gets matches in the string. If there are no more matches (test returned false), this function will just throw an exception. This function has three forms.

If you call this function with no parameters, it gets the portion of the subject string that corresponds to the entire match of the regex.

If you call this function with an integer parameter, it gets the portion of the subject string that corresponds to the nth subgroup in the regex. Subgroup 0 is the entire regex, and so match(0) will return the same thing as match(). The maximum legal group index is numGroups() - 1.

If you call this function with a string parameter, it gets the portion of the subject string that corresponds to the named subgroup in the regex. If you specify a name that does not exist, an exception will be thrown. Valid names can be retrieved by the groupNames function.
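
The three forms of match, together with test, pre and post, might be combined as in the following untested sketch. It assumes the "pcre" module is available as described under Initialization; the subject string is invented for illustration.

import pcre

local re = pcre.Regex$ @"(?P<lname>\w+), (?P<fname>\w+)"

// Passing a subject to test sets it and looks for the first match.
if(re.test("name: Smith, John;"))
{
	// No argument: the entire match (the same as re.match(0)).
	writefln$ "whole match: '{}'", re.match()

	// Integer argument: a numbered subgroup. Named groups are numbered too,
	// so group 2 and the group named "fname" are the same group here.
	writefln$ "by index: '{}'", re.match(2)
	writefln$ "by name: '{}'", re.match("fname")

	// The text on either side of the entire match.
	writefln$ "before: '{}', after: '{}'", re.pre(), re.post()
}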

opIndex
This is the same as match, so "re[4]" is the same as "re.match(4)". Note, however, that "re[]" is not the same as "re.match()", because "re[]" is a slice, not an index.
search(subject: string)
Sets the subject string. This resets any currently-saved matches. After calling this, you can use test or iterate over matches using a foreach loop.
pre()
Gets the portion of the subject string before the entire regex's match. Throws an exception if there are no matches (test returned false).
post()
Gets the portion of the subject string after the end of the entire regex's match. Throws an exception if there are no matches (test returned false).
replace(subject: string, repl: string|function)
Perform a search-and-replace on subject using this regex as the search term.

If repl is a string, it will simply be used to replace any matches of the regex in subject. (This will probably be expanded.)

If repl is a function, it will be called on each match of regex in subject, with the regex object as the only parameter. Through that parameter the replacement function can access the current match, pre/post etc. It must then return a string to be used as the replacement.

This function returns the result of replacing each match of this regex in subject with repl.
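
A rough, untested sketch of both forms of repl follows. The subject and the replacement function are hypothetical, and the "pcre" module is assumed to have been made available by the host.

import pcre

local re = pcre.Regex$ @"(\w+)\s?=\s?(\w+)"
local subject = "foo = bar"

// String form: every match is replaced with the given text.
local masked = re.replace(subject, "<pair>")

// Function form: called once per match with the regex object as its only
// parameter; the string it returns is used as the replacement.
local swapped = re.replace(subject, function(m) { return m.match(2) ~ " = " ~ m.match(1) })

writefln$ "{} / {}", masked, swapped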

split(subject: string)
Splits subject into pieces, using matches of this regex as the delimiters. Returns the array of split-up components.
find(subject: string)
Searches for the first match of this regex in subject. Returns the position of that match if found, or the length of subject if not. Basically the same as "if(re.search(subject).test()) return #re.pre(); else return #subject".
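
As an untested illustration of split and find (the delimiter pattern and the subject strings are made up, and the "pcre" module is assumed to be importable):

import pcre

local comma = pcre.Regex$ @"\s*,\s*"

// The pieces of the subject that lie between matches of the delimiter.
local parts = comma.split("a, b ,c")

foreach(i, part; parts)
	writefln$ "{}: '{}'", i, part

// find returns the length of the subject when there is no match at all.
local subject = "no delimiters here"

if(comma.find(subject) == #subject)
	writefln$ "no match in '{}'", subject
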
opApply
This allows you to iterate through all matches of this regex in a given subject string. To set the subject string, you can use search, which conveniently returns the regex object, which can then be iterated over.

In the foreach loop, there will be two indices: the first is the 0-based index of the match (not the group index, just how many times the regex has matched in the subject string), and the second is the regex object itself, which you can use to access all the matches.

For example:

import pcre

local re = pcre.Regex$ @"(\w+)\s?=\s?(\w+)"
local subject =
"foo = bar
baz= quux"

foreach(i, m; re.search(subject))
	writefln$ "{}: key = '{}', value = '{}'", i, m.match(1), m.match(2)

This will print out:

0: key = 'foo', value = 'bar'
1: key = 'baz', value = 'quux'

Note that opApply is just defined in terms of test. You can also iterate through all matches by doing something like this:

for(local i = 0, re.search(subject); re.test(); i++)
	writefln$ "{}: key = '{}', value = '{}'", i, re.match(1), re.match(2)

This prints out the same thing as the previous example (given the same regex and subject).