Download Reference Manual
The Developer's Library for D
About Wiki Forums Source Search Contact

Some regular expressions take a very long time to compile (at run time)

Moderators: kris

Posted: 04/11/10 16:29:36

I have been converting some code that parses html pages looking for tags within <> braces using regular expressions. On changing from phobos to Tango I have noticed that compiling regular expressions that include the < and > characters as literals takes quite a long time. By compile I mean when the Regex object is instantiated at run time.

For example, on my machine the following 5 Regex lines take 94ms, 359ms, 563ms, 2734ms and 7438ms respectively at run time:

	Regex re1;
	re1 = new Regex(r"a\<([^>]*)\>\<([^>]*)\>\<([^>]*)\>");
	re1 = new Regex(r"aaaaaaaa\<([^>]*)\>\<([^>]*)\>\<([^>]*)\>");
	re1 = new Regex(r"aaaaaaaaaaaa\<([^>]*)\>\<([^>]*)\>\<([^>]*)\>");
	re1 = new Regex(r"a{40}\<([^>]*)\>\<([^>]*)\>\<([^>]*)\>");
	re1 = new Regex(r"a{80}\<([^>]*)\>\<([^>]*)\>\<([^>]*)\>");

I am escaping the < and > characters so that they are literal and not the look ahead or look behind operators. The only difference between these regular expressions is the number of leading 'a' characters.

The rapid increase in compile time for relatively minor additional regular expression complexity seems excessive. Am I missing something ?

I am using release 5428 of Tango v0.99.9 with DMD v1.056 on Windows.

Author Message

Posted: 04/11/10 20:17:31

This is a known issue (certain patterns will take long to compile, but actual search time should be very fast, and you can do that as many times as you wish of course). There are however several long standing issues with the implementation, so most likely we'll find an alternative solution before the next release (hopefully without changing much of the interface).

Posted: 04/11/10 21:42:57

OK, there also seem to be some serious restrictions of regular expressions with tango compared to phobos or to other language implementations, for example:

\w \s etc. cannot be used within [] braces, you
have to use the longer forms [a-zA-Z0-9] etc.

r"([-+]?\d+)" cannot be used to match an integer
string with optional sign, it is necessary to use
r"(([-+])?\d+)", i.e. an extra set of braces are needed.

The convention of not having to escape "-" if it is the
last character in a [] set does not apply, i.e.
r"(([+-])?\d+)" won't run, you have to use r"(([+\-])?\d+)"