how to use Regex?

Posted: 04/14/09 07:07:22

Hi, I am not quite sure how to use the Regex class.

For instance: I have a web page and want to extract the title. I tried with

  auto r = Regex("<title>(.*)<\\/title>");
  auto m = r.match(0);

but it throws an exception:

  tango.text.Regex.RegExpException: RegExp: Unexpected operand at char 20 "" in "<title>(.*)<\/title>"

Could someone give me an example? The above is very easy with Python and PHP.

Posted: 04/14/09 08:40:30

According to the documentation for Regex, "<" and ">" are operators; try escaping them.

Posted: 04/14/09 17:20:18

This is what I was afraid of. I already tried escaping them like this:

auto r = Regex("<title>(.*)<\\/title\>");

but I get a compilation error:

main.d(24): undefined escape sequence \>

If I try with:

auto r = Regex("<title>(.*)<\\/title\\>");

it compiles, but it throws an exception:

tango.text.Regex.RegExpException: RegExp: Unexpected operand at char 20 ">" in "<title>(.*)<\/title\>"

Any ideas?

Posted: 04/14/09 20:44:42

As dood pointed out, "<" and ">" are operators, so you must escape every occurrence of them, which means using something along the lines of

auto r = Regex("\\<title\\>(.*)\\</title\\>");

Note that this now escapes "<" rather than "/", which I don't think needs escaping. You could also try using

auto r = Regex(r"\<title\>(.*)\</title\>");

which is easier to read. The "r" prefix makes it a WYSIWYG (raw) string literal, in which backslashes are taken literally, so the regex escapes do not need to be doubled. The section on string literals in the D language reference on the Digital Mars website may help you there.
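To illustrate the point about the "r" prefix, here is a minimal sketch in plain D (no Tango needed) showing that the doubled-backslash literal and the raw literal contain exactly the same characters. The names escaped and raw are made up for this illustration.

```d
// Ordinary literal: each backslash meant for the regex must be doubled.
const char[] escaped = "\\<title\\>(.*)\\</title\\>";

// WYSIWYG literal: backslashes are kept exactly as written.
const char[] raw = r"\<title\>(.*)\</title\>";

void main()
{
    // Both spellings produce the same pattern text.
    assert(escaped == raw);
}
```

Either spelling can be passed to Regex; the raw form just avoids the second layer of escaping.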

Here's a small program that may make things clearer for you.

import tango.io.Stdout;
import tango.text.Regex;
import tango.net.http.HttpGet;

void main(){
        auto a=new HttpGet("http://www.google.com");
        auto p=cast(char[])a.read();
        auto r=new Regex(r"\<title\>(.*)\</title\>");
        r.test(p);
        Stdout(r.match(0)).newline;
}

Posted: 04/14/09 21:30:08

Hmm, I've somehow overlooked a Regex example:

import tango.io.Stdout;
import tango.text.Regex;

void main()
{
    foreach(m; Regex("ab").search("qwerabcabcababqwer"))
        Stdout.formatln("{}[{}]{}", m.pre, m.match(0), m.post);
}
// Prints:
// qwer[ab]cabcababqwer
// qwerabc[ab]cababqwer
// qwerabcabc[ab]abqwer
// qwerabcabcab[ab]qwer

One must set the input...

However, when I try the following:

import tango.io.Stdout;
import tango.text.Regex;
import tango.net.http.HttpGet;

void main()
{
        auto a=new HttpGet("http://www.google.com");
        auto p=cast(char[])a.read();
        foreach (m; Regex(r"\<title\>(.*)\</title\>").search(p))
        {
                Stdout.formatln("{}[{}]{}", m.pre, m.match(0), m.post);
        }
}

I get an exception:

object.Exception: 4invalid UTF-8 sequence

What does it mean? The page dump looks good to me.

Posted: 04/15/09 18:40:09

I have tried running your program and it works fine for me. I am using DMD v1.036 and the Tango snapshot from 26/11/2008, both of which are available from the Tango Debian repository. If your versions differ, that may be the cause, but I'm not sure.

Posted: 04/15/09 20:03:45

IIRC it is a known bug in the Regex engine that is a bit difficult to fix. In some cases a string is sliced in the middle of a UTF-8 sequence, which causes the error. I think (speculating a bit here) that this could happen with newlines in the input.

The solution is probably (or hopefully at least, since otherwise I'm not sure what the problem is) to rewrite the faulty Unicode functions to a higher standard; there is even a rather old ticket for it.

Posted: 04/15/09 21:49:59

I know what the problem is: the page I got from Google was encoded in Latin-1 (charset=ISO-8859-1). After converting it to UTF-8 (with gedit) and loading the transcoded page, the Regex worked. The exception was coming from Phobos code (Regex.d:4466) and btw a UtfException should be used.
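As a rough illustration of why Latin-1 input produces an "invalid UTF-8 sequence" error, the sketch below checks whether a byte array has the shape of UTF-8. The function name looksLikeUtf8 is hypothetical and not part of Tango or Phobos, and the check is deliberately simplified (it ignores overlong forms and surrogates). A Latin-1 "é" is the single byte 0xE9, which UTF-8 reads as the start of a three-byte sequence, so the check fails.

```d
// Simplified structural check (hypothetical helper): every lead byte must be
// followed by the right number of 10xxxxxx continuation bytes.
bool looksLikeUtf8(ubyte[] data)
{
    size_t i = 0;
    while (i < data.length)
    {
        ubyte b = data[i];
        size_t extra; // continuation bytes expected after the lead byte
        if (b < 0x80) extra = 0;
        else if ((b & 0xE0) == 0xC0) extra = 1;
        else if ((b & 0xF0) == 0xE0) extra = 2;
        else if ((b & 0xF8) == 0xF0) extra = 3;
        else return false; // stray continuation byte or invalid lead byte

        if (i + extra >= data.length) return false; // sequence cut off
        for (size_t k = 1; k <= extra; k++)
            if ((data[i + k] & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

void main()
{
    // "café" in Latin-1: é is the single byte 0xE9
    ubyte[] latin1 = [cast(ubyte)0x63, 0x61, 0x66, 0xE9];
    // "café" in UTF-8: é is the two bytes 0xC3 0xA9
    ubyte[] utf8 = [cast(ubyte)0x63, 0x61, 0x66, 0xC3, 0xA9];

    assert(!looksLikeUtf8(latin1));
    assert(looksLikeUtf8(utf8));
}
```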

Is it possible with Tango to convert between Latin-1 and Unicode?

A proposal: Regex should implement additional methods for searching - the current search method is supposed to be used in a foreach loop.

Posted: 04/16/09 07:15:01

No, Tango does not have conversion from non-Unicode encodings.

As for improvements/additions to the Regex module, please create tickets?

Regarding the decode functions in Regex, they will hopefully be replaced at some point (in a form that can be used more generally).

Posted: 04/23/09 19:56:10

A bit late, but I don't have that much time to play with Tango now...

Anyway, if someone needs conversion from/to Latin-1, here it is (based on PHP 5.2.9). decode converts Latin-1 to UTF-8 and encode converts UTF-8 back to Latin-1:

// Converts Latin-1 (ISO-8859-1) bytes to UTF-8.
// Each Latin-1 byte is the code point of the same value, so every input
// byte becomes either one or two UTF-8 bytes.
char[] decode(void[] content)
{
	ubyte[] buf = cast(ubyte[])content;
	ubyte[] result = new ubyte[buf.length * 2]; // worst case: two bytes per input byte
	size_t n = 0;
	foreach (ubyte c; buf)
	{
		if (c < 0x80)
		{
			// ASCII maps to itself
			result[n++] = c;
		}
		else
		{
			// 0x80..0xFF: two-byte sequence 110xxxxx 10xxxxxx
			result[n++] = cast(ubyte)(0xC0 | (c >> 6));
			result[n++] = cast(ubyte)(0x80 | (c & 0x3F));
		}
	}
	result.length = n;
	return cast(char[])result;
}

// Converts UTF-8 to Latin-1 (ISO-8859-1). Code points above 0xFF and a
// sequence cut off at the end of the input are replaced with '?' (0x3F).
void[] encode(char[] content)
{
	ubyte[] buf = cast(ubyte[])content;
	ubyte[] result = new ubyte[buf.length]; // output is never longer than the input
	size_t n = 0;
	size_t i = buf.length; // bytes still to read
	ubyte* p = buf.ptr;
	while (i > 0)
	{
		uint c = *p;
		size_t len = 1; // bytes occupied by the current sequence
		if (c >= 0xF0)
		{
			// four bytes encoded, 21 bits
			len = 4;
			if (i >= len)
				c = ((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12) | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
		}
		else if (c >= 0xE0)
		{
			// three bytes encoded, 16 bits
			len = 3;
			if (i >= len)
				c = ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
		}
		else if (c >= 0xC0)
		{
			// two bytes encoded, 11 bits
			len = 2;
			if (i >= len)
				c = ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
		}
		if (len > i)
		{
			// truncated sequence: consume what is left instead of
			// stepping past the end of the buffer
			c = 0x3F;
			len = i;
		}
		p += len;
		i -= len;
		// use '?' (0x3F) if no mapping is possible
		result[n++] = cast(ubyte)((c > 0xFF) ? 0x3F : c);
	}
	result.length = n;
	return cast(void[])result;
}