Ticket #1829 (closed defect: fixed)

Opened 14 years ago

Last modified 13 years ago

Tango's regex cannot match Japanese (Unicode characters)

Reported by:	SHOO	Assigned to:	larsivi
Priority:	major	Milestone:	1.0
Component:	Tango	Version:	trunk
Keywords:	regex unicode	Cc:

Description

Tango's regular expression engine cannot match strings that consist of Unicode characters, for example Japanese hiragana.

The following unittest should pass:

import tango.text.Regex;

unittest
{
	auto r = new Regex(".+/(.+)");
	r.test("\u3042\u3044/\u3046\u3048\u304a"); // あい/うえお
	assert(r.match(1) == "\u3046\u3048\u304a"); // うえお
}

Change History

01/06/10 20:23:39 changed by larsivi

owner changed from kris to jascha.

As far as I know, the regex engine should be capable of handling that; however, I'm afraid it is quite possible that there are bugs.

01/25/10 22:17:18 changed by larsivi

Btw, it would be interesting to see the results for the same regex on a string with ASCII characters instead; it could just as well be a bug in the greedy matching part.

01/30/10 14:45:22 changed by larsivi

This does indeed seem to be a Unicode issue, as exchanging the input with ab/cde works just fine.

02/07/10 21:40:21 changed by larsivi

milestone changed from 0.99.9 to 1.0.

06/06/10 19:59:45 changed by kris

owner changed from jascha to larsivi.

08/03/11 01:18:37 changed by dhasenan

The dot operator is hard-coded to match only Basic Latin (the ASCII range), Latin Extended-A, Latin Extended-B, and currency symbols, so hiragana falls outside every range it accepts. I think I can just make it cover the full range from char_t.min to char_t.max. It may need something smart about multiline vs. non-multiline mode.
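
To illustrate why the test in the description fails, here is a minimal sketch of the two behaviours, not Tango's actual code. The names CodeRange, oldDotRanges, oldDotMatches and newDotMatches are hypothetical, the range boundaries are assumed to be the standard Unicode block boundaries for the blocks named above, and the sketch works on dchar code points in D2 syntax, whereas the real engine is templated on char_t.

```d
// Hypothetical sketch only: this is not the code from tango.text.Regex.
struct CodeRange
{
    uint lo; // first code point in the range (inclusive)
    uint hi; // last code point in the range (inclusive)
}

// Assumed boundaries: the Unicode blocks named in the comment above.
immutable CodeRange[] oldDotRanges = [
    CodeRange(0x0000, 0x007F), // Basic Latin (ASCII)
    CodeRange(0x0100, 0x017F), // Latin Extended-A
    CodeRange(0x0180, 0x024F), // Latin Extended-B
    CodeRange(0x20A0, 0x20CF), // Currency Symbols
];

// Old behaviour: '.' accepts a character only if it lies in a listed range.
bool oldDotMatches(dchar c)
{
    foreach (r; oldDotRanges)
    {
        if (c >= r.lo && c <= r.hi)
            return true;
    }
    return false;
}

// Proposed behaviour: '.' accepts the full range of the character type.
// Multiline handling (whether '.' matches a newline) is ignored here.
bool newDotMatches(dchar c)
{
    return c >= dchar.min && c <= dchar.max;
}

unittest
{
    assert(oldDotMatches('a'));       // ASCII is inside the old ranges
    assert(!oldDotMatches('\u3042')); // hiragana "a" (U+3042) is not
    assert(newDotMatches('\u3042'));  // the full range accepts it
}
```

The Hiragana block starts at U+3040, above all of the listed ranges, which is consistent with ab/cde matching while the hiragana input does not.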

08/03/11 02:08:28 changed by dhasenan

status changed from new to closed.
resolution set to fixed.

Fixed in r5672.
