The Developer's Library for D

tango.text.Util.contains: code units or code points?

Moderators: larsivi kris

Posted: 11/30/09 18:43:09

Do the 'contains' and 'containsPattern' in 'tango.text.Util' work on code units or code points?


Posted: 11/30/09 19:38:30

The byte pattern has to match exactly, which is why those utilities are templated for each of char, wchar, and dchar.

Posted: 11/30/09 19:48:45 -- Modified: 11/30/09 19:59:05 by Abscissa -- Modified 3 Times

But for the char and wchar versions, is the search restricted to code point boundaries only or does it simply go from one code unit to the next code unit? I suppose I could just test this, and I think I will, but to clarify what I mean:

char[] str = "{one single symbol that just happens to be multi-byte in utf8}";

// I do realize that this code unit, by itself, is invalid as a code point,
// but I assume that 'contains' doesn't actually check that it's valid (or am I wrong?)
char c = str[$-1];

bool result = tango.text.Util.contains(str, c); // True or False?

Posted: 11/30/09 20:11:24

Just did a little test:

import tango.io.Stdout;
import tango.text.Util;

void main()
{
	// These three characters are 3 bytes each in UTF-8.
	// If this gets messed up, it's supposed to be
	// the Japanese kanji for "nihongo".
	char[] str = "日本語";

	// I do realize that these, by themselves, are invalid as code points,
	// but 'contains' and 'containsPattern' don't actually check for that.
	char c = str[$-1]; // *part* of a multi-byte utf8 code point
	
	// straddles the border of two multi-byte utf8 code points, but doesn't
	// totally encompass either.
	char[] subStr = str[2..4];

	if(tango.text.Util.contains(str, c))
		Stdout("contains operates on code units").newline;
	else
		Stdout("contains operates on code points").newline;

	if(tango.text.Util.containsPattern(str, subStr))
		Stdout("containsPattern operates on code units").newline;
	else
		Stdout("containsPattern operates on code points").newline;
}

Output:

contains operates on code units
containsPattern operates on code units

Posted: 11/30/09 20:20:52

correct
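
If matching on whole code points is what's wanted, one option is to decode to dchar[] first and search that instead, since the utilities are templated on the element type. A minimal sketch, assuming D1 with Tango and assuming `toString32` from `tango.text.convert.Utf` is used for the decoding step; the names `wide` and `stray` are made up for illustration:

```d
import tango.io.Stdout;
import tango.text.Util;
import Utf = tango.text.convert.Utf;

void main()
{
	char[] str = "日本語";

	// Decode UTF-8 into UTF-32, where each array element is one code point.
	dchar[] wide = Utf.toString32(str);

	// The last UTF-8 code unit, widened to a dchar. It is not one of the
	// code points in the text, so a dchar-level search should not find it.
	dchar stray = str[$-1];

	if (contains(wide, stray))
		Stdout("stray code unit matched").newline;
	else
		Stdout("stray code unit not matched").newline;

	// A complete code point taken from the decoded text is still found.
	if (contains(wide, wide[$-1]))
		Stdout("whole code point matched").newline;
}
```

The search itself is unchanged here; it still compares element by element. Only the element type differs, so each comparison covers a full code point rather than a fragment of one.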

Posted: 11/30/09 23:02:37 -- Modified: 11/30/09 23:04:05 by kris -- Modified 3 Times

btw, I was wondering if there's a need for those fully-qualified names in your example: if (tango.text.Util.contains(str, c))

I'm sure you already know this but, just in case, here's what we recommend:

import text = tango.text.Util;

if (text.contains (...))

or, if you don't care about namespaces:

import tango.text.Util;

if (contains (...))

2c

Posted: 11/30/09 23:49:30 -- Modified: 11/30/09 23:58:16 by Abscissa -- Modified 2 Times

Oh, yeah, I know. I was just being overly paranoid about making sure it was calling the contains I meant to call. I have no idea why I felt the need to do that. :)

EDIT: Now I remember: In the little code snippet in my earlier post, I used it fully-qualified because it wasn't a full program and I didn't use an import. Then when I copy/pasted and turned it into a test app, it just kinda stayed that way. :) Also I don't usually use the (re)named import for tango.text.Util, because then I can't use D's nifty array-member calling syntax (which I am absolutely in love with), so I've gotten into the habit of going fully-qualified whenever the compiler complains about a conflict (not that there was a conflict in this case).
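
To make the two call styles concrete, here is a rough sketch, assuming D1 with Tango. Both imports are pulled in side by side purely for comparison, and the sample string is made up:

```d
import tango.io.Stdout;
import tango.text.Util;          // plain import: functions are visible unqualified
import text = tango.text.Util;   // renamed import, shown only for comparison

void main()
{
	char[] str = "hello world";

	// Array-member call syntax: str is passed as the first argument
	// to the free function 'contains' from the plain import.
	if (str.contains('w'))
		Stdout("member-style call found it").newline;

	// Through the renamed import, the call is written as an
	// ordinary qualified function call.
	if (text.contains(str, 'w'))
		Stdout("renamed-import call found it").newline;
}
```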

Posted: 12/01/09 00:48:38

oh, that's a good point about the "universal call syntax" and import renaming ... thx