Use of tango.text.Util.indexOf

Moderators: larsivi kris

Posted: 03/05/08 22:13:29 Modified: 03/05/08 22:15:02

Hi,

I am switching from Phobos to Tango for various reasons (the first being the presence of XML-related functions, which Phobos lacks in its 1.0 version, even though I don't really care about alpha/beta for my project), and I'm writing some basic String methods that map onto the ones found in Tango.

I have the following:

module lang.String;

import TangoUtil = tango.text.Util;
import TangoLayout = tango.text.convert.Layout;

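public uint indexOf(T, U = uint)(T[] str, T c, U offset = 0) {
  // Search only the tail that starts at offset, and add offset back so the
  // result is an index into str (str.length when c is not found).
  return offset + TangoUtil.indexOf(str[offset..$].ptr, c, str.length - offset);
}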

In fact, it is just like Tango's indexOf function, except that it allows me to pass an additional offset, which Tango's version does not take.

My problem is not with the call itself, nor with whether doing this is worthwhile when Tango already seems to do it well (if I import the module tango.text.Util, I suppose I'd be able to write "foobar".indexOf('a') or "foobar"[offset..$].indexOf('a')).

The problem is: if T is char, then the string is in UTF-8.

My question is simple:

assert("é"d.indexOf(cast(char)'é') == 0);

(The cast is mandatory, because the D parser will otherwise treat the literal as a wchar. Honestly, I don't know why we should have to fight with three kinds of char...)

How will this work?

- If é is a char, then it takes only 8 bits, but in UTF-8 'é' is encoded as 0xC3 0xA9.
- If the lookup works like this:

for (int i = 0; i < length; ++i) {
  if (array[i] == c) return i;
}
return length;

then there is no way it can work, because the é character takes two bytes, so the comparison would have to look at two bytes.

Is there a class, or something better, for dealing with UTF-8-compatible operations?

[edit] or am I forced to use wchar?

http://www.bbnwn.eu

Posted: 03/06/08 04:43:03

This is a tricky aspect of UTF character processing. The difficulty arises from the fact that there's often no one-to-one correspondence between a character and its underlying representation. Wikipedia ought to have a section on this, describing the relationship between code-units and code-points.
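To make the code-unit versus code-point distinction concrete, here is a minimal sketch. It assumes a current D compiler and a UTF-8 source file, and it uses only core language features (no Tango calls).

```d
void main()
{
    // The same single character in the three UTF encodings.
    auto utf8  = "é";   // array of char:  UTF-8 code units
    auto utf16 = "é"w;  // array of wchar: UTF-16 code units
    auto utf32 = "é"d;  // array of dchar: UTF-32 code units

    // .length counts code units, not characters.
    assert(utf8.length == 2);
    assert(utf8[0] == 0xC3 && utf8[1] == 0xA9);
    assert(utf16.length == 1);
    assert(utf32.length == 1);

    // foreach with a dchar loop variable decodes the UTF-8 sequence,
    // so this loop runs once per code point.
    size_t codePoints = 0;
    foreach (dchar c; utf8)
        ++codePoints;
    assert(codePoints == 1);
}
```

A search that compares one char at a time therefore cannot match 'é' in a UTF-8 array, since no single element holds the whole character.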

One way around it is to convert UTF8 (and/or UTF16) into UTF32 first. By doing so, and operating in a UTF32 space instead, you regain the simple one-to-one mapping of characters to array entries (like ASCII). There are some special cases (with toUpper and toLower) where even this does not hold true at all times, but it works sufficiently well for most applications.
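A rough sketch of that approach follows, assuming a current D compiler and a UTF-8 source file. The helpers toUtf32Sketch and indexOfSketch are hypothetical names made up for this illustration; they are not Tango functions, and Tango ships its own conversion routines that are not used here.

```d
// Hypothetical helper: decode a UTF-8, UTF-16 or UTF-32 array into a
// freshly allocated UTF-32 array, one element per code point.
dchar[] toUtf32Sketch(T)(T[] text)
{
    dchar[] result;
    foreach (dchar c; text)   // the dchar loop variable triggers decoding
        result ~= c;
    return result;
}

// Hypothetical helper: plain linear search. Comparing one element at a
// time is valid here because each element is a whole code point.
// Returns text.length when match is absent.
size_t indexOfSketch(dchar[] text, dchar match)
{
    foreach (i, c; text)
    {
        if (c == match)
            return i;
    }
    return text.length;
}

void main()
{
    auto wide = toUtf32Sketch("éa");
    assert(wide.length == 2);                // two characters
    assert(indexOfSketch(wide, 'a') == 1);   // character index, not byte offset
    assert(indexOfSketch(toUtf32Sketch("abcdé"), 'é') == 4);
    assert(indexOfSketch(toUtf32Sketch("abc"), 'é') == 3);  // not found
}
```

Note that the returned index refers to the UTF-32 copy; it is not a valid byte offset into the original UTF-8 array once a multi-byte character precedes the match.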

The reason why there are multiple UTF encodings is really a space/time trade-off. UTF32 can often consume more space than utf8 or utf16, but it is often faster and/or more convenient to work with.
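The space side of that trade-off can be checked directly. This is only an illustration, assuming a current D compiler and a UTF-8 source file, with core language features only.

```d
void main()
{
    // Storage in bytes for the same ASCII text in each encoding.
    assert("hello".length  * char.sizeof  == 5);
    assert("hello"w.length * wchar.sizeof == 10);
    assert("hello"d.length * dchar.sizeof == 20);

    // For a non-ASCII character such as é the gap narrows.
    assert("é".length  * char.sizeof  == 2);
    assert("é"w.length * wchar.sizeof == 2);
    assert("é"d.length * dchar.sizeof == 4);
}
```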

FWIW, my experience with UTF has shown a reasonably clear pattern: low-level or trivial text processing, such as managing filenames, can easily and happily get by with UTF8 only (MS screwed this up, IMO, whereas Unix got it right). Heavy duty application-level text processing, such as a word-processor, tends to be better off with UTF32 instead. My 2c :)

Posted: 03/06/08 10:21:43

I see. I must admit that I'm most accustomed to java.lang.String, and to the fact that it hides the internal String representation (or rather, it uses only UTF-16 internally, with the appropriate conversions, e.g. http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1 )

Still, as I understand it, UTF-8, UTF-16, and UTF-32 all encode Unicode, so the characters "inside" are the same (e.g. é would have the same code XXX in all three).

So, if I want a multi-purpose (i.e. independent of encoding) indexOf, I should do:

uint indexOf(char[] utf8string, dchar substring, uint offset = 0) {
  char[] e = utf8_encode(substring); // return the utf-8 representation of dchar, it surely exists in Tango
  return indexOf(utf8string, e, offset);
}

Is that right?

And in that case, shouldn't Tango's indexOf function be specialised, or deprecated/forbidden, for searching for a particular character? (e.g. indexOf(char[], dchar), indexOf(wchar[], dchar))

Finally, I wonder whether it wouldn't be better to simply search for another substring (which would be of the same UTF-X type).

http://www.bbnwn.eu

Posted: 03/07/08 22:23:14

You can also just do substring searches, i.e.:

auto pos = "abcdé".find( "é" );

... or the indexOf equivalent of course.
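To illustrate why a substring search sidesteps the problem, here is a naive sketch that compares code units only. It assumes a current D compiler and a UTF-8 source file; findSketch is a hypothetical name for this illustration and is not Tango's implementation.

```d
// Hypothetical helper: naive substring search over code units.
// Returns the index of the first match, or haystack.length if none.
size_t findSketch(T)(T[] haystack, T[] needle)
{
    for (size_t i = 0; i + needle.length <= haystack.length; ++i)
    {
        // Slice comparison checks every code unit of the needle.
        if (haystack[i .. i + needle.length] == needle)
            return i;
    }
    return haystack.length;
}

void main()
{
    // "é" is the two code units 0xC3 0xA9 in both operands, so a
    // unit-by-unit comparison finds it without any decoding.
    assert(findSketch("abcdé", "é") == 4);
    assert(findSketch("abcdé", "x") == 6);    // not found: length is 6 code units
    assert(findSketch("abcdé"w, "é"w) == 4);  // same idea in UTF-16
}
```

This works because the pattern and the text share one encoding, so the multi-byte character is matched as a whole sequence rather than as a single element.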

Posted: 03/08/08 06:19:33

If you're used to a Java String, then see if tango.text.Text helps? It tries to avoid indexing altogether, by employing a "current selection" notion instead. Works quite well for most cases. Again, you might want to use a utf32 instance of Text, and take advantage of the embedded transcoding support in there.