
Utf String Iterator

Moderators: kris

Posted: 07/06/08 12:18:30 Modified: 02/09/09 15:53:19

I see many code samples that use the following iteration over a string:

for (int i = 0; i < str.length; ++i) {
   Stdout(str[i]); // just an example
}

This is wrong, since a char[] holds UTF-8 code units and a single character may occupy several of them.
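To make the difference concrete, here is a minimal sketch that compares the code-unit length of a string with the number of characters it decodes to. It assumes the Phobos std.utf.decode function used later in this post; countCodePoints is a hypothetical helper written only for this illustration.

```d
import std.utf : decode;

// Hypothetical helper: counts decoded characters (code points) by walking
// the string with decode, which advances the index past one whole character.
size_t countCodePoints(string s)
{
    size_t n = 0;
    size_t i = 0;
    while (i < s.length)
    {
        decode(s, i);
        ++n;
    }
    return n;
}

void main()
{
    string s = "Мир!";
    assert(s.length == 7);            // three 2-byte Cyrillic letters plus '!'
    assert(countCodePoints(s) == 4);  // but only four characters
}
```

A loop bounded by str.length therefore runs seven times over this string and touches each Cyrillic letter in two halves.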

foreach (dchar c; str) {} works great, but

a) it doesn't allow iteration over two strings in parallel.

b) it doesn't allow you to keep track of the current iterator position. That is, you can't tell

  • the position (offset) of the current character c in the source string
  • how many code units it occupies in the source UTF string (1, 2, 3 or 4)
  • how many characters are left in the string.

c) once you break out of the loop, you can't resume where you left off.
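All three points come down to the position being hidden inside the foreach. As a rough sketch of what explicit tracking looks like, the snippet below keeps the offset in an ordinary variable and asks std.utf.decode how far each character extends. charAt is a hypothetical helper, not a Phobos or Tango function.

```d
import std.utf : decode;

// Hypothetical helper: decodes the character that starts at `offset` and
// reports how many code units it occupies in the source string.
dchar charAt(string s, size_t offset, out size_t width)
{
    size_t next = offset;
    dchar c = decode(s, next);   // decode advances `next` past the character
    width = next - offset;
    return c;
}

void main()
{
    string s = "aЖ€";            // 1-, 2- and 3-byte characters in UTF-8
    size_t offset = 0;
    size_t width;

    assert(charAt(s, offset, width) == 'a' && width == 1);
    offset += width;             // the position is a plain variable, so the
                                 // walk can stop here and resume later
    assert(charAt(s, offset, width) == 'Ж' && width == 2);
    offset += width;
    assert(charAt(s, offset, width) == '€' && width == 3);
    offset += width;
    assert(offset == s.length);
}
```

Because each string gets its own offset variable, two strings can be walked side by side in the same way.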

As an example, I will use a simple test case: given two strings, cut the leading characters that match in both, e.g. "Hello, there" and "Hello, World!" become "there" and "World!", while preserving UTF correctness.

Here is my solution:

import std.utf;
import std.stdio;

struct UtfStringIterator(CharType)
{
    public CharType[] str;      // the string
    public size_t offset;       // an offset of the current character
    public size_t nextOffset;   // an offset of the next character.
    public dchar value;         // current character
                                // its length can be determined as nextOffset - offset.

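    static UtfStringIterator opCall(CharType[] str) {
        UtfStringIterator it;
        it.str = str;
        it.offset = 0;
        it.nextOffset = 0;
        // Decode the first character only if there is one; calling decode
        // on an empty string would index past the end.
        if (str.length > 0) {
            it.value = decode(str, it.nextOffset);
        }

        return it;
    }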

    bool isValid() {
        return this.offset < str.length;
    }

    void moveNext() {
        offset = nextOffset;
        if (isValid()) {
            value = decode(str, nextOffset);
        }
    }

    int opApply(int delegate(ref dchar d) dg) {
        while (isValid) {
            int result = dg(value);
            if (result != 0) {
                return result;
            }
            moveNext();
        }
        return 0;
    }
}

unittest
{
    auto iter = UtfStringIterator!(char)("Hello, World!");
    while (iter.isValid) {
        writef(iter.value);
        iter.moveNext();
    }
    
    writefln();

    auto iter2 = UtfStringIterator!(wchar)("Hello, World!"w);
    foreach (dchar c; iter2) {
        writef(c);
    }
}

int main()
{
    string s1 = "Привет, Страна!";
    string s2 = "Привет, Мир!";
    
    auto i1 = UtfStringIterator!(char)(s1);
    auto i2 = UtfStringIterator!(char)(s2);
    
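    // Advance both iterators while the characters match. Stop at the end
    // of either string so that neither iterator runs past its input.
    while (i1.isValid && i2.isValid && i1.value == i2.value) {
        i1.moveNext();
        i2.moveNext();
    }

    s1 = s1[i1.offset..$];
    s2 = s2[i2.offset..$];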

    assert(s1 == "Страна!");
    assert(s2 == "Мир!");

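    // For contrast, the same cut done byte by byte on the original strings.
    // 'С' and 'М' share the lead byte 0xD0, so the loop stops in the middle
    // of a character and the results are no longer valid UTF-8.
    s1 = "Привет, Страна!";
    s2 = "Привет, Мир!";

    for (size_t i = 0; i < s1.length && i < s2.length; ++i) {
        if (s1[i] != s2[i]) {
            s1 = s1[i..$];
            s2 = s2[i..$];

            break;
        }
    }

    assert(s1 != "Страна!");
    assert(s2 != "Мир!");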

    return 0;
}

I would be happy to see something like this in Tango.
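For comparison, the test case can also be sketched as a single free function built directly on std.utf.decode, without the iterator struct. This is only an illustration under the same assumptions as the code above; commonPrefixLength is a hypothetical name, not an existing Phobos or Tango function.

```d
import std.utf : decode;

// Hypothetical helper: returns the length in code units of the longest
// common prefix of a and b that ends on a character boundary in both.
size_t commonPrefixLength(string a, string b)
{
    size_t i = 0;                // one offset serves both strings while
                                 // their prefixes are identical
    while (i < a.length && i < b.length)
    {
        size_t na = i, nb = i;
        if (decode(a, na) != decode(b, nb))
            break;
        i = na;                  // na == nb here, since the characters match
    }
    return i;
}

void main()
{
    string s1 = "Привет, Страна!";
    string s2 = "Привет, Мир!";
    size_t n = commonPrefixLength(s1, s2);
    assert(s1[n .. $] == "Страна!");
    assert(s2[n .. $] == "Мир!");

    // When one string is a prefix of the other, the loop stops at its end.
    assert(commonPrefixLength("Hello", "Hello, World!") == 5);
}
```

The bounds check in the loop condition covers the case where one string is a prefix of the other, which a bare while (true) loop would not handle.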


Posted: 07/06/08 16:14:08

If you look at tango.text.stream.* and make it fit in there, and then post it to a wishlist ticket, we can discuss it :)

At first glance, it looks like nice work though :)