
Translate utf16 indices to utf8 indices


Posted: 05/19/08 22:26:06

I have a UTF-8 string and I want to call a Windows API function that takes a UTF-16 string and returns an ascending sorted array of indices into that UTF-16 string. I need those indices translated into indices into the UTF-8 string.

char[] str = something(); // the existing string

wchar[] wstr = toString16( str );
int[] indices16 = WindowsFunction( wstr.ptr, wstr.length );

// something happens ... :)

int[] indices8; // the resulting array of indices into str

How can I do that in a nice, reliable and preferably efficient way?
This should work for the full Unicode range.


Posted: 05/21/08 16:58:53

Sounds like something that could be implemented more directly in D? A Unicode-savvy person would have to be involved though.

Posted: 05/21/08 20:04:11

In this case I miss the simple UTF functions from Phobos :)

encode/decode/...

Then I remembered that foreach does an implicit UTF conversion, so I think this would work:

char[] str = something(); // the existing string

wchar[] wstr = toString16( str );
int[] indices16 = WindowsFunction( wstr.ptr, wstr.length );
int[] indices8 = indices16;

    // Translate the utf16 indices to utf8 indices
    int utf8idx = 0;
    int i = 0;
    foreach( uint utf16idx, char c; wstr ){
        // stop comparing once every index has been translated
        if( i < indices8.length && indices8[i] is utf16idx ){
            indices8[i] = utf8idx;
            i++;
        }
        utf8idx++;
    }

I took a glance at the foreach implementation (aApply.d line 308, _aApplywc2), and guess what I found there? The "encode"/"decode" functions from Phobos, hehe.

I think we really need something like that in order to implement other looping constructs as well.
For example, when I want to iterate over two strings code point by code point, foreach cannot be used.
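
To illustrate the two-string case, here is a minimal sketch of hand-written decode steps that advance an index by exactly one code point, so that a char[] and a wchar[] can be walked in lockstep. The names decode8, decode16 and sameCodePoints are made up for this example and are not the Phobos or Tango functions; the code assumes well-formed input and does no validation.

```d
// Hypothetical helpers for illustration only, not the Phobos or Tango API.
// Both assume well-formed input and perform no validation.

// Decodes the UTF-8 code point starting at s[i] and advances i past it.
dchar decode8( in char[] s, ref size_t i ){
    uint c = s[i++];
    if( c < 0x80 ) return cast(dchar) c;               // single byte (ASCII)
    int extra = ( c >= 0xF0 ) ? 3 : ( c >= 0xE0 ) ? 2 : 1; // continuation bytes
    c &= 0x3F >> extra;                                 // payload bits of the lead byte
    while( extra-- > 0 ) c = ( c << 6 ) | ( s[i++] & 0x3F );
    return cast(dchar) c;
}

// Decodes the UTF-16 code point starting at s[i] and advances i past it.
dchar decode16( in wchar[] s, ref size_t i ){
    uint w = s[i++];
    if( w < 0xD800 || w > 0xDBFF ) return cast(dchar) w; // not a lead surrogate
    uint lo = s[i++];                                    // trail surrogate
    return cast(dchar)( 0x10000 + (( w - 0xD800 ) << 10 ) + ( lo - 0xDC00 ));
}

// Walks both strings one code point at a time and compares them.
bool sameCodePoints( in char[] a, in wchar[] b ){
    size_t ia = 0, ib = 0;
    while( ia < a.length && ib < b.length ){
        if( decode8( a, ia ) != decode16( b, ib )) return false;
    }
    return ia == a.length && ib == b.length;
}
```

Because each call moves its own index, the two strings can use different encodings and still stay aligned on code point boundaries, which is what a single foreach cannot express.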

Posted: 05/21/08 20:08:30

The same functions are currently in use in the Regex implementation too, but apparently they are buggy and so cause unicode failures in several situations - seems like they're not finding the correct code points in all cases.

But yes, I believe there is a ticket to create proper versions of these functions. Unfortunately the owner has disappeared.

Posted: 05/22/08 01:08:05

You mean you want to use the templates in tango.text.convert.Utf?

Posted: 05/23/08 00:13:54

Hm, I am not sure my foreach loop above really behaves correctly.
So now I have tried to use tango.text.convert.Utf:

char[] str8 = something(); // the existing string
wchar[] str16 = toString16( str8 );

int[ MAX_ITEMS ] idxbuf;
int idxcount;
WindowsFunction( str16.ptr, str16.length, idxbuf.ptr, MAX_ITEMS, &idxcount );

int[] indices = idxbuf[ 0 .. idxcount ];
// indices are ascending
// the first index is always 0
// the last index is always str16.length

int utf8idx = 0;
int utf16idx = 0;
int i = 0;

while( utf16idx < str16.length ){

    // convert to a single dchar, to ensure exactly one codepoint is eaten
    uint ate16;
    dchar[1] buf32;
    dchar[] res32 = Utf.toString32( str16[ utf16idx .. $ ], buf32, &ate16 ); 
    assert( ate16 >= 1 && ate16 <= 2 );
    assert( res32.length == 1 );
    assert( res32.ptr is buf32.ptr );
    
    // convert to char[] to get the char[] count
    uint ate32;
    char[4] buf8;
    char[] res8 = Utf.toString( res32, buf8, &ate32 ); 
    assert( ate32 is 1 );
    assert( res8.length > 0 && res8.length <= buf8.length );
    assert( res8.ptr is buf8.ptr );

    if( indices[i] is utf16idx ){
        indices[i] = utf8idx;
        i++;
    }

    utf16idx += ate16;
    utf8idx += res8.length;
}
assert( utf8idx is str8.length );
assert( utf16idx is str16.length );

// the last index points just past the end of str16
indices[$-1] = utf8idx;

// result in indices

http://codepad.org/bkqlerST

Is this correct?
Can it be written shorter?
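
One way it might be written shorter, as a sketch rather than a tested Tango solution: the width of a code point in both encodings can be read from its leading UTF-16 code unit alone, so no conversion buffers are needed. The function name toUtf8Indices is made up for illustration, it uses size_t instead of int, and it assumes well-formed UTF-16 and ascending indices that sit on code point boundaries.

```d
// Hypothetical helper for illustration only, not a Tango or Phobos function.
// Maps ascending UTF-16 code-unit indices to UTF-8 byte indices.
// Assumes well-formed UTF-16 and indices on code point boundaries.
size_t[] toUtf8Indices( in wchar[] str16, in size_t[] indices16 ){
    size_t[] indices8 = new size_t[ indices16.length ];
    size_t utf8idx = 0;
    size_t utf16idx = 0;
    size_t i = 0;

    while( i < indices16.length ){
        if( indices16[i] == utf16idx ){
            indices8[i] = utf8idx;   // translate the index at this position
            i++;
            continue;                // the next index may be the same position
        }
        if( utf16idx >= str16.length ) break; // index beyond the string: give up

        wchar w = str16[ utf16idx ];
        if( w < 0x80 ){                          // U+0000 .. U+007F
            utf8idx += 1; utf16idx += 1;
        } else if( w < 0x800 ){                  // U+0080 .. U+07FF
            utf8idx += 2; utf16idx += 1;
        } else if( w >= 0xD800 && w <= 0xDBFF ){ // lead surrogate: a pair
            utf8idx += 4; utf16idx += 2;
        } else {                                 // rest of the BMP
            utf8idx += 3; utf16idx += 1;
        }
    }
    return indices8;
}
```

For the string "a\u20AC\U0001D11Eb"w and the indices 0, 1, 2, 4, 5 this should give 0, 1, 4, 8, 9, with the last index handled like any other because the match is checked before the end-of-string test.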