
Text.util substitute with utf characters


Posted: 05/18/09 09:43:34

Hey,

I have some problems using the tango.text.Util functions. I need to strip characters from an input text, which is defined as dchar[]. I created a dchar[] array holding the characters that I would like to strip from each text.

const   dchar[] CHARS_2STRIP    = [
                                      0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 
                                      0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002E, 0x002F, 
                                      0x003A, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F, 
                                      0x0040, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F, 0x0060,  
                                      0x007B, 0x007C, 0x007D, 0x007E, 
                                      ];  

The problem is how to use substitute:

1.

foreach (character; TextTrimData.CHARS_2STRIP)
{
    text = TextUtil.substitute(text.dup, character, " ");
}

Error: template tango.text.Util.substitute(T) cannot deduce template function from argument types !()(dchar[],dchar,char[1u])

2.

foreach (character; TextTrimData.CHARS_2STRIP)
{
    text = TextUtil.substitute(text.dup, cast(dchar[]) character, cast(dchar[])" ");
}
Error: e2ir: cannot cast from dchar to dchar[]

Can somebody give me a hint?

/lars


Posted: 05/18/09 09:53:08 -- Modified: 05/18/09 09:55:30 by lars_kirchhoff

OK, I found a solution more quickly than I thought :)

foreach (character; TextTrimData.CHARS_2STRIP)
{
    text = TextUtil.replace(text.dup, character, cast(dchar)' ');
}

Is there any faster method?

Posted: 05/18/09 14:32:20

Doing all the replacements in a single pass might be faster. If the strings are long, that seems likely.

foreach (s; delimiters(text, TextTrimData.CHARS_2STRIP))
{
    s[] = ' ';
}

Posted: 05/18/09 20:45:53 -- Modified: 05/18/09 21:01:24 by torhu -- Modified 2 Times

Never mind, that won't work. This should:

foreach (ref s; text)
{
  if (contains(TextTrimData.CHARS_2STRIP, s))
    s = ' ';
}

Posted: 05/20/09 15:12:37 -- Modified: 05/20/09 15:14:35 by lars_kirchhoff

Thank you very much. Your solution is indeed faster. I just wrote a small test program:

module trim;


import  Integer = tango.text.convert.Integer;
import  TextUtil = tango.text.Util;
import  tango.time.StopWatch;
import  tango.io.Stdout;

const   dchar[] CHARS_2STRIP    = [
                                   0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 
                                   0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002E, 0x002F, 
                                   0x003A, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F, 
                                   0x0040, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F, 0x0060,  
                                   0x007B, 0x007C, 0x007D, 0x007E, 
                                   ];  


void main ( char[][] args )
{
    uint runs = Integer.toInt(args[1]);
    // Put a text in here. This one is shorter than the one used for the test.
    dchar[] text = "Lorem ipsum dolor sit amet, consectetur laboris re. Reliquarum mihi de opere quod voluptatibus magno...";
    StopWatch sw;
    
    
    sw.start;    
    for (uint i=0; i<runs; i++)
    {
        foreach (character; CHARS_2STRIP)
        {
            text = TextUtil.replace(text.dup, character, cast(dchar)' ');
        }
    }
    Stdout.formatln("Time: {:f4}", sw.stop);
    
    
    sw.start;
    for (uint i=0; i<runs; i++)
    {    
        foreach (ref s; text)
        {
            if (TextUtil.contains(CHARS_2STRIP, s))
                s = ' ';
        }
    }
    Stdout.formatln("Time: {:f4}", sw.stop);    
}

Your solution becomes even better when the CHARS_2STRIP array becomes larger:

Times for 31 items:
Time: 6.6042
Time: 4.1018

Times for 272 items:
Time: 54.8661
Time: 25.6716

/lars

Posted: 05/20/09 16:03:49

The example using replace() does a .dup of the whole content for every character in CHARS_2STRIP on every run, so you are probably timing the memory allocator rather than the replacement itself. Also, such benchmarks are often quite fickle: the number of runs should be in the thousands, at minimum, to help amortize whatever the OS does in the background. You should also ensure that CPU throttling (or power saving) is disabled.
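
To make the point about .dup concrete, here is a minimal sketch that copies the text once and then lets replace() work on that copy. It assumes replace() modifies its argument in place and returns it, which is how the earlier posts use it. The name stripWithReplace and the shortened character list are hypothetical, chosen only for illustration.

```d
import TextUtil = tango.text.Util;

// Shortened sample list for illustration; the thread uses a longer one.
const dchar[] CHARS_2STRIP = [0x0022, 0x002C, 0x002E];

// Hypothetical helper: duplicate the input once, then replace in place.
// Assumption: TextUtil.replace() overwrites matches inside `text` itself.
dchar[] stripWithReplace(dchar[] original)
{
    dchar[] text = original.dup;   // one allocation per call, not one per character

    foreach (character; CHARS_2STRIP)
        TextUtil.replace(text, character, cast(dchar)' ');

    return text;
}

unittest
{
    assert(stripWithReplace("a,b.c"d.dup) == "a b c"d);
}
```

In a benchmark loop, this keeps the allocator out of the inner loop, so the timing reflects the replacement work more closely.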

That aside, using contains() or locate() will offer good throughput since they can both test multiple characters at a time (depending on CPU and character width).
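
If the set of characters to strip keeps growing, another option is a lookup table, so that the cost per character no longer depends on the size of CHARS_2STRIP. The following is only a sketch under one loud assumption: every entry in CHARS_2STRIP is below 0x80, which holds for the 31-item list posted above but may not hold for the 272-item one. The helper name stripChars is hypothetical and not part of Tango.

```d
const dchar[] CHARS_2STRIP = [
    0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
    0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002E, 0x002F,
    0x003A, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
    0x0040, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F, 0x0060,
    0x007B, 0x007C, 0x007D, 0x007E,
];

// Hypothetical helper, not part of Tango: replaces every character of
// `text` that occurs in `chars` with a space, in place.
// Assumption: all code points in `chars` are below 0x80.
void stripChars(dchar[] text, in dchar[] chars)
{
    bool[0x80] strip;              // one flag per ASCII code point, all false

    foreach (c; chars)
    {
        assert(c < strip.length);  // guards the ASCII-only assumption
        strip[c] = true;
    }

    foreach (ref c; text)          // single pass, one table lookup per character
    {
        if (c < strip.length && strip[c])
            c = ' ';
    }
}

unittest
{
    dchar[] text = "Lorem ipsum, dolor (sit) amet."d.dup;
    stripChars(text, CHARS_2STRIP);
    assert(text == "Lorem ipsum  dolor  sit  amet "d);
}
```

For character sets that go beyond ASCII, the table would need to be larger or replaced by another structure, so contains() may remain the simpler choice there.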