std.string

lightoze · Joined: 12 Feb 2006 Posts: 35

1) This simple unittest hangs up:

brad · Posted: Wed Mar 08, 2006 11:13 am Post subject: Re: std.string

sean · Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA

The algorithm I used is similar to the Knuth one except I begin comparing with pattern[0] and simply reduce the search string by pattern.length instead of matching against pattern[$-1] as I believe the Knuth version does. But it's obviously still broken. I'll fix it today and if you want to send me your rewrite then please do so.

lightoze · Joined: 12 Feb 2006 Posts: 35

On fairly small strings KMP is a bit faster; on large strings it is the other way around. The two algorithms have similar asymptotics.
Also, I recently noticed that the pos argument is not useful, because slicing is cleaner.
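
To illustrate the point about pos, here is a minimal sketch in Python (used only for illustration; the library under discussion is written in D). Both function names are hypothetical and are not part of std.string. The second one searches a slice and adds the offset back, which is what a caller would do if the pos parameter were removed.

```python
def find_from(text, pattern, pos=0):
    """Hypothetical find with an explicit start position."""
    return text.find(pattern, pos)


def find_in_slice(text, pattern, pos):
    """Same search expressed with slicing instead of a pos argument."""
    idx = text[pos:].find(pattern)  # index relative to the slice
    # Translate back to an index into the full string.
    return -1 if idx < 0 else idx + pos
```

Note that a D array slice refers to the original data without copying, while a Python string slice makes a copy, so the slicing form is cheaper in D than this sketch suggests.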

sean · Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA

Good point regarding pos. Perhaps it should be removed.

As for split--I'm planning to include split and join from Phobos, and that's the only function I'd gotten to when 149 was released. So expect functions to split on whitespace, a single char pivot, and possibly a substring. Join will be much the same.

sean · Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA

I've checked in some updates to std.string, with more forthcoming. The code is generally just tightened up and bugs have been fixed. I've held off on switching to the Knuth algorithm for now, as the memory allocation for prefixes can be problematic for very large strings, and I want these algorithms to be usable in all cases. I think it may be reasonable to choose an appropriate implementation based on pattern length, so average-sized strings would use the Knuth algorithm and large strings would use the current algorithm. An alternative would be to provide both functions under different names so a user could choose the appropriate one for the situation.
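
A rough sketch of the dispatch idea, in Python for illustration only (this is not the std.string code). The prefix table is the allocation in question: one integer per pattern element. The threshold KMP_MAX_PATTERN and all function names are assumptions made up for this example.

```python
def build_prefix_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it. Needs len(pattern) integers."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text, pattern):
    """Knuth-Morris-Pratt search; returns the first match index or -1."""
    if not pattern:
        return 0
    table = build_prefix_table(pattern)
    k = 0  # number of pattern elements currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


def naive_find(text, pattern):
    """Allocation-free fallback: try every start position."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        if text[start:start + m] == pattern:
            return start
    return -1


KMP_MAX_PATTERN = 1 << 20  # hypothetical cutoff, not a measured value


def find(text, pattern):
    """Pick an implementation based on pattern length."""
    if len(pattern) <= KMP_MAX_PATTERN:
        return kmp_find(text, pattern)
    return naive_find(text, pattern)
```

Where the cutoff should sit would have to be settled by measurement; the value above is a placeholder.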

[edit]

Ah, it was the Boyer-Moore algorithm I was recalling that simply indexes chars in the pattern. I suspect that it's faster than the KMP algorithm when there are few partial matches, and that the KMP algorithm is faster when there are many partial matches.

Am I placing too much importance on memory use? In practice I suspect the pattern string will almost always be significantly smaller than the search string, whether each is measured in bytes or in gigabytes, and so the BM and KMP algorithms will probably always perform better than the naive version I've implemented. Still, I don't want the algorithms to simply be unusable for specialized applications.
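
For reference, a minimal sketch of the "index the chars in the pattern" idea, again in Python for illustration. This is the simplified Horspool variant of Boyer-Moore (bad-character shifts only), not the full algorithm and not the code in std.string; the function name is hypothetical. The shift table holds at most one entry per distinct pattern element, so its size is bounded by the alphabet rather than by the pattern length.

```python
def horspool_find(text, pattern):
    """Boyer-Moore-Horspool search; returns the first match index or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For each element except the last, the distance from its rightmost
    # occurrence to the end of the pattern. Later entries overwrite
    # earlier ones, so the rightmost occurrence wins.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift by the text element aligned with the pattern's last slot;
        # elements absent from the table allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

When the text element under the end of the pattern does not occur in the pattern at all, the search skips m positions at once, which is why this family tends to do well when partial matches are rare.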

lightoze · Joined: 12 Feb 2006 Posts: 35

OK, suppose the pattern is 100 MB in size. Then the naive search algorithm will take on the order of str.length * 100'000'000 time. That is MUCH longer than using KMP or BM, even with on-disk swapping. Any objections?
P.S. In this case KMP will also be more efficient because it uses less memory.
P.P.S. Why does the wiki downloads page have the URL ".../ares/wiki/Downlaods"?

sean · Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA

Because I didn't see the typo.

It's now fixed.