View previous topic :: View next topic |
Author |
Message |
lightoze
Joined: 12 Feb 2006 Posts: 35
|
Posted: Wed Mar 08, 2006 9:32 am Post subject: std.string |
|
|
1) This simple unittest hangs up:
Code: | assert( rfind!(char)( "more fffruits", "fff" ) == 5 ); |
2) I am rewriting "find" and "rfind" functions now using much more efficient Knuth-Morris-Pratt algorithm. Also I will add useful "findAll" function and send new file this night.
3) I think current "split" function does not act as it is expected. May be it must act as PHP "explode" function - so you can give delimeter which would be used? |
|
Back to top |
|
|
brad Site Admin
Joined: 22 Feb 2004 Posts: 490 Location: Atlanta, GA USA
|
|
Back to top |
|
|
sean
Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA
|
Posted: Wed Mar 08, 2006 12:02 pm Post subject: |
|
|
The algorithm I used is similar to the Knuth one except I begin comparing with pattern[0] and simply reduce the search string by pattern.length instead of matching against pattern[$-1] as I believe the Knuth version does. But it's obviously still broken. I'll fix it today and if you want to send me your rewrite then please do so. |
|
Back to top |
|
|
lightoze
Joined: 12 Feb 2006 Posts: 35
|
Posted: Wed Mar 08, 2006 12:08 pm Post subject: |
|
|
On quite small strings KMP is faster a bit, on large strings - otherwise. This algorithms has similar asymptotics.
Also I have recently noticed, that pos argument is not useful, because using slicing is more clean. |
|
Back to top |
|
|
sean
Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA
|
Posted: Wed Mar 08, 2006 12:46 pm Post subject: |
|
|
Good point regarding pos. Perhaps it should be removed.
As for split--I'm planning to include split and join from Phobos, and that's the only function I'd gotten to when 149 was released. So expect functions to split on whitespace, a single char pivot, and possibly a substring. Join will be much the same. |
|
Back to top |
|
|
sean
Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA
|
Posted: Wed Mar 08, 2006 5:03 pm Post subject: |
|
|
I've checked in some updates to std.string, with more forthcoming. The code is generally just tightened up and bugs have been fixed. I've held off on switching to the Knuth algorithm for now as the memory allocation for prefixes can be problematic for very large strings, and I want these algorithms to be usable for all cases. I think it may be reasonable is to choose an appropriate implementation based on pattern length, so average-sized strings would use the Knuth algorithm and large strings would use the current algorithm. An alternative would be to provide both functions with different names so a user could choose the appropriate one for the situation.
[edit]
Ah, it was the Boyer-Moore algorithm I was recalling that simply indexes chars in the pattern. I suspect that it's faster than the KMP algorithm when there are few partial matches, and that the KMP algorithm is faster when there are many partial matches.
Am I placing too much importance on memory use? In practice I suspect it's likely that the pattern string will always be significantly smaller than the search string, whether each are measured in bytes or in gigabytes, and so the BM and KMP algorithms will probably always perform better than the naieve version I've implemented. Still, I don't want the algorithms to simply be unusable for specialized applications. |
|
Back to top |
|
|
lightoze
Joined: 12 Feb 2006 Posts: 35
|
Posted: Wed Mar 08, 2006 6:25 pm Post subject: |
|
|
Ok, suppose that pattern will be 100MB size, so native search algorythm will take str.length * 100'000'000 time. This is MUCH MORE longer if you would use KMP or BM and on-disc swap. Any protests?
P.S. In this case, KMP will be more efficient because of less memory usage also.
P.P.S. Why does wiki downloads page have url ".../ares/wiki/Downlaods"? |
|
Back to top |
|
|
sean
Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA
|
Posted: Wed Mar 08, 2006 6:46 pm Post subject: |
|
|
lightoze wrote: | Ok, suppose that pattern will be 100MB size, so native search algorythm will take str.length * 100'000'000 time. This is MUCH MORE longer if you would use KMP or BM and on-disc swap. Any protests? |
I think you're right for average case, though such large allocations can take a while. I was also concerned about memory overhead, but it's probably not worth the speed cost. I'll switch to using the KMP algorithm and leave the current code around as a curiosity.
Quote: | P.S. In this case, KMP will be more efficient because of less memory usage also. |
Don't both KMP and BM allocate a buffer equal to the pattern size? How is KMP more efficient?
Quote: | P.P.S. Why does wiki downloads page have url ".../ares/wiki/Downlaods"? |
I'll change it to be all lowercase. |
|
Back to top |
|
|
lightoze
Joined: 12 Feb 2006 Posts: 35
|
Posted: Wed Mar 08, 2006 6:58 pm Post subject: |
|
|
sean wrote: | Don't both KMP and BM allocate a buffer equal to the pattern size? How is KMP more efficient? |
BM uses two additionals arrays, instead of one in KMP.
sean wrote: | I'll change it to be all lowercase. |
I meant "why downlAods?". |
|
Back to top |
|
|
sean
Joined: 24 Jun 2004 Posts: 609 Location: Bay Area, CA
|
Posted: Wed Mar 08, 2006 7:01 pm Post subject: |
|
|
Because I didn't see the typo It's now fixed. |
|
Back to top |
|
|
|