Thanks. At first my attempt at D/Tango was inspired by dlucene (D/Phobos), but we are actually building a real search engine, not an open source library. We used the HttpClient + Uri modules in our crawler (multiple crawler instances + a single indexer setup), but it failed after 200-300 docs. We tried different seeds and got the same result: it stopped at a random URL each time, not at the same problematic URL (I'll try to reproduce the error/exception codes), so we resorted to libcurl. The issues with tango.xml are minor, but we need a strong and stable HTML parser, and that was our main reason for using libxml. There are lots of things to consider when building a search engine for the web. Our primary focus is performance and stability, and in most areas we just use Tango's modules, and they have done great.
I'll have to dig up an old backup archive to find the tango.xml and tango.net versions and reproduce the bugs. I'll post it here as soon as I find it.
* some info on our project:
- distributed search engine, with each server holding up to 100-150 million docs for performance.
- custom index format (word level inverted index, with packed original source text)
- independent indexers with built-in crawler.
- custom ranking algorithm: modified BM25 + phrase proximity
- URL queue server (Tango's linked list + SQLite)
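
For readers who have not met BM25, here is a minimal Python sketch of the plain Okapi BM25 score for one document. It is not the modified version mentioned in the list, and it leaves out phrase proximity entirely. The function name, the parameter names, the defaults `k1=1.2` and `b=0.75`, and the always-positive IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Plain Okapi BM25 score of one document for a bag-of-words query.

    doc_freq maps a term to the number of documents containing it.
    k1 and b are the usual tuning constants (assumed defaults).
    """
    term_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = term_counts.get(term, 0)
        if f == 0:
            continue  # a term missing from the document contributes nothing
        n = doc_freq.get(term, 0)
        # IDF variant that stays positive even for very common terms.
        idf = math.log(1.0 + (n_docs - n + 0.5) / (n + 0.5))
        # Term frequency saturates with k1 and is normalised by length via b.
        length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * f * (k1 + 1.0) / (f + length_norm)
    return score
```

A proximity bonus would typically be added on top of this per-term sum, using the word positions that a word-level inverted index already stores.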
Also, we had to build a custom file stream to support our 3-byte uint and 5-byte ulong integer data. We used this method instead of vint to pack integers, since vint requires twice the I/O overhead, which is expensive.
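
To make the fixed-width idea concrete, here is a minimal Python sketch of reading and writing 3-byte and 5-byte unsigned integers on a stream. It is only an illustration, not the actual D/Tango stream code. The helper names and the little-endian byte order are assumptions, since the on-disk layout is not described here.

```python
import io


def write_fixed(stream, value, width):
    """Write `value` as an unsigned little-endian integer of `width` bytes."""
    if not 0 <= value < (1 << (8 * width)):
        raise ValueError(f"{value} does not fit in {width} bytes")
    stream.write(value.to_bytes(width, "little"))


def read_fixed(stream, width):
    """Read back an unsigned little-endian integer of `width` bytes."""
    data = stream.read(width)
    if len(data) != width:
        raise EOFError("truncated integer")
    return int.from_bytes(data, "little")


# Hypothetical convenience wrappers for the two widths mentioned above.
def write_uint24(stream, value):
    write_fixed(stream, value, 3)


def read_uint24(stream):
    return read_fixed(stream, 3)


def write_uint40(stream, value):
    write_fixed(stream, value, 5)


def read_uint40(stream):
    return read_fixed(stream, 5)


if __name__ == "__main__":
    buf = io.BytesIO()
    write_uint24(buf, 0xABCDEF)      # e.g. a doc-local value, 3 bytes
    write_uint40(buf, 0x123456789A)  # e.g. a file offset, 5 bytes
    buf.seek(0)
    print(hex(read_uint24(buf)), hex(read_uint40(buf)))
```

Because the width is known up front, each value is fetched with a single read of a fixed size. A vint decoder has to inspect a continuation bit in every byte to learn where the value ends, which is one plausible reading of the extra overhead mentioned above.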