
tango.net.uri question


Posted: 02/17/09 00:16:05

I'm writing a crawler bot for a search engine project and have run into a problem with the lack of documentation for tango.net. Can anybody tell me how to:

1. Obtain only the directory part of the path (so I can filter it against robots.txt)? I'm using tango.net.uri and all I can get is getPath(), which returns something like '/pages/gallery.html'. How do I get just '/pages/', so that I don't mess up a starting URL like 'http://www.wikipedia.org' that has no filename?

2. Send my crawler's User-Agent when fetching a page? I'm using the tango.net.http.HttpClient sample as a base to retrieve HTML pages.

Thanks.


Posted: 02/17/09 01:56:31 -- Modified: 02/17/09 01:56:52 by kris

  • If Uri doesn't split the path components, then you could use the parser in tango.io.Path: parse("path").
  • HttpClient has a method named getRequestHeaders(). Use that to obtain the HttpHeaders instance, and add your additional headers there.
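The second suggestion could look roughly like the sketch below. Only getRequestHeaders() is confirmed by the reply above; the add() call, the HttpHeader.UserAgent constant, the HttpConst import, the URL, and the agent string are assumptions based on how Tango's HTTP package is usually laid out, so check them against the Tango version you have installed.

```d
import tango.net.http.HttpClient;
import tango.net.http.HttpConst;   // assumed to define HttpHeader.UserAgent

int main()
{
    // placeholder URL, not from the original post
    auto client = new HttpClient(HttpClient.Get, "http://www.example.com/");
    scope (exit) client.close();

    // getRequestHeaders() is the method named in the reply above;
    // add(HttpHeader.UserAgent, ...) is an assumption about HttpHeaders
    client.getRequestHeaders().add(HttpHeader.UserAgent, "MyCrawler/0.1");

    // send the request; the extra header should go out with it
    client.open();

    // report failure through the exit code if the server did not answer OK
    return client.isResponseOK() ? 0 : 1;
}
```

The header has to be added before open() is called, since that is when the request is written to the server.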

Posted: 02/17/09 02:30:39

If I use tango.io.Path, wouldn't that add extra overhead to my code? I want it to be as small and fast as possible.

I stripped some code from tango.io.Path, as below (to avoid import bloat):

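		// get the directory part of the path:
		// everything up to and including the last '/'
		char[] path = uri.getPath();
		size_t fn = 0;

		// scan backwards and stop at the last '/'
		for (size_t i = path.length; i-- > 0;)
			if (path[i] == '/')
			{
				fn = i + 1;
				break;
			}

		// no '/' found leaves fn at 0, giving an empty slice
		path = path[0 .. fn];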

But it still won't handle nasty URLs like 'http://test.com/pages?req=1', where 'pages' is a folder and the URL actually points to index.xxx.

Thanks for your reply. I know this isn't really related to Tango, but any suggestion is highly appreciated.
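For reference, the same backward scan can be wrapped in a small standalone function. This is only a sketch: dirOf is a hypothetical helper name, not something from Tango, and it is written as a template so that it accepts both D1 char[] slices and D2 strings.

```d
// Hypothetical helper (not part of Tango): returns the directory portion of
// a URL path, i.e. everything up to and including the last '/'.
T[] dirOf(T)(T[] path)
{
    // walk backwards so the first '/' we meet is the last one in the path
    for (size_t i = path.length; i-- > 0;)
        if (path[i] == '/')
            return path[0 .. i + 1];

    // no '/' at all (for example an empty path): return an empty slice
    return path[0 .. 0];
}
```

With this sketch, dirOf("/pages/gallery.html") gives "/pages/" and dirOf("/pages") gives "/". An empty path (a start URL with nothing after the host) gives an empty slice, which a crawler would probably want to treat as "/" before checking robots.txt.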

Posted: 02/17/09 03:23:06 -- Modified: 02/17/09 03:24:43 by debio -- Modified 2 Times

http://test.com/pages?req=1

If pages is a folder, I'm pretty sure a web server will return a 404 on that one. It has to be 'http://test.com/pages/?req=1'.

Apache is nice enough to redirect you to the second URL if you try the first one.

Posted: 02/17/09 03:44:10

I didn't know that. Thanks, debio.