The Developer's Library for D

Moderators: larsivi kris

Posted: 02/17/09 00:16:05

I'm writing a crawler bot for a search engine project and am running into a lack of documentation. Can anybody tell me how:

1. to obtain the path only (so I can filter it against robots.txt)? I'm using and all I can get is getPath(), which returns something like '/pages/gallery.html'. How do I get just '/pages/', so I don't get tripped up by a starting URL like '' with no filename?

2. I'm using sample as a base to retrieve HTML pages. Is there a way to send my crawler's user-agent when fetching a page?



Posted: 02/17/09 01:56:31 -- Modified: 02/17/09 01:56:52 by

  • if Uri doesn't split the path components, then you could use the one in"path");
  • HttpClient has a method named getRequestHeaders(); use that to obtain the HttpHeaders instance, and add your additional headers there

Posted: 02/17/09 02:30:39

If I use, wouldn't that add extra overhead to my code? I want it as small and fast as possible.

So I stripped some code from as below (to avoid import bloat):

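		// get the folder part of the path: everything up to and including the last '/'
		char[] path = uri.getPath();
		int fn = 0;	// stays 0 when the path contains no '/'

		for (int i = path.length; --i >= 0;)
			if (path[i] == '/')
			{
				fn = i + 1;
				break;	// last '/' found, no need to scan further
			}

		path = path[0 .. fn];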

But it still won't handle nasty URLs like '' where 'pages' is a folder and that URL points to

Thanks for your reply. I know it's not related to Tango, but any suggestion is highly appreciated.

Posted: 02/17/09 03:23:06 -- Modified: 02/17/09 03:24:43 by
debio -- Modified 2 Times

If pages is a folder, I'm pretty sure a webserver will return a 404 on that one. It has to be'

Apache is nice enough to redirect you to the second URL if you try the first one.
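
For what it's worth, here is a minimal sketch of how the two pieces could fit together once the server has settled on the final URL. It uses only plain D slices (no Tango imports) and is written against a current D compiler, so it uses string where the snippet above uses char[]. The names dirOf and isDisallowed are made up for illustration, and the prefix test is a simplified reading of robots.txt Disallow rules (no wildcards, no Allow lines), not a full parser.

```d
// Hypothetical helper: the folder part of a URL path, i.e. everything
// up to and including the last '/'. Returns "" if there is no '/'.
string dirOf(string path)
{
    for (size_t i = path.length; i > 0; --i)
        if (path[i - 1] == '/')
            return path[0 .. i];
    return "";
}

// Hypothetical helper: true if the path starts with any of the given
// Disallow prefixes. Empty prefixes are skipped, since an empty
// Disallow line means "nothing is disallowed".
bool isDisallowed(string path, string[] disallow)
{
    foreach (prefix; disallow)
        if (prefix.length > 0
            && path.length >= prefix.length
            && path[0 .. prefix.length] == prefix)
            return true;
    return false;
}

unittest
{
    // A path that already ends in '/' is its own folder.
    assert(dirOf("/pages/gallery.html") == "/pages/");
    assert(dirOf("/pages/") == "/pages/");
    assert(dirOf("") == "");

    string[] rules = ["/private/"];
    assert(isDisallowed("/private/a.html", rules));
    assert(!isDisallowed("/pages/gallery.html", rules));
}
```

Since the prefix test works on the full path as well as on the folder part, the value from getPath() could probably be passed to it directly, with dirOf only needed where the folder itself is wanted.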

Posted: 02/17/09 03:44:10

I didn't know that, thanks debio.