I'm writing a crawler bot for a search engine project and have run into a lack of documentation. Can anybody tell me how:

1. to obtain the path only (so I can filter it with robots.txt)? All I can get is what getPath() returns, like '/pages/gallery.html'. How do I get just '/pages/', so that I don't mess up a starting URL like '' with no filename?

2. to send my crawler's User-Agent when fetching pages? I'm using a sample as a base to retrieve HTML pages.


  • if Uri doesn't split the path components, then you could use the one in"path");
  • HttpClient has a method named getRequestHeaders(). Use that to obtain the HttpHeaders instance, and add your additional headers there.
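
A minimal sketch of the second point, in case it helps. It assumes Tango's HttpGet (a subclass of HttpClient) and the HttpHeader.UserAgent constant from tango.net.http.HttpConst; the URL and the bot name are placeholders, and the exact names should be checked against your Tango version.

```d
import tango.io.Stdout;
import tango.net.http.HttpGet;
import tango.net.http.HttpConst;

void main()
{
    // Placeholder URL, not taken from the question.
    auto page = new HttpGet("http://example.com/");

    // Add the crawler's User-Agent before the request is sent.
    // HttpHeader.UserAgent is assumed to be defined in HttpConst;
    // "MyCrawler/0.1" is a made-up bot name.
    page.getRequestHeaders().add(HttpHeader.UserAgent, "MyCrawler/0.1");

    // read() issues the request and returns the response body.
    Stdout(cast(char[]) page.read()).newline;
}
```

The header has to be added before the request goes out, so it is set right after constructing the client and before read().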

If I use that, wouldn't it add extra overhead to my code? I want it as small and fast as possible.

And I stripped out some code, as below (to avoid import bloat):

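		// get the folder part of the path
		char[] path = uri.getPath();
		int fn = 0;	// stays 0 when the path contains no '/'

		// scan backwards; the first '/' found is the last one in the path
		for (int i = path.length; --i >= 0;)
			if (path[i] == '/')
			{
				fn = i + 1;
				break;
			}

		path = path[0 .. fn];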
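
For reference, the same idea can be wrapped in a small standalone function with no imports at all. This is only a sketch: pathFolder is a hypothetical helper name, not something from Tango, and it assumes the path is already a plain char[] such as the one returned by getPath().

```d
// Hypothetical helper (not part of Tango): returns the folder part of a
// URL path, i.e. everything up to and including the last '/'.
char[] pathFolder(char[] path)
{
    // Walk backwards so the first '/' found is the last one in the path.
    for (size_t i = path.length; i-- > 0;)
        if (path[i] == '/')
            return path[0 .. i + 1];

    // No '/' at all: there is no folder part, so return an empty slice.
    return path[0 .. 0];
}

void main()
{
    // A filename is stripped, a trailing slash is kept as-is.
    assert(pathFolder("/pages/gallery.html".dup) == "/pages/");
    assert(pathFolder("/pages/".dup) == "/pages/");

    // Without a trailing slash, 'pages' is treated as a filename.
    assert(pathFolder("/pages".dup) == "/");
    assert(pathFolder("".dup) == "");
}
```

The result is a slice of the input, so nothing is copied. Note the third assert: a path without a trailing slash is still treated as ending in a filename.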

But it still won't handle nasty URLs like '' where 'pages' is a folder and that URL points to

Thanks for your reply. I know it's not related to Tango, but any suggestion is highly appreciated.

If pages is a folder, I'm pretty sure a webserver will return a 404 on that one. It has to be'

Apache is nice enough to redirect you to the second URL if you try the first one.

I didn't know that, thanks debio.