General information about clients
Submitted by thindil on Tue, 09/16/2008 - 10:05.
1. Downloading workunit
- The address of the workunit server should be configurable by the user.
- The client connects to the server over HTTP and sends the username and password using basic access authentication, without waiting for an authentication challenge from the server.
- After successful authorization, the client downloads a new workunit file.
- If the client uses a proxy, it must send two additional HTTP headers:
Cache-control: no-cache\r\n
Pragma: no-cache\r\n
- If the client supports running multiple crawlers simultaneously, it should queue them so that only one crawler from the client connects to the server at a time.
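The download request described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name build_workunit_request, its parameters, and the example URL are hypothetical and not part of the Grub protocol. The sketch shows the Authorization header being attached up front (no challenge round trip) and the two no-cache headers being added when a proxy is in use.

```python
import base64
import urllib.request


def build_workunit_request(server_url, username, password, via_proxy=False):
    """Build a GET request that carries Basic credentials from the start.

    All names here are illustrative; only the header names and values
    come from the specification text.
    """
    # Basic access authentication: base64("username:password").
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")

    request = urllib.request.Request(server_url)
    # Sent immediately, without waiting for a 401 challenge.
    request.add_header("Authorization", f"Basic {token}")

    if via_proxy:
        # The two extra headers required when going through a proxy.
        request.add_header("Cache-control", "no-cache")
        request.add_header("Pragma", "no-cache")
    return request


if __name__ == "__main__":
    req = build_workunit_request(
        "http://example.invalid/workunit", "user", "pass", via_proxy=True
    )
    for name, value in req.header_items():
        print(f"{name}: {value}")
```

The request object is only built here, not sent; a real client would pass it to an opener configured with the user's proxy settings.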
2. Crawling
- The client should check that the downloaded workunit is valid before it starts crawling. If the workunit is invalid, the client drops it and returns to the download stage.
- The client crawls the URLs given by the workunit. For each URL in the workunit, it sends one request to the given host. If a response is received, it writes the response text to an .arc file.
- If the HTTP response code is 200, the client must parse the response for new URLs and create a sitemap file with the same name as the .arc file.
- The client must be able to limit the number of bytes received per request. The default is 25 MB per request (later, the client will read this value from the workunit). If the server sends more bytes, the client should write only part of the response to the .arc file.
- The client does not follow any links on the crawled pages and does not follow any redirects.
- If the server returns only HTTP headers, the client must record that page as a client error.
- The client must check the page content for a META ROBOTS tag. If the page cannot be indexed (NOINDEX), the client must record that page as a client error. If the URLs on the page cannot be followed (NOFOLLOW), the client must not create a sitemap file from that page.
- The client must read the HTTP version of the request from the workunit file.
- The client must read the request type (currently GET) from the workunit file.
- HTTP error codes used by the client:
204 - the server sent no page content (the server answered HTTP 200 but sent only HTTP headers)
403 - the page cannot be indexed because of its META ROBOTS tag
404 - the client cannot resolve the server's IP address via DNS
408 - timeout
503 - the server sent no data (the client connected to the server but received nothing in response)
All other errors must be encoded as HTTP 500. All errors must be written to the .arc file. Example:
http://test 0.0.0.0 20080916122745 application/x-grub-error 39\n
HTTP/1.0 500 Invalid URL\r\n
\r\n
Invalid URL\n
- Order must be preserved and every request must be represented in the .arc file (none can be skipped) so that the Grub server can validate an entire workunit from the resulting .arc.
- The minimal set of HTTP headers sent by the client should be:
Accept: */*\r\n
Accept-encoding: gzip,deflate\r\n
Connection: close\r\n
User-Agent: [taken from workunit file]\r\n
- If the client uses a proxy, it must send additional HTTP headers:
For HTTP/1.0
Pragma: no-cache\r\n
For HTTP/1.1
Cache-control: no-cache\r\n
Pragma: no-cache\r\n
- After a successful crawl, the client must compress the .arc and sitemap files with gzip.
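The error record shown in the example of this section can be reproduced with a short Python sketch. The function name format_error_record and its parameters are hypothetical. Two points are assumptions inferred from the single example rather than stated rules: the length field (39) appears to count the HTTP block without the record's final newline, and the timestamp is taken to be UTC in YYYYMMDDhhmmss form.

```python
import time


def format_error_record(url, code, reason, timestamp=None, ip="0.0.0.0"):
    """Return one .arc error record as bytes.

    Layout follows the specification's example:
        <url> <ip> <timestamp> application/x-grub-error <length>\\n
        HTTP/1.0 <code> <reason>\\r\\n
        \\r\\n
        <reason>\\n
    """
    if timestamp is None:
        # Assumption: UTC; the specification does not name a time zone.
        timestamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())

    # The HTTP block: status line, empty line, then the reason as body.
    body = f"HTTP/1.0 {code} {reason}\r\n\r\n{reason}".encode("ascii")

    # Assumption: the length excludes the newline that ends the record,
    # which is what makes the example's value of 39 add up.
    header = (
        f"{url} {ip} {timestamp} application/x-grub-error {len(body)}\n"
    ).encode("ascii")
    return header + body + b"\n"


if __name__ == "__main__":
    record = format_error_record(
        "http://test", 500, "Invalid URL", timestamp="20080916122745"
    )
    print(record.decode("ascii"), end="")
```

Because every request must appear in the .arc file in workunit order, a client would append one such record in place of the response whenever a URL fails.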
3. Uploading .arc files
- The client must read the address of the upload server from the workunit file.
- The client connects to the server over HTTP and sends the compressed .arc file.
- The minimal set of HTTP headers sent by the client should be:
Connection: keep-alive\r\n
Content-Length: [.arc.gz + HTTP headers length in bytes]\r\n
- The client must read the request type (currently PUT) from the workunit file.
- The client then deletes the .arc file.
- If the client supports running multiple crawlers simultaneously, it should queue them so that only one crawler from the client connects to the server at a time.
4. Uploading sitemap files
- The address of the upload server for sitemaps should be configurable by the user.
- The client connects to the server over HTTP and sends the username and password using basic access authentication, without waiting for an authentication challenge from the server.
- After successful authorization, the client uploads the sitemap file.
- If the client supports running multiple crawlers simultaneously, it should queue them so that only one crawler from the client connects to the server at a time.
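The rule that only one crawler may talk to the server at a time (it appears in sections 1, 3 and 4) could be met in a multi-threaded client with a shared lock. The sketch below is one possible approach, not a prescribed design; the class name ServerGate and the fake_transfer stand-in are hypothetical. Note that a plain lock serializes access but does not guarantee strict first-come, first-served order.

```python
import threading


class ServerGate:
    """Lets only one crawler thread run a server transfer at a time."""

    def __init__(self):
        self._lock = threading.Lock()

    def run(self, action):
        # Other crawlers block here until the current transfer finishes.
        with self._lock:
            return action()


if __name__ == "__main__":
    gate = ServerGate()
    log = []

    def fake_transfer():
        # Stand-in for a workunit download or an upload.
        log.append(threading.current_thread().name)

    crawlers = [
        threading.Thread(target=gate.run, args=(fake_transfer,), name=f"crawler-{i}")
        for i in range(3)
    ]
    for crawler in crawlers:
        crawler.start()
    for crawler in crawlers:
        crawler.join()
    print(log)
```

Sharing a single gate between the download, .arc upload, and sitemap upload code paths keeps all three kinds of server connection serialized.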
5. Optional (but recommended) client features.
- Ability to set maximum download and upload speed for crawler(s)
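A speed limit of this kind is often implemented by pausing between chunks. The following Python sketch shows only the pacing arithmetic; the function name throttle_delay and the convention that a limit of zero or less means "unlimited" are assumptions made for illustration.

```python
def throttle_delay(bytes_transferred, elapsed_seconds, max_bytes_per_second):
    """Return how long to sleep so the average rate stays under the limit.

    bytes_transferred: total bytes sent or received so far
    elapsed_seconds: time since the transfer started
    max_bytes_per_second: configured limit (<= 0 is treated as unlimited)
    """
    if max_bytes_per_second <= 0:
        return 0.0
    # Earliest moment at which this many bytes are allowed to have moved.
    earliest_allowed = bytes_transferred / max_bytes_per_second
    return max(0.0, earliest_allowed - elapsed_seconds)


if __name__ == "__main__":
    # 1000 bytes moved in 0.5 s with a 1000 B/s limit: wait another 0.5 s.
    print(throttle_delay(1000, 0.5, 1000))
```

A crawler would call this after each chunk and sleep for the returned number of seconds before reading or writing the next one.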
