Introducing project: Grub C Client

Written in the language of the great projects, the Grub C Client aims for efficiency by sticking to the basics. The executable is only a few KB, and the memory footprint is determined by the C runtime and the largest downloaded page. This makes it the most convenient client for crawling from a server, while still being usable by home users, although they may feel more comfortable with a GUI :-)

Crossplatform:
This program only needs a C compiler (preferably GCC) and a system supporting POSIX and BSD sockets to run on. Although the Windows implementation of these interfaces isn't really conformant, Windows is supported anyway.

Usage:
GrubCClient

This will create the workunit with the specified web pages and upload it to the right URL.

Installation:
svn co http://svn.swlabs.org/grubng/trunk/c/ grubcclient
cd grubcclient
make

You can also compile only the uploading part:
make putarchive

putarchive Foo.arc.gz soap.grub.org /arcs/Balinny.00112233445566778899aabbccddeeff00112233.arc.gz

This will upload Foo.arc.gz to the server soap.grub.org at the location /arcs/Balinny.00112233445566778899aabbccddeeff00112233.arc.gz
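As an illustration of the three arguments above, here is a minimal sketch of how a tool like putarchive could check its command line. The names struct upload_job and parse_upload_args are hypothetical, and the rule that the location starts with '/' is an assumption taken from the example, not from the real source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure holding the three putarchive arguments. */
struct upload_job {
    const char *local_file;   /* e.g. Foo.arc.gz */
    const char *server;       /* e.g. soap.grub.org */
    const char *remote_path;  /* e.g. /arcs/<name>.arc.gz */
};

/* Fill job from argv. Returns 0 on success, -1 if the usage is wrong. */
int parse_upload_args(int argc, char **argv, struct upload_job *job)
{
    assert(job != NULL);
    if (argc != 4)
        return -1;            /* program name plus three arguments */
    if (argv[3][0] != '/')
        return -1;            /* assumption: the location is an absolute path */
    job->local_file = argv[1];
    job->server = argv[2];
    job->remote_path = argv[3];
    return 0;
}
```

A main function would call parse_upload_args(argc, argv, &job) and print a usage message when it returns -1.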

The Grub C Client also supports several defines in the Makefile to customize it.

HAVE_ZLIB_H Uses zlib directly to write the gzipped arc (default). If not defined, the client creates the arc and then calls gzip to compress it. Using zlib requires less disk space, because the arc is compressed on the fly instead of needing room for both the uncompressed and the compressed arc while gzip runs. However, although zlib itself is quite common, not all systems have its headers installed (on Debian-based systems, for example: apt-get install zlib1g-dev).

Force:
make CFLAGS=-DHAVE_ZLIB_H LDFLAGS=-lz

Disable:
make CFLAGS= LDFLAGS=
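The difference between the two builds can be sketched as follows. This is not the client's actual code: arc_append and arc_finish are hypothetical helpers, and the exact gzip command line is an assumption. Only the zlib calls (gzopen, gzwrite, gzclose) are real library functions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef HAVE_ZLIB_H
#include <zlib.h>
#endif

/* Hypothetical helper: append one record to the arc. Returns 0 on success. */
int arc_append(const char *arc_path, const void *data, size_t len)
{
    assert(arc_path != NULL && data != NULL);
#ifdef HAVE_ZLIB_H
    /* Compressed on the fly: arc_path is already the final .arc.gz file. */
    gzFile out = gzopen(arc_path, "ab");
    if (out == NULL)
        return -1;
    int written = gzwrite(out, data, (unsigned)len);
    int rc = gzclose(out);
    return (written == (int)len && rc == Z_OK) ? 0 : -1;
#else
    /* Plain write: arc_path is the uncompressed .arc file. */
    FILE *out = fopen(arc_path, "ab");
    if (out == NULL)
        return -1;
    size_t written = fwrite(data, 1, len, out);
    int rc = fclose(out);
    return (written == len && rc == 0) ? 0 : -1;
#endif
}

/* Hypothetical helper: called once when the arc is complete. */
int arc_finish(const char *arc_path)
{
    assert(arc_path != NULL);
#ifdef HAVE_ZLIB_H
    return 0;   /* nothing left to do, the file is already gzipped */
#else
    /* Both the .arc and the .arc.gz exist while gzip runs, which is
     * the extra disk space mentioned above. Sketch only: a path that
     * contains a double quote would break this command. */
    char cmd[1024];
    int n = snprintf(cmd, sizeof cmd, "gzip \"%s\"", arc_path);
    if (n < 0 || (size_t)n >= sizeof cmd)
        return -1;
    return system(cmd) == 0 ? 0 : -1;
#endif
}
```

When built with -DHAVE_ZLIB_H the sketch must be linked with -lz, which matches the LDFLAGS shown in the Force line.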

KEEP_FILES By default the arc is deleted if the upload was successful. The provided workunit is never deleted; deleting it is the responsibility of the program which produced it (you will usually overwrite it). Note that if the upload fails or the server rejects the arc, the arc is kept for manual inspection and the program exits with a non-zero code. When KEEP_FILES is defined, the files are not removed even after a successful upload, which is handy for archival purposes. Remember that your workunits must then have different names!

To enable it (keeping default HAVE_ZLIB_H behavior):
make "CFLAGS=-DKEEP_FILES -DHAVE_ZLIB_H"
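The cleanup rule described for KEEP_FILES could look roughly like this. It is a sketch under assumptions: cleanup_after_upload is a hypothetical name, and the real client may structure this step differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper run after the upload attempt.
 * upload_ok is non-zero when the server accepted the arc.
 * Returns the exit code the program should use. */
int cleanup_after_upload(const char *arc_path, int upload_ok)
{
    assert(arc_path != NULL);
    if (!upload_ok) {
        /* Failed or rejected: keep the arc for manual inspection. */
        fprintf(stderr, "upload failed, keeping %s\n", arc_path);
        return EXIT_FAILURE;
    }
#ifndef KEEP_FILES
    /* Default build: the arc is no longer needed once uploaded. */
    if (remove(arc_path) != 0)
        perror(arc_path);
#endif
    /* The workunit is never deleted here, in either build. */
    return EXIT_SUCCESS;
}
```

Because the choice is made with the preprocessor, the KEEP_FILES build contains no deletion code at all, which is why switching behavior requires recompiling.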

DELAY_UPLOAD Sometimes the upload server will reject every upload because of a transient error (such as a full disk), but that is no reason to stop crawling the web. If compiled with DELAY_UPLOAD, the files are not uploaded. Instead, the putarchive commands needed to upload them are written to a file named Upload.sh (Upload.bat on Windows), so you can batch upload them later. If you're also keeping files, the successfully uploaded ones are moved to a subfolder so the script can be run several times. Remember to use unique names for each download while in this mode, and to recompile without this option once normal operation is back.

To enable it (keeping default HAVE_ZLIB_H behavior):
make "CFLAGS=-DDELAY_UPLOAD -DHAVE_ZLIB_H"
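A minimal sketch of how the delayed commands might be recorded is shown below. The helper queue_upload and the UPLOAD_SCRIPT macro are hypothetical, and the exact line format written by the real client is an assumption modelled on the putarchive example above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#define UPLOAD_SCRIPT "Upload.bat"
#else
#define UPLOAD_SCRIPT "Upload.sh"
#endif

/* Hypothetical helper: append one putarchive command to the script.
 * Returns 0 on success, -1 on error. */
int queue_upload(const char *script, const char *arc,
                 const char *server, const char *remote_path)
{
    assert(script && arc && server && remote_path);
    if (strchr(arc, ' ') != NULL)
        return -1;   /* sketch only: names with spaces would need quoting */
    FILE *out = fopen(script, "a");   /* append, so earlier entries stay */
    if (out == NULL)
        return -1;
    int n = fprintf(out, "putarchive %s %s %s\n", arc, server, remote_path);
    int rc = fclose(out);
    return (n > 0 && rc == 0) ? 0 : -1;
}
```

A DELAY_UPLOAD build would call queue_upload(UPLOAD_SCRIPT, ...) where the normal build uploads the arc. Since every line names its own arc, two downloads with the same name would overwrite each other before the script runs, hence the advice to use unique names.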

http://svn.swlabs.org/grubng/trunk/c/

Feature request

Hi. Will there be multithreading support sometime in the future? I think crawling would be faster.

Yes, there are plans to add threading support at an unspecified time.
Meanwhile, you can run several instances of the program.