Download/Upload queue
thindil, Thu, 02/05/2009 - 23:39
| Project: | Grub Next Generation Python Client |
| Version: | 0.3.1 |
| Component: | Code |
| Category: | feature request |
| Priority: | critical |
| Assigned: | Sealabs |
| Status: | closed |
Description
Because the Python client can run an (almost) unlimited number of crawlers (on separate threads), I have one request: could you please add a queue for downloading new workunits and uploading .arc files, so that only one crawler at a time connects to the servers? At the moment, a single client with a very good connection can effectively launch a DoS attack on the Grub servers (and take them down) ;) I have updated the client documentation with information about this feature.

#1
I had been thinking about that myself, simply as good programming practice, but it turns out it is really needed! I will take care of it ASAP. The next release will probably include it.
But anyway, once everybody is using Grub, shouldn't the upload server just scale well?
#2
If there were more active clients, the whole Grub architecture would probably have to change ;) Servers can limit the number (or speed) of connections from users, but that is only a limitation, not a full solution to the problem. So I have started thinking about adding special servers that would hold only a list of the IPs of the normal servers, something like torrent trackers ;) But this would require changing the workunit format too. At the moment, sending too many crawlers to the servers causes: on dispatch, more invalid workunits; on upload, all of the CPU and/or connection being used up (so the server stops accepting other .arc files).
#3
No need to change workunits for that. They contain the host to upload to. Make them round robin, i.e. PUT soap6.grub.org:8080 ...
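A minimal sketch of the round-robin idea, assuming the dispatcher keeps a list of upload hosts and stamps the next one into each workunit it hands out. Only soap6.grub.org:8080 comes from this thread; the other host names and the `make_host_picker` helper are hypothetical and used purely for illustration.

```python
from itertools import cycle


def make_host_picker(hosts):
    """Return a function that hands out hosts in round-robin order."""
    ring = cycle(hosts)
    return lambda: next(ring)


# Hypothetical host list: only the first entry is mentioned in the thread.
hosts = [
    "soap6.grub.org:8080",
    "upload-b.example:8080",
    "upload-c.example:8080",
]

next_host = make_host_picker(hosts)

# Each new workunit would get the next host in the ring as its upload target.
assigned = [next_host() for _ in range(5)]
```

With three hosts, the fourth workunit wraps around to the first host again, so the load is spread evenly without any change to the workunit format itself.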
#4
I am thinking of doing this in another, simpler way, but first the upload server needs a few additional options ;) But IMO, discussing protocol changes is off-topic in an issue about the Python client ;) For now there is only one solution to the problem: queues in multi-crawler clients and maybe some limits on the servers.
Plus, remember: everything concerning workunits is currently in the hands of Jeremie, who is very, very, very busy at the moment (I will probably write a new dispatch server sooner) ;)
#5
For download/upload queuing on the client end, no changes to workunits are required. Once this is implemented, there is no need to change anything server side.
Only the other clients may or may not be using a queue, and thus they will have to be modified as well.
-Chris.
#6
I have been running the Python client for the last couple of days, and since I have a lot of bandwidth at my disposal (university connection), I figured I would help the Grub project and do some crawling. I have been using the Python client because it is a lot faster than the other crawling options. I tried running 200 threads the other day, and it appears that I ended up killing the upload server (as was mentioned in the original post). So now I have settled on 50 threads. Is there any way for me to use more of my bandwidth, or should I just leave the crawler at around 50 threads?
#7
Twice in total (if I counted correctly ;) ), but don't worry: after each server crash I get very detailed information about the cause, so each time the server gets one more bug fixed ;) This helps Grub too ;) If you want, you can raise it, but the problems with a large number of crawlers in the Python client are:
1) At the dispatch server: it is a set of simple Perl scripts, so sometimes the server seems to send the same workunit to several crawlers. So even if you run 200 crawlers simultaneously, in reality you are running around 60-80 ;) (the problem is in Python: it doesn't have good built-in support for multi-threaded applications, so this check must be done manually)
2) At the upload server: the main problems can be connection and CPU usage (probably next week I will try to add support for LZMA compression, as in issue #255: LZMA compression; if it gets implemented I will post to the mailing list too). Also, at the moment, as far as I can see, the Python client often loses its connection to the upload server (lots of incomplete or empty files appear on the server).
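The manual duplicate check mentioned in point 1 could look roughly like the sketch below: a shared registry, guarded by a lock, that lets only one crawler thread claim a given workunit. The `WorkunitRegistry` class and the workunit IDs are hypothetical; this is not how the Python client actually does it.

```python
import threading


class WorkunitRegistry:
    """Tracks which workunits are currently being crawled (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()

    def claim(self, workunit_id):
        """Return True if the caller is the first to claim this workunit."""
        with self._lock:
            if workunit_id in self._active:
                return False  # another crawler already has it
            self._active.add(workunit_id)
            return True

    def release(self, workunit_id):
        """Mark the workunit as finished so it may be claimed again."""
        with self._lock:
            self._active.discard(workunit_id)


registry = WorkunitRegistry()
first = registry.claim("wu-1")   # first crawler gets the workunit
second = registry.claim("wu-1")  # a duplicate from the dispatcher is rejected
```

A crawler that gets `False` back would drop the duplicate and ask the dispatcher for another workunit instead of crawling the same URLs twice.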
Reply to #5:
Yes, a queue doesn't need any changes on the servers, but in the future a single upload server may be too weak to handle all the .arc files.
And the other multi-crawler clients (all one of them: C# ;) ) have had a queue for some time ;)
#8
Please try out the latest svn code [1] and report any issues before the release. A separate thread now handles the uploads, and only one thread requests a workunit at a time.
[1] http://people.swlabs.org/~bartek/websvn/filedetails.php?repname=GrubNG&p...
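For readers who want a picture of the approach described above, here is a rough sketch and not the actual svn code: crawler threads take a lock before requesting a workunit, and hand finished .arc files to a single uploader thread through a queue. All names (`run_crawlers`, `request_workunit`, the fake workunit and file names) are made up for illustration, and the network calls are replaced by stand-ins.

```python
import queue
import threading

_STOP = object()  # sentinel telling the uploader thread to exit


def run_crawlers(num_crawlers):
    """Run crawler threads that share one request lock and one uploader."""
    upload_queue = queue.Queue()
    request_lock = threading.Lock()
    uploaded = []

    def request_workunit(crawler_id):
        # Only one thread at a time may talk to the dispatch server.
        with request_lock:
            return f"workunit-{crawler_id}"  # stand-in for a real download

    def uploader():
        # The single thread that talks to the upload server.
        while True:
            arc_path = upload_queue.get()
            if arc_path is _STOP:
                break
            uploaded.append(arc_path)  # stand-in for a real upload

    def crawler(crawler_id):
        workunit = request_workunit(crawler_id)
        # ... crawling would happen here ...
        upload_queue.put(f"{workunit}.arc")

    upload_thread = threading.Thread(target=uploader)
    upload_thread.start()

    crawlers = [
        threading.Thread(target=crawler, args=(i,)) for i in range(num_crawlers)
    ]
    for thread in crawlers:
        thread.start()
    for thread in crawlers:
        thread.join()

    upload_queue.put(_STOP)  # all crawlers are done, let the uploader finish
    upload_thread.join()
    return uploaded


result = run_crawlers(4)
```

However many crawlers run, the servers see at most one workunit request and one upload from this client at any moment, which is the point of the original request.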
#9
Fixed in 0.4
#10
Automatically closed -- issue fixed for two weeks with no activity.