PolyBot: A High-Performance Distributed Web Crawler

PolyBot is a scalable web crawler that can download up to several hundred web pages per second over our T3 campus connection. The system is flexible enough to serve a variety of crawling applications (e.g., bulk crawlers, focused crawlers, random walkers, page trackers), and is engineered to handle several potential performance bottlenecks (e.g., DNS lookup, robot exclusion checking, URL frontier management).
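
To make two of the bottlenecks named above more concrete, here is a minimal Python sketch of a URL frontier with duplicate elimination and a cached robot exclusion check. It is an illustration only and not PolyBot's implementation: the class `Frontier`, its methods, the agent name `example-bot`, and the `example.com` URLs are all hypothetical. The robots.txt text is passed in directly so the sketch does no network access; it is parsed with the standard library's `urllib.robotparser`.

```python
from collections import deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class Frontier:
    """Toy URL frontier (hypothetical, not PolyBot's design):
    a FIFO queue with duplicate elimination and a per-host robots.txt cache."""

    def __init__(self, user_agent="example-bot"):
        self.user_agent = user_agent  # assumed agent name, for illustration
        self.queue = deque()          # URLs waiting to be downloaded
        self.seen = set()             # every URL ever accepted
        self.robots = {}              # host -> parsed robots.txt rules

    def set_robots(self, host, robots_txt):
        """Cache the rules for a host from the text of its robots.txt."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """Check the cached rules; hosts without cached rules are
        treated as allowed in this sketch."""
        parser = self.robots.get(urlsplit(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def add(self, url):
        """Enqueue a URL unless it was seen before or is excluded."""
        if url in self.seen or not self.allowed(url):
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to download, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


frontier = Frontier()
frontier.set_robots("example.com", "User-agent: *\nDisallow: /private/\n")
frontier.add("http://example.com/index.html")      # accepted
frontier.add("http://example.com/index.html")      # rejected: duplicate
frontier.add("http://example.com/private/a.html")  # rejected: excluded
```

A crawler running at hundreds of pages per second cannot keep the whole frontier and the set of seen URLs in memory like this, nor fetch robots.txt on every request. The sketch only shows the checks that have to happen before a URL is handed to a downloader.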

PolyBot is a distributed system that runs on a network of Solaris- or Linux-based workstations and can be scaled by adding machines. A small configuration of the system, running on 3 to 5 machines, can download more than 300 pages per second.

For more details about the system, see the following paper:

Design and Implementation of a High-Performance Distributed Web Crawler. V. Shkapenyuk and T. Suel. IEEE International Conference on Data Engineering, February 2002. (PostScript)

Further inquiries can be made to polybot@cis.poly.edu.