PolyBot: A High-Performance Distributed
Web Crawler
PolyBot is a scalable web crawler that can download up to
several hundred web pages per second over our T3 campus connection.
The system is flexible enough to support a variety of
crawling applications (e.g., bulk crawlers, focused crawlers,
random walkers, page trackers), and is engineered to handle
a number of possible performance bottlenecks (e.g., DNS lookup,
robot exclusion checking, URL frontier management).
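Robot exclusion checking becomes a bottleneck when robots.txt is fetched again for every URL on a host. The following is a minimal sketch of one way to avoid that, by caching one parsed robots.txt per host. It is not PolyBot's implementation: the RobotsCache class, the fetch_robots_txt callable, and the "example-crawler" user agent are hypothetical names chosen for illustration. Only the standard-library urllib.robotparser module is used.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keeps one parsed robots.txt per host so each host is fetched only once."""

    def __init__(self, fetch_robots_txt, user_agent="example-crawler"):
        # fetch_robots_txt is a hypothetical callable: host -> robots.txt text.
        self._fetch = fetch_robots_txt
        self._agent = user_agent
        self._parsers = {}

    def allowed(self, url):
        """Return True if the cached rules for the URL's host permit fetching it."""
        host = urlsplit(url).netloc.lower()
        parser = self._parsers.get(host)
        if parser is None:
            # First URL seen for this host: fetch and parse its rules once.
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(self._agent, url)
```

A real crawler would also need to expire cached entries and handle fetch failures; both are omitted here to keep the sketch short.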
PolyBot is a distributed system that runs on a network of
Solaris- or Linux-based workstations and can be scaled
by adding machines. A small configuration
of the system runs on 3 to 5 machines and can download
more than 300 pages per second.
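One common way to spread crawling work over several machines is to assign each hostname to a machine by hashing it, so that all URLs of a host are handled in one place. This page does not say how PolyBot divides its work, so the sketch below is an illustrative assumption only; the function name assign_machine and its parameters are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_machine(url, num_machines):
    """Map a URL to a machine index in [0, num_machines) based on its hostname."""
    host = urlsplit(url).netloc.lower()
    # Use a stable hash (unlike the built-in hash(), which varies between runs).
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines
```

With this simple modulo scheme, changing the number of machines reassigns most hosts, so a system that grows often would likely prefer a scheme such as consistent hashing.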
For more details about the system, see the following paper:
Design and Implementation of a High-Performance
Distributed Web Crawler. V. Shkapenyuk and T.
Suel. IEEE International Conference on Data Engineering, February
2002.
Further inquiries can be made to polybot@cis.poly.edu.