Design and Implementation of a
High-Performance Distributed Web Crawler
TR-CIS-2001-03
Vladislav Shkapenyuk, Torsten Suel
Abstract:
Broad web search engines as well as many more specialized search tools
rely on web crawlers to acquire large collections of pages for indexing
and analysis. Such a web crawler may interact with millions of hosts over
a period of weeks or months, and thus issues of robustness, flexibility,
and manageability are of major importance. In addition, I/O performance,
network resources, and OS limits must be taken into account in order to achieve
high performance at a reasonable cost.
In this paper, we describe the design and implementation of a distributed
web crawler that runs on a network of workstations. The crawler scales to
(at least) several hundred pages per second, is resilient against system
crashes and other events, and can be adapted to various crawling
applications. We present the software architecture of the system, discuss the
performance bottlenecks, and describe efficient techniques for achieving high
performance. We also report preliminary experimental results based on a crawl of 120
million pages on 5 million hosts.