The following is a description of some of our current projects, with
links to project homepages that provide additional information:
PolyBot: A High-Performance Distributed Web Crawler
PolyBot is a scalable web crawler that can download several hundred
pages per second. The system is flexible enough to support a variety
of crawling applications (e.g., bulk crawlers, focused crawlers,
random walkers, page trackers), and is engineered to handle a number
of potential performance bottlenecks (e.g., DNS lookup, robot exclusion
checking, the URL frontier). For more information, see the PolyBot
Project Homepage.
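To make one of these bottlenecks concrete, below is a minimal sketch of
a politeness-aware URL frontier: one FIFO queue per host, with a minimum
delay enforced between two requests to the same host. The class, method,
and parameter names here are illustrative assumptions for this sketch,
not PolyBot's actual interfaces.

import collections
import time
import urllib.parse

class Frontier:
    """Keeps one FIFO queue per host and enforces a minimum delay
    between consecutive requests to the same host."""

    def __init__(self, per_host_delay=2.0):
        self.per_host_delay = per_host_delay
        self.queues = collections.defaultdict(collections.deque)
        self.next_allowed = {}   # host -> earliest time of next fetch
        self.seen = set()        # simple duplicate-URL filter

    def add(self, url):
        if url in self.seen:
            return
        self.seen.add(url)
        host = urllib.parse.urlsplit(url).hostname or ""
        self.queues[host].append(url)

    def pop(self):
        """Return a URL whose host is currently allowed, or None."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now >= self.next_allowed.get(host, 0.0):
                self.next_allowed[host] = now + self.per_host_delay
                return queue.popleft()
        return None

A real crawler would layer the other bottleneck handlers on top of such
a structure, for example a cached asynchronous DNS resolver and a
robots.txt check before a popped URL is actually fetched.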
Scalable Webpage Repository
This project is building a scalable storage system for web content
that allows efficient retrieval of pages and sites, is robust against
crashes, and uses highly optimized compression techniques to store
different versions of a page. More information will be available soon.
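As a rough illustration of how a repository can exploit the similarity
between successive versions of a page, the sketch below compresses each
new version against the previous one using a zlib preset dictionary, so
unchanged regions cost almost nothing. This is a generic technique
assumed for illustration only, not necessarily the storage format used
by the project.

import zlib

def compress_version(new_page: bytes, prev_page: bytes = b"") -> bytes:
    # The previous version serves as a preset dictionary, so bytes
    # shared with it compress to back-references rather than literals.
    c = zlib.compressobj(level=9, zdict=prev_page)
    return c.compress(new_page) + c.flush()

def decompress_version(blob: bytes, prev_page: bytes = b"") -> bytes:
    # Decompression must supply the same dictionary (previous version).
    d = zlib.decompressobj(zdict=prev_page)
    return d.decompress(blob) + d.flush()

v1 = b"<html>...lots of boilerplate...</html>"
v2 = b"<html>...lots of boilerplate, one small edit...</html>"
blob = compress_version(v2, prev_page=v1)
assert decompress_version(blob, prev_page=v1) == v2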
There has recently been considerable interest in peer-to-peer systems
and other highly resilient, widely distributed networks and applications.
We are investigating and implementing a possible future search engine
architecture based on an underlying open and highly distributed IR
substrate. Applications of this work include search in file-sharing
and storage networks, intranet search, and standard large-scale search
engines. More details can be found on the Project Homepage.
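As one way such a distributed IR substrate could be organized (an
assumption made for illustration here, not the architecture described
on the project homepage), the sketch below partitions an inverted index
across peers by hashing each term to an owning node, so a query contacts
only the nodes that own its terms. The node names and the in-memory
"network" are purely hypothetical.

import hashlib

NODES = ["node-a", "node-b", "node-c"]    # hypothetical peers
index = {node: {} for node in NODES}      # node -> term -> posting set

def owner(term: str) -> str:
    # Assign each term to a node by hashing; with consistent hashing
    # this would also tolerate nodes joining and leaving.
    h = int(hashlib.sha1(term.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def publish(doc_id: str, terms):
    # Each term's postings live only on the node that owns the term.
    for term in terms:
        index[owner(term)].setdefault(term, set()).add(doc_id)

def search(*terms):
    """Intersect postings fetched from each term's owning node."""
    postings = [index[owner(t)].get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

publish("doc1", ["peer", "to", "peer", "search"])
publish("doc2", ["intranet", "search"])
print(search("peer", "search"))   # -> {'doc1'}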