The following are descriptions of some of our current projects, with links to project homepages that provide additional information:


PolyBot: A High-Performance Distributed Web Crawler:

PolyBot is a scalable web crawler that can download several hundred pages per second. The system is flexible enough to be used by a variety of crawling applications (e.g., bulk crawlers, focused crawlers, random walkers, page trackers), and is engineered to handle a number of possible performance bottlenecks (e.g., DNS lookup, robot exclusion checking, URL frontier). For more information, see the PolyBot Project Homepage.
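To make two of the components named above more concrete, here is a minimal sketch of a URL frontier combined with robot exclusion checking. It is not PolyBot's implementation: the `Frontier` class, the `ExampleBot` user agent, and the idea of passing robots.txt rules in as pre-fetched lines are all assumptions made for illustration, and a real crawler would add politeness delays, prioritization, and disk-backed queues.

```python
from collections import deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class Frontier:
    """Hypothetical in-memory URL frontier with robots.txt filtering."""

    def __init__(self, robots_by_host, user_agent="ExampleBot"):
        self.queue = deque()   # URLs waiting to be fetched, in FIFO order
        self.seen = set()      # every URL ever offered, to avoid duplicates
        self.user_agent = user_agent
        self.rules = {}
        # robots_by_host maps a host name to the lines of its robots.txt,
        # assumed to have been downloaded already.
        for host, lines in robots_by_host.items():
            parser = RobotFileParser()
            parser.parse(lines)
            self.rules[host] = parser

    def add(self, url):
        """Enqueue url unless it was seen before or is disallowed."""
        if url in self.seen:
            return False
        self.seen.add(url)
        parser = self.rules.get(urlsplit(url).netloc)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return False
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None if the frontier is empty."""
        return self.queue.popleft() if self.queue else None
```

Hosts without an entry in `robots_by_host` are treated as unrestricted here, which is a simplification; a production crawler would fetch and cache the rules before crawling a new host.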


Scalable Webpage Repository:

This project is building a scalable storage system for web content that allows efficient retrieval of pages and sites, is robust against crashes, and uses highly optimized compression techniques to store different versions of a page. More information will be available soon.


Peer-To-Peer Search Infrastructure:

There has recently been a lot of interest in peer-to-peer systems and other highly resilient, widely distributed networks and applications. We are investigating and implementing a possible future search engine architecture that is based on an underlying open and highly distributed IR substrate. Applications of this work include search in file sharing and storage networks, intranet search, and standard large search engines. More details can be found on the Project Homepage.


Indexing and Query Execution Performance:

We are studying techniques for improving the performance of the inverted index structures typically used in search engines. This includes work on parallel index partitioning strategies, index updates, and query execution. We are particularly interested in techniques for integrating term-based and link-based ranking strategies, and in pruning techniques for very large indexes.
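For readers unfamiliar with inverted indexes, the following is a small sketch of the basic structure and of conjunctive (AND) query execution over it. It is a textbook illustration, not the project's code; the function names are hypothetical, and it omits compression, ranking, and partitioning.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Merge-intersect two sorted posting lists in linear time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """Return IDs of documents that contain every query term."""
    # Start from the shortest posting list so intermediate results stay small.
    lists = sorted((index.get(t.lower(), []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
        if not result:
            break  # no document can match once the intersection is empty
    return result
```

Processing the shortest list first is a common heuristic for conjunctive queries, since the running intersection can never grow larger than its smallest input.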


Web Graph Structure and Link Analysis:

The graph structure of the Web has proven to be very valuable in web search and analysis. However, computing with graphs of hundreds of millions of nodes and billions of edges is quite challenging. We are experimenting with graph compression techniques and with I/O-efficient graph algorithms for link-based ranking and analysis. We are also looking at new approaches for link-based ranking and for other tasks such as clustering, classification, and focused crawling.
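As one illustration of link-based ranking, the sketch below runs a PageRank-style power iteration on a tiny in-memory graph. PageRank is used here only as a well-known example; the text above does not say which ranking methods the project studies. The function name and the damping value of 0.85 are assumptions, and this version is neither compressed nor I/O-efficient, which is exactly what makes the web-scale case hard.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    graph maps each node to the list of nodes it links to. Every link
    target is assumed to appear as a key of graph as well.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the same small share from random jumps.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in graph.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A node without out-links spreads its score over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Each iteration touches every edge once, so on a graph with billions of edges the access pattern and the size of the edge representation dominate the cost.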


Delta Compression and Remote File Synchronization:

(Joint work with the Visual Information Processing Lab) Delta compression and file synchronization are techniques for efficient data transmission and replication in environments with a lot of redundancy due to many similar or identical files and data sets. We are designing algorithms and software for delta compression and file synchronization that can be used to encode different versions of a file, update old versions, or exchange large collections of files with some degree of similarity. We are also looking at applications in the context of wireless web access, peer-to-peer systems, web archival, and distribution of large data sets. See the zdelta Homepage for details on a delta compression tool that we have developed in our group.
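The idea behind delta compression can be sketched as encoding a new version of a file as a list of "copy from the old version" and "insert literal bytes" operations. The example below uses Python's standard `difflib` module to find the matching regions. It does not reflect zdelta's algorithm or output format; the operation tuples and the function names `make_delta` and `apply_delta` are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Encode new relative to old as a list of copy/insert operations."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Region already present in old: store only offset and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Region missing from old: store the literal bytes.
            ops.append(("insert", new[j1:j2]))
        # "delete" regions need no operation; they are simply never copied.
    return ops


def apply_delta(old: bytes, ops):
    """Rebuild the new version from old plus the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += old[start:start + length]
        else:
            out += op[1]
    return bytes(out)
```

When the two versions are similar, most of the delta consists of short copy operations, so a receiver that already holds the old version needs only the few literal bytes that changed.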


Wrapper Construction for Extracting Relational Data:

Many web pages, such as online store catalogs or discussion boards, contain interesting relational data. Wrappers are programs that extract such data from the unstructured HTML code supplied by the server. We have designed a system for constructing wrappers in an interactive fashion through a dialogue with the user. The system attempts to minimize the number of user interactions required to produce a robust wrapper. More information on this project will be available soon.


SPAWN: A Scalable Proxy Architecture for Web Access over Slow Wireless Networks:

(Joint work with the Visual Information Processing Lab) Web access via wireless modems is often slow due to the low bandwidth and high latency of most current wireless links. One technique for improving performance puts a proxy between the client browser and the web site that improves web access by using techniques such as compression and image transcoding. In this project, we are building a highly optimized dual-proxy architecture that combines a number of new and old techniques in order to get the fastest possible access time over slow links. Our main focus is currently on scalability issues and scheduling policies for such proxies under high user loads. See the SPAWN Homepage for more details.