The following describes some of our current projects,
with links to the project homepages
for additional information:
PolyBot: A High-Performance Distributed Web Crawler:
PolyBot is a scalable web crawler that can download several
hundred pages per second. The system is flexible enough to
be used by a variety of crawling applications (e.g.,
bulk crawlers, focused crawlers, random walkers, page trackers),
and is engineered to handle a number of possible performance
bottlenecks (e.g., DNS lookup, robot exclusion checking, URL
frontier). For more information, see the PolyBot
Project Homepage.
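As a rough illustration of what managing a URL frontier can involve, the following minimal Python sketch keeps one queue per host and enforces a politeness delay between requests to the same host. It is not PolyBot's actual design: the `Frontier` class, its method names, and the 30-second default delay are all assumptions made for this example.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """Toy URL frontier (hypothetical): per-host FIFO queues plus a ready-time heap."""

    def __init__(self, delay=30.0):
        self.delay = delay                # assumed minimum gap (seconds) between hits on one host
        self.queues = defaultdict(deque)  # host -> URLs waiting to be fetched
        self.ready = []                   # heap of (earliest_fetch_time, host)
        self.next_ok = {}                 # host -> earliest time the host may be hit again
        self.seen = set()                 # every URL ever enqueued, for duplicate elimination

    def add(self, url, now=0.0):
        """Enqueue a URL; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        if not self.queues[host]:
            # Host has no pending URLs, so it is not in the heap yet.
            heapq.heappush(self.ready, (max(now, self.next_ok.get(host, 0.0)), host))
        self.queues[host].append(url)
        return True

    def next(self, now):
        """Return a URL whose host may be contacted at time `now`, or None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        return url
```

A real crawler would also need DNS caching, robot exclusion checks, and a disk-backed structure for the set of seen URLs; this sketch keeps everything in memory for clarity.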
Scalable Webpage Repository:
This project is building a scalable storage system for web
content that allows efficient retrieval of pages and sites,
is robust against crashes, and uses highly optimized compression
techniques to store different versions of a page. More information
will be available soon.
Peer-To-Peer Search Infrastructure:
There has recently been a lot of interest in peer-to-peer
systems and other highly resilient, widely distributed networks
and applications. We are investigating and implementing a
possible future search engine architecture that is based on
an underlying open and highly distributed IR substrate. Applications
of this work include search in file sharing and storage networks,
intranet search, as well as standard large engines. More details
can be found on the Project Homepage.
Indexing and Query Execution Performance:
We are studying techniques for improving the performance
of the inverted index structures typically used in search
engines. This includes work on parallel index partitioning
strategies, index updates, and query execution. We are particularly
interested in techniques for integrating term-based and link-based
ranking strategies, and in pruning techniques for very large
indexes.
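To make the idea of integrating term-based and link-based ranking concrete, here is a minimal Python sketch of an in-memory inverted index whose query scores are blended with a precomputed per-document link score. The function names, the TF-IDF variant, and the linear weighting via `alpha` are assumptions chosen for illustration, not the techniques studied in this project.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Map each term to a list of (doc_id, term_frequency) postings, sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = defaultdict(int)
        for term in docs[doc_id].lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index


def search(index, n_docs, query, link_score, alpha=0.7, k=10):
    """Rank documents by alpha * term score + (1 - alpha) * link score."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, [])
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings:
            scores[doc_id] += (1 + math.log(tf)) * idf
    ranked = [
        (alpha * s + (1 - alpha) * link_score.get(doc_id, 0.0), doc_id)
        for doc_id, s in scores.items()
    ]
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by doc_id
    return [doc_id for _, doc_id in ranked[:k]]
```

In this sketch, `link_score` stands in for any link-based measure (for example, a normalized PageRank value), and only documents that match at least one query term are ranked.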
Web Graph Structure and Link Analysis:
The graph structure of the Web has proven to be very valuable
in web search and analysis. However, computing with graphs
of hundreds of millions of nodes and billions of edges is
quite challenging. We are experimenting with graph compression
techniques and with I/O-efficient graph algorithms for link-based
ranking and analysis. We are also looking at new approaches
for link-based ranking and for other tasks such as clustering,
classification, and focused crawling.
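One common starting point for graph compression is to sort each node's adjacency list and store the gaps between consecutive neighbor IDs in a variable-length byte code, since small gaps then take a single byte. The Python sketch below illustrates that general idea only; it is not the compression scheme used in this project, the function names are hypothetical, and it assumes node IDs are non-negative integers.

```python
def vbyte_encode(n):
    """Encode a non-negative integer in 7-bit groups, low bits first."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_adjacency(neighbors):
    """Gap-encode a node's neighbor IDs (duplicates are dropped)."""
    out = bytearray()
    prev = 0
    for v in sorted(set(neighbors)):
        out += vbyte_encode(v - prev)  # store the difference, not the ID itself
        prev = v
    return bytes(out)


def decode_adjacency(data):
    """Invert encode_adjacency, returning the sorted neighbor IDs."""
    neighbors = []
    prev = n = shift = 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += n  # undo the gap encoding
            neighbors.append(prev)
            n = shift = 0
    return neighbors
```

Because pages tend to link to pages with nearby IDs when IDs follow URL order, most gaps are small; the first neighbor, stored as a gap from zero, is usually the most expensive entry.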
Delta Compression and Remote File Synchronization:
(Joint work with the Visual Information Processing Lab)
Delta compression and file synchronization are techniques
for efficient data transmission and replication in environments
with a lot of redundancy due to many similar or identical
files and data sets. We are designing algorithms and software
for delta compression and file synchronization that can be
used to encode different versions of a file, update old versions,
or exchange large collections of files with some degree of
similarity. We are also looking at applications in the context
of wireless web access, peer-to-peer systems, web archival,
and distribution of large data sets. See the zdelta
Homepage for details on a delta compression tool that
we have developed in our group.
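The core idea of delta compression, encoding a target file relative to a reference file that the receiver already holds, can be approximated with Python's standard `zlib` module by passing the reference as a preset dictionary. This is only a sketch of the concept and not how zdelta itself works; the helper names `delta_encode` and `delta_decode` are made up for this example, and deflate's 32 KB window means only the tail of a large reference is actually usable.

```python
import zlib


def delta_encode(reference: bytes, target: bytes) -> bytes:
    """Compress `target`, letting deflate copy matching strings from `reference`."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(target) + comp.flush()


def delta_decode(reference: bytes, delta: bytes) -> bytes:
    """Rebuild the target; the same `reference` must be supplied as the dictionary."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()
```

When the target is a lightly edited copy of the reference, the delta is typically far smaller than compressing the target on its own, which is the redundancy that delta compression and file synchronization tools exploit.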
Wrapper Construction for Extracting Relational Data:
Many web pages, such as online store catalogs or discussion
boards, contain interesting relational data. Wrappers are
programs that extract such data from the unstructured HTML
code supplied by the server. We have designed a system for
constructing wrappers in an interactive fashion through a
dialogue with the user. The system attempts to minimize the
number of user interactions required to produce
a robust wrapper. More information on this project will be
available soon.
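For readers unfamiliar with wrappers, the following Python sketch shows a very simple hand-written one that turns HTML table rows into tuples using the standard `html.parser` module. It illustrates only the kind of program such a system aims to produce; the `TableWrapper` class and `extract_rows` function are hypothetical and have nothing to do with the interactive construction system described above.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect the text of each table row as a tuple of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # extracted relation: one tuple per <tr>
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None


def extract_rows(html):
    """Run the wrapper over an HTML string and return the extracted rows."""
    wrapper = TableWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.rows
```

A wrapper this naive breaks as soon as the page layout changes or the data is not in a table, which is part of what makes robust wrapper construction difficult.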
SPAWN: A Scalable Proxy Architecture for Web Access over Slow Wireless Networks:
(Joint work with the Visual Information Processing Lab)
Web access via wireless modems is often slow due to the low
bandwidth and high latency of most current wireless links.
One technique for improving performance puts a proxy between
the client browser and the web site that improves web access
by using techniques such as compression and image transcoding.
In this project, we are building a highly optimized dual-proxy
architecture that combines a number of new and old techniques
in order to get the fastest possible access time over slow
links. Our main focus is currently on scalability issues and
scheduling policies for such proxies under high user loads.
See the SPAWN Homepage
for more details.