Welcome to the homepage of the Web Exploration and Search Technology Lab (WestLab) in the Computer and Information Science Department at Polytechnic University. The goal of our group is to design new tools and techniques for searching and analyzing the structure and content of the World Wide Web.

More details on our work can be found on the project page. Our work is focused on the following main areas:

Performance of Compressed Inverted List Caching in Search Engines

Due to the rapid growth in the size of the web, web search engines are facing enormous performance challenges. The larger engines in particular have to be able to process tens of thousands of queries per second on tens of billions of documents, making query throughput a critical issue. To satisfy this heavy workload, search engines use a variety of performance optimizations including index compression, caching, and early termination.
We focus on two techniques, inverted index compression and index caching, which play a crucial role in web search engines as well as other high-performance information retrieval systems. We perform a comparison and evaluation of several inverted list compression algorithms, including new variants of existing algorithms that have not been studied before. We then evaluate different inverted list caching policies on large query traces, and finally study the possible performance benefits of combining compression and caching. The overall goal of this work is to provide an updated discussion and evaluation of these two techniques, and to show how to select the best set of approaches and settings depending on parameters such as disk speed and main memory cache size. You can download our code here: polycomp.tar
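
For readers new to the topic, here is a minimal sketch of what inverted list compression can look like. It is not taken from polycomp.tar and is not necessarily one of the algorithms we evaluate. It uses variable-byte coding of docID gaps, a standard textbook technique, and all function names are hypothetical, chosen for illustration only.

```python
# Illustrative sketch only: variable-byte coding of docID gaps.
# Function names are hypothetical and not part of polycomp.tar.

def vbyte_encode(numbers):
    """Encode non-negative integers using 7 data bits per byte.

    Low-order groups come first; the high bit is set on the last
    byte of each number to mark where it ends.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # continuation byte, high bit clear
            n >>= 7
        out.append(n | 0x80)       # final byte, high bit set
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers = []
    n = 0
    shift = 0
    for b in data:
        if b & 0x80:               # last byte of the current number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Compress a sorted list of docIDs by storing gaps between them."""
    gaps = []
    prev = 0
    for d in doc_ids:
        gaps.append(d - prev)      # small gaps need fewer bytes
        prev = d
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the docIDs by summing the decoded gaps."""
    doc_ids = []
    total = 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 7, 130, 131, 100000]
packed = compress_postings(postings)
restored = decompress_postings(packed)
```

In this example the five docIDs occupy 7 bytes after compression, compared with 20 bytes as fixed 32-bit integers. Smaller lists mean that more of the index fits in a main memory cache of a given size, which is one reason to study compression and caching together.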

Performance of Cluster-Based Search Engines

Current large search engines are based on scalable clusters, i.e., large numbers of workstations connected by fast LANs. We are working to improve the performance and scalability of such engines and to increase the quality of the results returned to the user. We have implemented and studied a number of search engine components, including a scalable high-performance crawler (Polybot), specialized storage systems, and indexing and query execution software. We are also looking at new ranking techniques based on link analysis, and at the integration of term-based, link-based, and other techniques.

Future Distributed Web Search Architectures

We are studying potential alternatives to the current centralized cluster-based architectures, such as highly distributed and peer-to-peer architectures and client-based search tools. Our current focus is on the design of a novel peer-to-peer information retrieval substrate, and on query execution in widely distributed systems.

Data Extraction, Mining and Discovery, and the Deep Web

With collaborators at Poly and at UC Berkeley, we are looking at automatic access to and query processing over web-accessible databases, and at data extraction from unstructured and loosely structured web pages. We are also looking at techniques for focused crawling, recrawling, and other strategies for the discovery and monitoring of web resources.

Optimizing Performance over Slow Wireless Links

In collaboration with the Visual Information Processing Lab, we are studying delta compression and file synchronization techniques for efficient storage and replication of collections of similar files. For example, we are looking at ways to improve basic tools such as tar+gzip for distributing file collections and the rsync utility for file synchronization when used over very low-bandwidth links. We are also working on protocol optimization and scheduling issues in SPAWN, a proxy system for wireless web access that we have built.

We may have research projects available for students looking for Senior Project and MS Thesis topics. If you are a strong and highly motivated student at Poly who is interested in doing research in the web technology area, please contact Prof. Torsten Suel to inquire about possible topics.