Hao Yan's Homepage

I am a Ph.D. student at Department of Computer Science and Engineering of Polytechnic Institute of New York University (NYU-POLY and previously known as Polytechnic University) at Brooklyn, New York. My advisor is Prof. Torsten Suel in Web Exploration and Search Technology Lab. I received my bachelor degree from Xi an Jiaotong University in 2000 and my master degree from Insititue of Automation, Chinese Academy of Sciences in 2003.

My current research interests include: web search technology, information retrieval, inverted index compression, text compression, database, data mining, distributed computing systems and file synchronization.
(Before I joined Web Exploration and Search Technology Lab in Fall, 2005, I did some research work on image processing.)

My email: hyan (at) cis.poly.edu

Working and research experience
Selected Publications
Selected Projects
  • Efficient Term Proximity Search with Term-Pair Indexes
    Early termination techniques can greatly improve the efficiency of query processing by avoiding evaluating the full inverted lists of the query terms. However, most existing research on such techniques often does not consider the term proximity, that is, the distance between term occurrences in the document, while real search engines usually integrate it with other features to improve retrieval quality. In this project, we proposed new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model, by building additional term pair indexes. Our preliminary results showed that our methods can achieve significant improvements over the existing methods.
  • Revisiting Globally Sorted Indexes for Efficient Document Retrieval
    Most early termination techniques first build new inverted indexes by sorting the inverted lists in the order of either the term-dependent information, e.g., term frequencies or term IR scores, or the term-independent information, e.g., static rank of the document; and then apply appropriate retrieval strategies on the resulting indexes. Although the methods based only on the static rank have been shown to be ineffective for the early termination, there are still many advantages of using the methods based on term-independent information. In this project, we propose new techniques to organize inverted indexes based on the term-independent information beyond static rank and study the new retrieval strategies on the resulting indexes.
  • Compact Full-Text Indexing of Versioned Document Collections
    Versioned document collections contain multiple versions of each document. Important large-scale examples of such collections are Wikipedia or the web page archive maintained at the Internet Archive. A straightforward approach to indexing such collections would simply treat each document version as a separate document, resulting in an index size that scales linearly with the number of versions. In this project, we proposed several new techniques for organizing and compressing inverted index structures for versioned document collections.
  • Compressing Term Positions in Web Indexes
    Most previous work on compressing web search engine indexes has focused on compressing docID and frequency data. In this project, we focused on techniques for compressing term positions in web indexes. Compression of term positions in web pages is complicated by the fact that term occurrences tend to cluster within documents but not across document boundaries, making it harder to exploit clustering effects. Also, typical access patterns for position data are different from those for docID and frequency data. We performed a detailed study of a number of existing and new techniques for compressing position data in web indexes. We also studied how to efficiently access position data for ranking functions that take proximity features into account.
  • Inverted Index Compression and Query Processing with Optimized Document Ordering
    Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. In this project, we study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case.
  • Using Graphics Processors for High Performance IR Query Processing
    Web search engines have to process tens of thousands of queries per second over tens of billions of documents. To deal with this heavy workload, such engines employ massively parallel systems consisting of thousands of machines. In this project, we investigated a new way to build such high performance IR systems using graphical processing units (GPUs) by taking advantage of GPUs' power of massive on-chip parallelism.
  • Algorithms for Low-Latency Remote File Synchronization
    The remote file synchronization problem is how to update an outdated version of a file located on one machine to the current version located on another machine with a minimal amount of network communication. A widely used open-source tool called rsync uses a single round of messages between the two machines to solve this problem. While research has shown that significant additional savings in bandwidth are possible by using multiple rounds, such approaches are often not desirable due to network latencies, increased protocol complexity, or other overheads at the endpoints. In this project, we studied single-round synchronization techniques (based on set reconciliation techniques) that offer significant benefits in bandwidth consumption while preserving many of the advantages of the em rsync approach.
  • Intern Project at Bigmac Team, Morgan Stanley
    Bigmac is the core database system at the fixed income division. Co-operating with my colleagues, I developed a few components of the Bigmac software library of financial applications in the fixed income division. In particular, we first parsed MQ and SOAP messages from various information sources, then searched and processed the corresponding information in databases, and did further processing for other applications, e.g. FTES and SODPOS.
  • Text Data Compression
    Gzip is a widely used compression tool distributed in many linux/unix machines. The key idea of it is to represent a string by its previous occurrence (match) in the same file. However, Gzip can only detect the nearby matches since it only looks up matches inside a sliding window. In this project, we designed a new compression algorithm by efficiently searching bigger matches that may be far away from the current string. Using this algorithm, we further developed a new data compression tool (it is open source and will be available soon) that performs better than most of the compression tools available on Linux/Unix machines, including Gzip, Rzip, Bzip2, Vcdiff, etc.
  • Video Stabilization in Search and Rescue Using Remotely Guided Rats
    Remotely guided rats or other animals are ideal for search and rescue operations because they are highly adept at negotiating difficult 3D terrain, in both light and dark. Recent research by our colleagues has demonstrated ability to use wireless brain stimulation to guide the movements of rats through a variety of difficult terrains. In this project, they mounted tiny video cameras on rats' backs and used such neuroscience technologies to guide them to capture the video we want. However, such video is often too unstable to be recognizable. Therefore, we developed the software for visualizing and stabilizing such video. The basic idea is to quickly mosaic a few successive images based on their rotated and translated results.
Related Sites
If you think my research work is interesting, I highly recommend you to check out the following sites:
My Labmates