2000~2003 Research Assistant at Insitute of Automation, Chinese Academy of Sciences
2004 (Fall)~2005 (Fall) Teaching Assistant for the course Introduction to Database
at Polytechnic Institute of NYU
Selected Publications
Efficient Term Proximity Search with Term-Pair Indexes Hao Yan, S.Shi, F.Zhang, T.Suel and J.Wen
19th ACM International Conference on Information and Knowledge Management (CIKM'10), Toronto, 2010, accepted for publication (Acceptance rate 13.4%)
Compressing Term Positions in Web Indexes Hao Yan, S.Ding and T.Suel.
The 32nd Annual International ACM SIGIR Conference (SIGIR'09), Boston, July, 2009. (Acceptance rate 15.8%)
Digital Recognition System Based on Mobile Robot Hao Yan, M.Tan, L.Li. The 2002 International Conference on Control and Automation (ICCA'02), Xiamen, China, 2002.
Selected Projects
Efficient Term Proximity Search with Term-Pair Indexes
Early termination techniques can greatly improve the efficiency of query processing by avoiding evaluating the full inverted lists of the query terms. However, most existing research on such techniques often does not consider the term proximity, that is, the distance between term occurrences in the document, while real search engines usually integrate it with other features to improve retrieval quality. In this project, we proposed new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model, by building additional term pair indexes. Our preliminary results showed that our methods can achieve significant improvements over the existing methods.
Revisiting Globally Sorted Indexes for Efficient Document
Retrieval
Most early termination
techniques first build new inverted indexes by sorting the inverted
lists in the order of either the term-dependent information, e.g.,
term frequencies or term IR scores, or the term-independent
information, e.g., static rank of the document; and then apply
appropriate retrieval strategies on the resulting indexes. Although
the methods based only on the static rank have been shown to be ineffective for the early termination, there are still many
advantages of using the methods based on term-independent
information.
In this project, we propose new techniques to organize inverted
indexes based on the term-independent information beyond static
rank and study the new retrieval strategies on the resulting indexes.
Compact Full-Text Indexing of Versioned Document Collections Versioned document collections contain multiple versions of each document. Important large-scale examples of such collections are Wikipedia or the web page archive maintained at the Internet Archive. A straightforward approach to indexing such collections would simply treat each document version as a separate document, resulting in an index size that scales linearly with the number of versions. In this project, we proposed several new techniques for organizing and compressing inverted index structures for versioned document collections.
Compressing Term Positions in Web Indexes Most previous work on compressing web search engine indexes has focused on compressing docID and
frequency data. In this project, we focused on techniques for compressing term positions in web
indexes. Compression of term positions in web pages is complicated by the fact that
term occurrences tend to cluster within documents but not across document boundaries, making it harder to exploit clustering effects. Also, typical access patterns for position data are different from those for docID and frequency
data. We performed a detailed study of a number of existing and new techniques for compressing
position data in web indexes. We also studied how to efficiently access position data for ranking functions that take
proximity features into account.
Inverted Index Compression and Query Processing with Optimized Document Ordering Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. In this project, we study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case.
Using Graphics Processors for High Performance IR Query Processing Web search engines have to process tens of thousands of
queries per second over tens of billions of documents. To deal with this
heavy workload, such engines employ massively parallel systems consisting of
thousands of machines. In this project, we investigated a new way to build such high performance IR systems using
graphical processing units (GPUs) by taking advantage of GPUs' power of massive on-chip parallelism.
Algorithms for Low-Latency Remote File Synchronization The remote file synchronization problem is how to update an outdated
version of a file located on one machine to the current version
located on another machine with a minimal amount of network
communication. A widely used open-source tool called rsync uses
a single round of messages between the two machines to solve this
problem. While research has shown that significant additional savings in
bandwidth are possible by using multiple rounds, such approaches are
often not desirable due to network latencies, increased protocol
complexity, or other overheads at the endpoints. In this project, we studied single-round synchronization techniques (based
on set reconciliation techniques) that
offer significant benefits in bandwidth consumption while preserving
many of the advantages of the em rsync approach.
Intern Project at Bigmac Team, Morgan Stanley Bigmac is the core database system at the fixed income division. Co-operating with my colleagues, I developed a few components of the Bigmac software library of financial applications in the fixed income division. In particular, we first parsed MQ and SOAP messages from various information sources, then searched and processed the corresponding information in databases, and did further processing for other applications, e.g. FTES and SODPOS.
Text Data Compression Gzip is a widely used compression tool distributed in many linux/unix machines. The key idea of it is to represent a string by its previous occurrence (match) in the same file. However, Gzip can only detect the nearby matches since it only looks up matches inside a sliding window. In this project, we designed a new compression algorithm by efficiently searching bigger matches that may be far away from the current string. Using this algorithm, we further developed a new data compression tool (it is open source and will be available soon) that performs better than most of the compression tools available on Linux/Unix machines, including Gzip, Rzip, Bzip2, Vcdiff, etc.
Video Stabilization in Search and Rescue Using Remotely Guided Rats Remotely guided rats or other animals are ideal for search and rescue operations because they are highly adept at negotiating difficult 3D terrain, in both light and dark. Recent research by our colleagues has demonstrated ability to use wireless brain stimulation to guide the movements of rats through a variety of difficult terrains. In this project, they mounted tiny video cameras on rats' backs and used such neuroscience technologies to guide them to capture the video we want. However, such video is often too unstable to be recognizable. Therefore, we developed the software for visualizing and stabilizing such video. The basic idea is to quickly mosaic a few successive images based on their rotated and translated results.
Related Sites
If you think my research work is interesting, I highly recommend you to check out the following sites: