![]() |
![]() |
TR-CIS-2002-05 (12/27/2002)
Zan Ouyang, Nasir Memon, Torsten Suel, Dimitre Trendafilov
Abstract
Delta compression techniques are commonly used to succinctly represent
an updated version of a file with respect to an earlier one. In this
paper, we study the use of delta compression in a somewhat different
scenario, where we wish to compress a large collection of (more or less)
related files by performing a sequence of pairwise delta compressions.
The problem of finding an optimal delta encoding for a collection of files
by taking pairwise deltas can be reduced to the problem of computing a
branching of maximum weight in a weighted directed graph, but this solution
is inefficient and thus does not scale to larger file collections. This
motivates us to propose a framework for cluster-based delta compression that
uses text clustering techniques to prune the graph of possible pairwise delta
encodings. To demonstrate the efficacy of our approach, we present
experimental results on collections of web pages. Our experiments
show that cluster-based delta compression of collections provides significant
improvements in compression ratio as compared to individually compressing
each file or using tar+gzip, at a moderate cost in efficiency.