The zdelta Home Page
zdelta-2.1 is now available! (download here)
23.Feb.2004 - A bug is removed from the multiple reference file version of zdelta-2.1!
Modifications:
- Extended interfaces, supporting streaming on the target data.
- Runtime parameterization of the library - 10 different compression levels.
- Online documentation and
FAQ
zdelta-2.0 (download here)
Modifications:
- Improved compression algorithm, resulting in better compression (maybe around 10% on the average).
- Support for multiple reference files (up to 4).
- Minor bug fixes.
zdelta is a general purpose lossless delta compression library
developed at Polytechnic University. It is implemented by modifying the zlib 1.1.3 compression library (the
modifications are marked in all of the original zlib 1.1.3
files). With its version 2.0, however, the library has
significantly deviated from zlib, though it still uses a lot of the
zlib code and structure. The zdelta
algorithm was designed by
Nasir Memon, Torsten Suel, and
Dimitre Trendafilov, and implemented by Dimitre Trendafilov. The work on
the library was supported by a grant from Intel Corporation. Torsten Suel was
also supported by NSF CAREER Award NSF CCR-0093400.
The possible applications for zdelta include:
- Reducing HTTP latency
- Compression at the file system level
- Revision Control Systems
- Software Distributions
- Creating of compact file backup systems
or many other applications involving sets of similar files or different
versions of a file.
The initial motivation for developing zdelta was the need for a utility
that efficiently generates deltas of HTML pages. The library,
however, is more general and could be successfully applied on files with
any structure and any size.
zdelta shares many of the zlib characteristics:
- It is free.
- It can be used on (almost) any computer hardware and operating system.
- It is thread safe.
- It uses a data format, based on the zlib data format, which is
portable across platforms.
- Supports multiple reference files - feature introduced in version 2.0
- Supports streaming on the target data (the reference data still needs to be provided at once) - feature introduced
in version 2.1
- zdelta library allows for choosing different compression levels
(from 0 to 9) - feature introduced in
version 2.1
However, there are also some differences:
- Unlike zlib, zdelta could expand the data (very unlikely but
possible).
- Compression ratio may degrade for large files.
- The library, while debugged carefully, is not yet as stable as
zlib and at this point should not be used to compress critical data.
For more details visit the list of FAQ
and the library online manual.
Please note that this software is provided as is, with absolutely no
express or implied warranty.
The zdelta algorithm is a variation of LZ77 (Lempel-Ziv 1977). The
target data is compressed by finding duplicated strings in the already
compressed data or in the explicitly provided reference data. Note that
zdelta differs from applying zlib with a dictionary!
A complete description of zdelta-1.0 algorithm is given in the
article zdelta: an Efficient Delta Compression
Tool available
here.
An article on zdelta-2.0 will be available soon.
The complete zdelta sources are available
here.
zdelta was tested on the following benchmarks:
Note: All experiments were run by compressing
each individual file from the target data set in relation to the
corresponding file in the reference data set. Compressing large
files, like large tar files, is still an active research.
- The gcc and emacs data sets used in the performance study in
Delta Algorithms: An Empirical Analysis (1998)
by Hunt, Vo, and Tichy (ACM Transactions on Software Engineering and
Methodology, 1998).
The data sets consist of the source code of two versions of gcc and emacs
(basically, text files with average size less than 64KB and high degree of
similarity).
The benchmark (Benchmark1) is available
here.
- A set of artificially created files that model the degree of similarity
between two files. In particular, we created two completely random files
file_0 and file_1 of fixed length, and then performed delta
compression between file_1 and another file file_N created by
a "blending" procedure that copies text from either file_0 or
file_1 according to a simple Markov process (basically binary files
with size of 1MB and varying degree of similarity).
The code for generating this benchmark (Benchmark2) is available
here.
The following experiments were run only with zdelta-2.0. The
purpose of those experiments was to test zdelta, and also to
explore potential applications for delta compression in general.
- Four versions of a set consisting of 10000 arbitrary
HTML pages. These versions, Version1, Version2,
Version3, and Version4 were crawled correspondingly on
August 13th, August 15th, September 2nd, and Oct 27th of year 2001.
We use Version1 as a reference
data, and compress the other sets in relation to it. Note
that the similarity between these versions decreases as the time
difference between them increases. The four sets (Benchmark 3) are
available here.
Potential application: web archives, version control tools.
- Collection of HTML files crawled from specific web sites. In this
experiment we compress collections of files with similar structure and
content by representing the entire set through appropriate delta encodings
between target and reference files.
Potential application: backup systems.
- Compression of linked HTML files. In this experiment we explore the
benefit of compressing a given web page in relation to pages linked to it.
We crawled a set of above 1000 random files. For each of this pages,
referred to as core pages, we crawled a set of up to 100
reference
pages selected in the following manner:
1. The reference page should be on the same host
2. There should be either a link from the core page to the
reference page, or there should be a link from the reference page to the
core page
We conducted four groups of experiments - compressing each of the core
pages using one, two, three, or four from the corresponding
reference files. For each experiment we present two numbers - the
average
and the best compression ratios
Potential application: reducing HTTP latency.
The experimental results for zdelta-2.0 are available
here.
The experimental results for zdelta-1.0 are available
here.
For more information on the conducted experiments refer to the results section
of
zdelta: an Efficient Delta Compression Tool.
The Perl scripts used for the experiments are available
here.
Copyright © Dimitre Trendafilov 2002.
dtrend01@utopia.poly.edu