Detecting Duplicates in Document Image Databases
Dr. Daniel Lopresti
Bell Laboratories -- Lucent Technologies
Monday, December 7th, 1998, 1:00pm-2:00pm
Library/CATT Building, Room LC102, Brooklyn Campus
Abstract:
Detecting duplicates in databases of scanned documents is a problem of growing importance.
The task is made difficult both by the various ways printed documents become degraded and
by vague notions of what it means to be a "duplicate." The potential payoffs can
be huge, however. For example, the U.S. Government's Gulf War Declassification Project has
as its charter the release of any and all information that might shed light on the cause
of Gulf War Illness. In one particular collection activity, 564,000 pages were gathered,
292,000 of which were later found to be duplicates of documents already on hand.
In this talk, I present a framework for clarifying and formalizing the duplicate detection
problem. The framework incorporates four distinct but related models, each with an algorithm for
its solution adapted from the realm of approximate string matching. The robustness of
these techniques is demonstrated through a series of experiments using test data derived
from real-world noise sources. I also discuss several heuristics that have the potential
to speed up the computation by several orders of magnitude.
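As a rough illustration of the approximate string matching idea mentioned in the abstract, the sketch below compares two OCR text strings by edit distance and flags them as probable duplicates when the distance is small relative to their length. It is not one of the four models presented in the talk; the function names, the normalization by the longer string, and the 0.2 threshold are assumptions chosen only for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so each row stays small
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def is_probable_duplicate(text_a: str, text_b: str,
                          threshold: float = 0.2) -> bool:
    """Hypothetical decision rule: normalized edit distance <= threshold."""
    longest = max(len(text_a), len(text_b))
    if longest == 0:
        return True  # two empty texts are trivially identical
    return edit_distance(text_a, text_b) / longest <= threshold


if __name__ == "__main__":
    clean = "Gulf War Illness"
    noisy = "Gu1f War I11ness"  # simulated OCR confusion of 'l' with '1'
    print(edit_distance(clean, noisy))
    print(is_probable_duplicate(clean, noisy))
```

The quadratic cost of this comparison on every pair of pages is one reason speed-up heuristics of the kind mentioned in the abstract matter for large collections.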
For more information, please contact Ed Wong at (718) 260-3523 (wong@vision.poly.edu).