Polytechnic University

Detecting Duplicates in Document Image Databases

Dr. Daniel Lopresti
Bell Laboratories -- Lucent Technologies

Monday, December 7th, 1998, 1:00pm-2:00pm
Library/CATT Building, Room LC102, Brooklyn Campus

Abstract:

Detecting duplicates in databases of scanned documents is a problem of growing importance. The task is made difficult both by the various ways in which printed documents become degraded and by vague notions of what it means to be a "duplicate." The potential payoffs can be huge, however. For example, the U.S. Government's Gulf War Declassification Project has as its charter the release of any and all information that might shed light on the cause of Gulf War Illness. In one particular collection activity, 564,000 pages were gathered, 292,000 of which were later found to be duplicates of documents already on hand.

In this talk, I present a framework for clarifying and formalizing the duplicate detection problem. This incorporates four distinct but related models, each with an algorithm for its solution adapted from the realm of approximate string matching. The robustness of these techniques is demonstrated through a series of experiments using test data derived from real-world noise sources. I also discuss several heuristics that have the potential to speed up the computation by several orders of magnitude.
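As a rough illustration of the approximate string matching idea mentioned above, the sketch below compares the recognized text of two scanned pages using plain Levenshtein edit distance. This is only a generic building block, not any of the four models or algorithms presented in the talk, which the abstract does not describe. The function names, the normalization by the longer string's length, and the 0.2 threshold are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else edit_distance(a, b) / longest


def is_probable_duplicate(a: str, b: str, threshold: float = 0.2) -> bool:
    """Hypothetical decision rule: the threshold is an arbitrary assumption."""
    return normalized_distance(a, b) <= threshold


if __name__ == "__main__":
    clean = "Detecting duplicates"
    noisy = "Detect1ng dup1icates"  # simulated recognition errors
    print(edit_distance(clean, noisy), is_probable_duplicate(clean, noisy))
```

The quadratic cost of this dynamic program per pair of pages is one reason that heuristics for speeding up the computation matter when a collection holds hundreds of thousands of pages.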


For more information, please contact Ed Wong at (718) 260-3523 (wong@vision.poly.edu).