Computer & Information Science Department   Polytechnic University

Interactive Wrapper Generation with Minimal User Effort

TR-CIS-2005-02 (05/31/2005)
Utku Irmak and Torsten Suel

Abstract
While much of the data on the Web is unstructured, there is also a significant amount of embedded structured data, such as product information on e-commerce sites or stock data on financial sites. A large body of research has focused on the problem of generating wrappers, i.e., software tools that allow easy and robust extraction of structured data from text and HTML sources. In many applications, such as comparison shopping, data has to be extracted from many different sources, making manual coding of a wrapper for each source impractical. On the other hand, fully automatic approaches are often not reliable enough, resulting in low-quality extracted data. We describe a system for semi-automatic wrapper generation that can be trained on various data sources in a simple interactive manner. Our goal is to minimize the user effort needed to train reliable wrappers by providing an intuitive training interface built on top of a powerful extraction language and training algorithm. We show that our system achieves robust data extraction with significantly less user effort than previous approaches.
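
To make the notion of a wrapper concrete, the following is a minimal sketch of a hand-coded wrapper for a single, hypothetical product-listing layout. It is not the system described in the paper; the class name ProductWrapper, the function extract, the sample markup, and the "name" and "price" class attributes are all assumptions chosen for illustration. It uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Hand-coded wrapper for ONE assumed layout:
    <li><span class="name">...</span> <span class="price">...</span></li>
    """

    def __init__(self):
        super().__init__()
        self.records = []    # extracted (name, price) tuples
        self._field = None   # field currently being read, if any
        self._current = {}   # fields collected for the record in progress

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Only keep text that appears inside a recognized field.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            # End of a list item closes the current record.
            self.records.append(
                (self._current.get("name"), self._current.get("price"))
            )
            self._current = {}


def extract(html):
    """Run the wrapper over an HTML string and return the records."""
    wrapper = ProductWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


# Hypothetical page fragment in the layout the wrapper was written for.
SAMPLE = (
    '<ul>'
    '<li><span class="name">Widget</span> <span class="price">$9.99</span></li>'
    '<li><span class="name">Gadget</span> <span class="price">$24.50</span></li>'
    '</ul>'
)

# The same data in a different layout, which this wrapper does not understand.
OTHER_LAYOUT = '<table><tr><td>Widget</td><td>$9.99</td></tr></table>'

if __name__ == "__main__":
    print(extract(SAMPLE))
    print(extract(OTHER_LAYOUT))
```

The wrapper is tied to one markup convention: the same product data presented as a table yields no records at all. Writing and maintaining one such extractor per source is the manual effort that the abstract calls impractical when many sources are involved.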