TR-CIS-2005-02 (05/31/2005)
Utku Irmak and Torsten Suel
Abstract
While much of the data on the Web is unstructured in nature, there
is also a significant amount of embedded structured data, such as product
information on e-commerce sites or stock data on financial sites. A large
amount of research has focused on the problem of generating wrappers,
i.e., software tools that allow easy and robust extraction of structured
data from text and HTML sources. In many applications, such as comparison
shopping, data has to be extracted from many different sources, making
manual coding of a wrapper for each source impractical. On the other hand,
fully automatic approaches are often not reliable enough, resulting in
low-quality extracted data. We describe a system for semi-automatic
wrapper generation that can be trained on various data sources in a simple
interactive manner. Our goal is to minimize the user effort needed to train
reliable wrappers by providing an intuitive training interface built on a
powerful underlying extraction language and training algorithm. We show
that our system achieves robust data extraction with
significantly less user effort than previous approaches.