Luis Gravano
Columbia University
Friday, March 26, 2004, 11:00am - 12:00pm
LC 102, Brooklyn Campus, Polytechnic University

Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Hence, traditional search engines do not index this valuable information. One way to facilitate access to "hidden-web" databases is through commercial Yahoo!-like directories, which organize these databases manually into categories that users can browse. In this talk, I will describe a technique to automate the classification of hidden-web databases. Our technique adaptively probes the databases with queries derived from document classifiers, without retrieving any documents. A large-scale experimental evaluation over 130 real web databases indicates that our technique classifies databases with high accuracy, using on average fewer than 200 queries of at most four words each per database.
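As a rough illustration of the probing idea, the sketch below classifies a database top-down using only the number of matches it reports for each probe query. It is a minimal sketch under stated assumptions, not the method presented in the talk: the category tree, the hand-written probe queries, the `match_count` callback, and the two thresholds (`min_share`, `min_matches`) are all hypothetical. In particular, relying on reported match counts is one assumed way to probe "without retrieving any documents."

```python
from typing import Callable, Dict, List

# Hypothetical category tree and probe queries. In the technique described,
# probes are derived from document classifiers rather than written by hand.
CHILDREN: Dict[str, List[str]] = {
    "Root": ["Health", "Sports"],
    "Health": ["Diseases", "Fitness"],
    "Sports": [],
    "Diseases": [],
    "Fitness": [],
}
PROBES: Dict[str, List[str]] = {
    "Health": ["medicine patient", "clinical treatment"],
    "Sports": ["soccer league", "basketball playoffs"],
    "Diseases": ["cancer tumor", "diabetes insulin"],
    "Fitness": ["aerobic exercise", "weight training"],
}


def classify(match_count: Callable[[str], int],
             node: str = "Root",
             min_share: float = 0.4,
             min_matches: int = 10) -> List[str]:
    """Return the categories assigned to a database, probing top-down."""
    # Sum the reported match counts per child category; no documents are fetched.
    counts = {c: sum(match_count(q) for q in PROBES[c]) for c in CHILDREN[node]}
    total = sum(counts.values())
    assigned: List[str] = []
    for child, n in counts.items():
        # Adaptive step: only sufficiently strong children are explored further,
        # so probes for unpromising subtrees are never sent.
        if total > 0 and n >= min_matches and n / total >= min_share:
            assigned.extend(classify(match_count, child, min_share, min_matches))
    # If no child qualifies, the database stays at the current node.
    return assigned or [node]


# Toy stand-in for a real search interface: fixed match counts per query.
fake_counts = {
    "medicine patient": 120, "clinical treatment": 80,
    "soccer league": 3, "basketball playoffs": 1,
    "cancer tumor": 90, "diabetes insulin": 60,
    "aerobic exercise": 5, "weight training": 4,
}
print(classify(lambda q: fake_counts.get(q, 0)))
```

In this toy run, only the probes under "Health" and then "Diseases" pass both thresholds, so the probes for categories below "Sports" would never be issued. That pruning is what keeps the number of queries small.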

An alternative way to facilitate access to hidden-web databases is through "metasearchers," which provide a unified query interface to search many databases at once. For efficiency, a critical task for a metasearcher is selecting the most promising databases to search for a given query, a task that typically relies on statistical summaries of the database contents. In this talk, I will also describe a recent technique to derive content summaries from hidden-web databases. We exploit our probing-based classification algorithm to adaptively extract documents that are representative of the topic coverage of the databases. We can then build content summaries from these topically focused document samples. A large-scale experimental evaluation over a variety of databases indicates that our new content-summary construction technique is efficient and produces more accurate summaries than previously proposed strategies.
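To make the role of content summaries concrete, the sketch below builds a summary from a document sample and uses it to rank databases for a query. It is a minimal sketch under stated assumptions: the summary format (fraction of sampled documents containing each word), the product-based `score` function, and the toy samples are illustrative choices, not the summaries or selection algorithm presented in the talk.

```python
from collections import Counter
from typing import Dict, List


def build_summary(sample_docs: List[str]) -> Dict[str, float]:
    """Map each word to the fraction of sampled documents that contain it."""
    if not sample_docs:
        return {}
    df: Counter = Counter()
    for doc in sample_docs:
        df.update(set(doc.lower().split()))  # count each word once per document
    return {word: n / len(sample_docs) for word, n in df.items()}


def score(summary: Dict[str, float], query: str) -> float:
    """Illustrative score: product of the per-word document fractions."""
    result = 1.0
    for word in query.lower().split():
        result *= summary.get(word, 0.0)
    return result


def select_databases(summaries: Dict[str, Dict[str, float]],
                     query: str, k: int = 1) -> List[str]:
    """Return the names of the k highest-scoring databases for the query."""
    ranked = sorted(summaries,
                    key=lambda name: score(summaries[name], query),
                    reverse=True)
    return ranked[:k]


# Hypothetical document samples, standing in for documents extracted by probing.
samples = {
    "health_db": ["cancer treatment options", "diabetes treatment and diet"],
    "sports_db": ["soccer league results", "injury treatment for soccer players"],
}
summaries = {name: build_summary(docs) for name, docs in samples.items()}
print(select_databases(summaries, "cancer treatment"))
```

Because the metasearcher consults only the summaries at query time, it can skip databases that are unlikely to contain matches. The quality of that decision depends on how representative the document sample is.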

Luis Gravano has been on the faculty of the Computer Science Department at Columbia University since September 1997, where he has been an associate professor since July 2002. From January through August 2001, Luis was a Senior Research Scientist at Google (on leave from Columbia University). He received his Ph.D. degree in Computer Science from Stanford University in 1997. He also received an M.S. degree from Stanford University in 1994 and a B.S. degree from the Escuela Superior Latinoamericana de Informatica (ESLAI), Argentina, in 1990. Luis is an associate editor of the ACM Transactions on Information Systems, as well as database program chair for the upcoming ACM CIKM 2004 and co-chair of the upcoming WebDB 2004 workshop. Luis is also a recipient of a CAREER award from the National Science Foundation.

This talk describes work performed jointly with Panos Ipeirotis (Columbia) and Mehran Sahami (Stanford/Google).

