This material was compiled for a one-day course on "Web Search Engines and Search Technology" held on February 9, 2001, at Polytechnic University. The course was organized by the Center for Advanced Technology in Telecommunications located at Polytechnic University, and was taught by Prof. Torsten Suel.
Following is a list of resources for background information on web search technology, including papers, books, online articles, and web sites. Links to online versions of papers are provided when available. Note that this is not meant to be a complete list of references for the area, but a starting point for finding more information.
TABLE OF CONTENTS:
PART I: Web Sites, News Stories, and General Technical Reference
Search Engines and Tools - General:
Google Search Engine (major search engine)
All The Web - Fast Technologies
(major search engine)
Yahoo (directory & portal)
Altavista Search Engine
(major search engine & portal wanna-be)
Northernlight (major search engine)
Ask Jeeves (takes questions in English, sort of)
Alexa (personal search assistant)
Search Engines and Tools - Specialized, Meta, and Deep Web:
searchability.com
(guide to specialized, meta, and deep-web search engines)
Five types of recommended search tools
(from UC Berkeley libraries site)
Complete Planet
(deep web search engine; see their FAQ for some (biased) info)
Berkeley Cha-Cha Search
(search engine for the UC Berkeley Campus)
Research Index: Computer Science Papers
(specializes in computer science research papers)
Cora: Computer Science Research Paper Search Engine
(specializes in computer science research papers)
FindLaw
(specializes in legal information)
Achoo
(specializes in health information)
Organizations and Resources:
World Wide Web Consortium (W3C)
Internet Archive
(attempt to archive the web as it evolves)
World Wide Web Conference
(Home of the premier WWW research conference; has pointers to past proceedings)
Core set of bibliographic refs
(large bibliography of web-related research with links to many papers)
How search engines work
(from a page of a course offered at U. of Iowa, with some material)
CS912: Information Retrieval on the WWW
(course on search engines offered here at Poly)
Finding World Wide Web Pages
(site from U. of Tennessee/Morris)
Technical Reference:
Search Engine Watch
(lots of general information and comparison of different engines)
Search Engine World (similar)
Search Engine Showdown (similar)
Uneven Internet (miscellaneous info)
Search Engine Features for Webmasters
(what webmasters should know about search engines)
Inktomi Webmap
(various statistics about the web)
The Web Robot Pages
(pages on crawlers and rules for them)
Robot Exclusion
(how to exclude crawlers using exclusion protocol and meta tags)
Anatomy of an http URL
(explains the general form of a URL)
How the web works: HTTP and CGI explained
(great explanation of HTTP and CGI)
Python Homepage
(scripting language popular on the web)
Berkeley DB Documentation
(great tool for writing robust web search tools)
Python Bindings for Berkeley DB
(allows Berkeley DB to be used from Python)
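As a small illustration of the robot exclusion entry above, the following is a minimal sketch of how a crawler written in Python might check a site's exclusion rules before fetching a page. It uses the standard library module `urllib.robotparser`. The robots.txt content, the crawler name `ExampleBot`, and the `www.example.com` URLs are made up for this example. The meta tag mechanism is separate: a page can opt out individually with a tag such as `<meta name="robots" content="noindex,nofollow">`, which this sketch does not handle.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example):
# every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html")

print("index.html allowed:", allowed)
print("private/data.html allowed:", blocked)
```

A real crawler would fetch `/robots.txt` from each host (for example with `set_url` and `read` on the same parser class) and cache the result instead of using a hard-coded string.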
Various Talks, News Stories, Companies, and Research Project Pages:
Search tools leave Web out of sight
(CNET article on limitations of current search engines)
Bots snarl sites as shoppers seek PlayStation 2
(CNET article on shopping bots)
Crawling towards eternity
(article on the Internet Archive)
The Bush-Google Story
(read this and laugh. But it doesn't work anymore on Google.)
AdvancedLinks.com
(company specializing in search engine manipulation)
BayTSP Inc.
(company specializing in finding copyright violations)
The Informant
(Project at Dartmouth on search agents)
ACID vs. BASE
(Talk by Brewer from Inktomi, makes case why ACID properties not needed for search engines)
Federated Facts and Figures
(UC Berkeley project)
Focused Crawling
(page by Soumen Chakrabarti explaining focused crawling)
Web Sphinx
(crawler tool set from CMU)
Mercator Web Crawler
(research project at DEC/Compaq)
Books:
Managing Gigabytes : Compressing and Indexing Documents and Images, by I. Witten, A. Moffat, and T. Bell. Morgan Kaufmann 1999. (book on compression, indexing, and querying) Amazon catalog entry.
Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto. Addison-Wesley 1999. (book on information retrieval) Amazon catalog entry.
Python Essential Reference, by D. Beazley. New Riders 1999. (book on Python. There are many others, and you can also get a lot of documentation at the official Python web site, but this one is the best and most concise if you already know C.) Amazon catalog entry.
Data on the Web : From Relations to Semistructured Data and XML, by S. Abiteboul, P. Buneman and D. Suciu. Morgan Kaufmann 1999. (book on semistructured data. Somewhat different perspective than the one presented in the course. This is more XML and data oriented.) Amazon catalog entry.
Basics: Architecture, Crawling, Indexing:
Brin/Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
7th World Wide Web Conference.
(paper on the Google architecture)
HTML
Lawrence/Giles: Searching the World Wide Web.
Science, 280, 1998.
(overview paper about some recent work)
PDF
Arasu/Cho/Garcia-Molina/Paepcke/Raghavan: Searching the Web.
To appear in ACM Transactions on Internet Technology, 2001.
(overview of some recent work at Stanford)
PDF
Brin/Motwani/Page/Winograd: What can you do with a Web in your Pocket?
WebDB 1998 Workshop.
(another, slightly older overview of some recent work at Stanford)
Postscript
Rappaport: Robots & Spiders & Crawlers. White Paper.
(discusses variety of issues concerning crawler behavior. No mention of performance issues though)
PDF
Heydon/Najork: Mercator: A Scalable, Extensible Web Crawler.
World Wide Web, December 1999, pages 219-229.
(performance issues in crawlers.)
Look for paper
Olson/Bostic/Seltzer: Berkeley DB.
1999 Summer Usenix Technical Conference.
(overview paper describing Berkeley DB)
HTML
Postscript
Moffat/Bell: In Situ Generation of Compressed Inverted Files.
Journal of the American Society for Information Science, August 1995.
(how to build a compressed inverted index with small space overhead. Not available online)
Bell/Moffat/Neville-Manning/Witten/Zobel: Data Compression in Full-Text Retrieval Systems.
Journal of the American Society for Information Science, October 1993.
(comprehensive use of compression in digital libraries. Not available online)
Melnik/Raghavan/Yang/Garcia-Molina: Building a Distributed Full-Text Index for the Web.
WWW Conference, 2001.
(shows how to efficiently build a compressed inverted index using Berkeley DB)
PDF
Pagerank and HITS:
Page/Brin: The PageRank citation ranking: bringing order to the web.
(manuscript introducing pagerank)
Postscript
Cho/Garcia-Molina/Page: Efficient Crawling Through URL Ordering.
7th WWW Conference, 1998.
(describes how to use pagerank to crawl important pages first)
HTML
Kleinberg: Authoritative Sources in a Hyperlinked Environment.
Journal of the ACM, 2000.
(introduces HITS)
Postscript
Chakrabarti/Dom/Raghavan/Rajagopalan/Gibson/Kleinberg:
Automatic Resource Compilation by Analyzing Hyperlink Structure and
Associated Text.
7th WWW Conference, 1998.
(adds analysis of anchortext to HITS)
HTML
Bharat/Henzinger: Improved Algorithms for Topic Distillation in Hyperlinked Environments.
21st Conference on Research and Development in Information Retrieval (SIGIR), 1998, pp. 104-111.
(refinement of HITS)
Gzipped Postscript
Miscellaneous Link-Based Techniques:
Chakrabarti/Dom/Kumar/Raghavan/Rajagopalan/Tomkins/Gibson/Kleinberg:
Mining the Web's Link Structure.
IEEE Computer, August 1999.
(overview of some techniques by IBM people)
Look for paper
Lempel/Moran:
The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect.
9th WWW Conference, 2000.
(slight modification of the HITS method)
HTML
Kumar/Raghavan/Rajagopalan/Tomkins:
Trawling the Web for Emerging Cyber-Communities.
8th WWW Conference, 1999.
(finds bipartite cliques in the web graph)
HTML
Kumar/Raghavan/Rajagopalan/Tomkins:
Extracting Large-Scale Knowledge Bases from the Web.
International Conference on Very Large Data Bases, 1999.
(expands on the previous paper)
Look for paper
Dean/Henzinger: Finding Related Pages in the World Wide Web.
8th World Wide Web Conference, 1999.
(uses link structure to find related pages)
HTML
Rafiei/Mendelzon: What is this Page Known for? Computing Web Page Reputations.
9th World Wide Web Conference, 2000.
(estimates reputations based on link structure)
HTML
Zhang/Dong: An Efficient Algorithm to Rank Web Resources.
9th World Wide Web Conference, 2000.
(another link-based ranking technique)
HTML
Ding/Gravano/Shivakumar: Computing geographical scopes of web resources.
Conference on Very Large Databases (VLDB), 2000.
(how to use link structure to estimate geographical scope of a page)
PDF
Chakrabarti/Dom/Indyk: Enhanced Hypertext Classification using Hyperlinks.
ACM SIGMOD Conference, 1998.
(uses hyperlinks to improve automatic classification of pages)
Postscript
Focused Crawling Techniques:
Chakrabarti/van den Berg/Dom:
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery.
8th WWW Conference, 1999.
(focused crawling based on link analysis a la HITS)
HTML
Chakrabarti/van den Berg/Dom:
Distributed Hypertext Resource Discovery Through Examples.
International Conference on Very Large Data Bases, 1999.
(more on focused crawling based on link analysis)
PDF
Diligenti/Coetzee/Lawrence/Giles/Gori:
Focused crawling using context graphs.
International Conference on Very Large Data Bases, 2000.
(learning based approach to focused crawling)
PDF
Rennie/McCallum: Using Reinforcement Learning to Spider the Web Efficiently.
Proceedings of ICML'99.
(another learning based approach to focused crawling)
Look for paper
Hersovici/Jacovi/et al: The shark-search algorithm - An application.
7th WWW Conference, 1998.
(heuristic approach to focused crawling)
HTML
deBra/Post: Searching for Arbitrary Information in the WWW: the Fish-Search for Mosaic.
WWW Conference, 1994.
(very early heuristic approach to focused crawling)
Look for paper
Mukherjea: WTMS: A System for Collecting and Analyzing Topic-Specific Web Information.
9th WWW Conference, 2000.
(another heuristic, plus system design)
HTML
Recrawling and Change on the Web:
Brewington/Cybenko: Keeping Up with the Changing Web.
IEEE Computer, Vol. 33, No. 5, May 2000.
(on recrawling and change of pages)
Look for paper
Brewington/Cybenko: How dynamic is the Web?
9th WWW Conference, 2000.
(expands on the above; similar to some Cho/Garcia-Molina work)
HTML
Douglis/Ball/Chen/Koutsofios:
The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web.
World Wide Web, pp. 27-44. January 1998.
(system for studying changes in web pages)
Gzipped Postscript
Douglis/Feldmann/Krishnamurthy/Mogul:
Rate of Change and other Metrics: a Live Study of the World Wide Web.
USENIX Symposium on Internet Technologies and Systems, 1997.
(study of rate of change on the web)
Gzipped Postscript
Cho/Garcia-Molina: Synchronizing a Database to Improve Freshness.
Conference on Management of Data (SIGMOD), 2000.
(proposes measure of freshness of a collection and strategies for maximizing this measure)
PDF
Cho/Garcia-Molina: The Evolution of the Web and Implications for an Incremental Crawler.
26th Conference on Very Large Databases (VLDB), 2000.
(how to build an efficient incremental crawler)
PDF
Brandman/Cho/Garcia-Molina/Shivakumar:
Crawler-Friendly Web Servers.
Workshop on Performance and Architecture of Web Servers (PAWS), June 2000.
(scheme by which servers can notify crawlers of updates on site)
PDF
Mirrors, Duplicates, and Clustering:
Broder/Glassman/Manasse/Zweig: Syntactic Clustering of the Web.
6th WWW Conference, 1997.
(clusters web pages by similarity)
HTML
Bharat/Broder/Dean/Henzinger: A Comparison of Techniques to Find Mirrored Hosts on the WWW.
Workshop on Organizing Webspace at the Fourth ACM Conference on Digital Libraries 1999.
Gzipped Postscript
Bharat/Broder:
Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content.
8th WWW Conference, 1999.
(how to find replicated servers)
HTML
Cho/Shivakumar/Garcia-Molina: Finding replicated Web collections.
International Conference on Management of Data (SIGMOD), 2000.
(how to identify collections of pages that are replicated on different servers)
PDF
Shivakumar/Garcia-Molina: Finding near-replicas of documents on the web.
Workshop on Web Databases (WebDB), 1998.
(efficient duplicate detection)
PDF
Indyk/Haveliwala/Gionis: Scalable Techniques for Clustering the Web.
Workshop on Web Databases (WebDB), 2000.
(fast technique for finding very similar pages among large set)
Postscript
Random Walks, Size Estimation, and Graph Models:
Bar-Yossef/Berg/Chien/Fakcharoenphol/Weitz:
Approximating Aggregate Queries about Web Pages via Random Walks.
International Conference on Very Large Databases, 2000.
(random walking/sampling technique)
Look for paper
Henzinger/Heydon/Mitzenmacher/Najork: On Near-Uniform URL Sampling.
9th WWW Conference, 2000.
(another random walking/sampling technique)
HTML
Henzinger/Heydon/Mitzenmacher/Najork:
Measuring Index Quality Using Random Walks on the Web.
8th WWW Conference, 1999.
(tries to measure quality of the pages in different engines)
HTML
Bharat/Broder:
A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines.
7th WWW Conference, 1998.
(tries to measure size and overlap between engines)
HTML
Broder/Kumar/et al: Graph structure in the web.
9th WWW Conference, 2000.
(studies connectivity structure of the web graph)
HTML
Bharat/Broder/et al: The Connectivity Server: Fast Access to Linkage Information on the Web.
7th World Wide Web Conference, 1998.
HTML
Kleinberg/Kumar/Raghavan/Rajagopalan/Tomkins:
The Web as a graph: measurements, models, and methods.
Fifth Annual International Computing and Combinatorics Conference (COCOON), 1999.
(proposes and motivates random graph model for the web)
Postscript
Kumar/Raghavan/Rajagopalan/Sivakumar/Tomkins/Upfal:
Stochastic models for the web graph.
IEEE Symposium on Foundations of Computer Science, 2000.
(analysis of random graph model for the web)
PDF
Miscellaneous Others:
Kleinberg/Tomkins:
Applications of Linear Algebra in Information Retrieval and Hypertext Analysis.
Tutorial paper at PODS'99.
(overview of techniques, including link-based ones)
Look for paper
Brin: Extracting Patterns and Relations from the World Wide Web.
WebDB 98 Workshop.
(tries to extract book information from the web)
Postscript
Pringle/Allison/Dowe: What is a tall poppy among Web pages?
7th World Wide Web Conference, 1998.
(learning approach to avoiding manipulation)
HTML
Stata/Bharat/Maghoul:
The Term Vector Database: fast access to indexing terms for Web pages.
9th World Wide Web Conference, 2000.
(maintains short list of indexing terms or keywords for each page)
HTML
Chakrabarti/Srivastava/Subramanyam/Tiwari:
Using Memex to archive and mine community Web browsing experience.
9th WWW Conference, 2000.
(integrated browsing and search/analysis)
HTML
Hirai/Raghavan/Garcia-Molina/Paepcke: WebBase : A repository of web pages.
9th WWW Conference, 2000.
(storage system based on linux servers for large web page collections)
HTML
Spertus/Stein: Squeal: A Structured Query Language for the Web.
9th WWW Conference, 2000.
(query language for the web)
HTML
Buyukkokten/Cho/Garcia-Molina/Gravano/Shivakumar:
Exploiting geographical location information of web pages.
Workshop on Web Databases (WebDB'99), June 1999.
(providing locally relevant information)
PDF