Here is a list of online resources that might be helpful for the programming projects.
NEW: INFORMATION ABOUT COURSE PROJECTS
Material for Assignment 1:
NEW: What to submit for homework #1
How to parse out hyperlinks in Python
Another way to parse out hyperlinks (by Dmitry Shenkelbakh)
Library for robots.txt parsing in Python
An alternative, improved library for robots.txt parsing -- plus interesting discussion about robots.txt history
How to handle password-protected pages
Stack and queue data structures
Anatomy of an HTTP URL
White paper by Avi Rappoport on issues faced by web crawlers
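As a rough illustration of the hyperlink-parsing and robots.txt items above, here is a minimal sketch that uses only the Python standard library (html.parser, urllib.parse, urllib.robotparser). It is not the code behind any of the links listed here. The names LinkExtractor, extract_links, and allowed_by_robots are made up for this example, and the sketch assumes the page HTML and the robots.txt text have already been downloaded as strings.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects absolute URLs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the hyperlinks found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def allowed_by_robots(robots_txt, user_agent, url):
    """Check url against robots.txt content that was already fetched."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

A crawler would typically call extract_links on each downloaded page, then filter the results with allowed_by_robots before adding them to its queue. Real pages often contain malformed HTML, so treat this as a starting point only.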
Material for Assignment 2:
Some more hints on parsing the NZ data files
Some more hints on doc IDs, word IDs, and index construction
Code for a simple parser in C (read the readme), with Python wrappers provided
Code for a Java port of the parser (courtesy of Aleksander Dembowski)
Some comments on index structures that you need to build
Some comments on parsing and word boundaries
How to uncompress gzipped files in memory
Uncompressing gzipped files (courtesy of Fekri Kassem)
More on uncompressing gzipped files (courtesy of Fekri Kassem)
Another way to uncompress gzipped files (courtesy of Li Lu)
Some simple C code for I/O-efficient merge sort
Sample data set for starting the homework
(use the data set nz2.tar in this directory to develop your code).
Larger data set of 10% of the nz domain (note: this is a >500 MB tar file!)
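For the items above on uncompressing gzipped files in memory, here is a minimal Python sketch using the standard gzip and io modules. It is not taken from any of the linked write-ups. The function names gunzip_bytes and iter_gunzip_lines are hypothetical, and the sketch assumes the compressed data is already held in a bytes object (for example, after reading one data file from disk).

```python
import gzip
import io


def gunzip_bytes(data):
    """Decompress a gzip-compressed bytes object in one call."""
    return gzip.decompress(data)


def iter_gunzip_lines(data):
    """Yield decompressed lines one at a time, without writing a temp file."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as stream:
        for line in stream:
            yield line
```

The first function is simplest when the uncompressed data fits comfortably in memory. The second avoids building one large bytes object when you only need to scan the lines once.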