Here is an explanation of how to handle password-protected sites. Similar approaches can also be used to handle or detect other types of exceptions, such as broken links.

***************************************************************

Password protected sites (handling HTTP Error 401 -- authentication required)

Python offers two functions for opening a web page, urlopen() and urlretrieve(). They are very convenient and hide the details of the HTTP protocol from the programmer. However, they have one feature that is unacceptable for a crawler: when we try to open a password-protected site, they prompt the user for a username and password.

To override this way of handling authentication errors, we need to know how urlopen() and urlretrieve() are organized internally. If we look at the source code of the urllib module, we will see that both functions use a helper class, FancyURLopener. Basically, all urlopen() and urlretrieve() do is call the open() and retrieve() methods of FancyURLopener. Nothing prevents us from using URL openers directly.

There are two types of URL openers defined in urllib: a base class, URLopener, and a derived class, FancyURLopener. The base class does not handle any HTTP errors and simply raises an exception if there is a problem. For example, if you try to open a password-protected page, URLopener will raise an exception with the parameters ("http error", 401, "Unauthorized").

FancyURLopener is a child class of URLopener. Unlike its parent, FancyURLopener handles all HTTP errors (including the authentication error, 401) and does not raise exceptions. Its handler for authentication errors prompts the user to enter a username and password. Since urlopen() and urlretrieve() use FancyURLopener internally, they ask for a password every time you open a password-protected site.

There are several ways around this problem.
One solution is to derive a new class from FancyURLopener and override the handler for error 401. Here is an example of a derived class, myURLopener, which subclasses FancyURLopener and overrides http_error_401(), the handler for authentication errors:

    # Custom URLopener overriding authentication error handling
    import urllib

    class myURLopener(urllib.FancyURLopener):
        def http_error_401(self, url, fp, errcode, errmsg, headers, data=None):
            return None  # do nothing (in particular, do not prompt the user)

Here is an example of a program which tries to open a password-protected page using our custom URL opener:

    url_opener = myURLopener()                                # create URLopener
    data = url_opener.open("http://www.lotus.com/names.nsf")  # open file by url
    print data.read()                                         # print content of downloaded file

If you run this program, you will see that it no longer asks the user for a password. Because the handler returns None, the opener falls back to its default error handling and returns the server's error response, so what gets printed is the body of the error page.

Another solution is to use the base class URLopener instead of FancyURLopener. In this case no error handling is done automatically, and it is our responsibility to handle the various HTTP errors. This approach is more complicated but gives us more control. Here is an example of a program which uses it:

    import urllib

    url_opener = urllib.URLopener()                               # create URLopener
    try:
        data = url_opener.open("http://www.lotus.com/names.nsf")  # open file by url
        print data.read()                                         # print the content of downloaded file
    except IOError, error_code:                                   # catch the error
        if error_code[0] == "http error":
            if error_code[1] == 401:                              # password protected site
                print "Authorization required"
            elif error_code[1] == 404:                            # file not found
                print "File not found"
            else:                                                 # any other HTTP error
                print error_code
        else:                                                     # non-HTTP I/O error
            print error_code

Notice that since plain URLopener does not handle HTTP errors automatically, we have to put open() in a try block and catch the various HTTP errors ourselves. If you run this program, it will try to open the page http://www.lotus.com/names.nsf. This page is password protected, so open() will raise an exception. The exception will be caught and the message "Authorization required" will be displayed.
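
The programs above are written for the Python 2 urllib module. As a rough sketch of the same idea in Python 3, and assuming the standard urllib.request and urllib.error modules, urlopen() does not prompt for a password; it raises HTTPError, which a crawler can catch and inspect. The names fetch_or_report, opener and fake_protected_site below are hypothetical and exist only for this illustration. The fake opener stands in for a real server so that the sketch runs without network access.

```python
import urllib.error
import urllib.request
from email.message import Message


def fetch_or_report(url, opener=urllib.request.urlopen):
    """Return (body, None) on success or (None, message) on failure.

    `opener` is a hypothetical hook: any callable that takes a URL and
    returns a file-like response. It defaults to urllib.request.urlopen.
    """
    try:
        response = opener(url)
        try:
            return response.read(), None
        finally:
            response.close()
    except urllib.error.HTTPError as err:      # the server answered with an error status
        if err.code == 401:                    # password protected site
            return None, "Authorization required"
        if err.code == 404:                    # file not found
            return None, "File not found"
        return None, "HTTP error %d" % err.code
    except urllib.error.URLError as err:       # the server could not be reached at all
        return None, "Could not reach server: %s" % err.reason


def fake_protected_site(url):
    # Stand-in for a real server (an assumption for this demo): always answers 401.
    raise urllib.error.HTTPError(url, 401, "Unauthorized", Message(), None)


body, problem = fetch_or_report("http://www.lotus.com/names.nsf",
                                opener=fake_protected_site)
print(problem)  # prints: Authorization required
```

To try this against a real site, call fetch_or_report(url) without the opener argument. HTTPError is caught before URLError because it is a subclass of URLError.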