Thread Pool Spider Issue...

I am running this code:


This is a skeletal working web spider, with virtually no error-checking.
You need to pass in a URL that points to a directory (e.g. ...).

This version uses a thread pool, passing in each URL to be retrieved to
a queue and getting back the list of links in another queue.  It's more
efficient than the brute-force version because threads are re-used and
because there's no polling.

import sys
import string
import urllib
import urlparse
import htmllib
import formatter
from cStringIO import StringIO
import threading
import Queue
import time


def xor(a,b):
    from operator import truth
    return truth(a) ^ truth(b)

class Token:
    def __init__(self, URL=None, shutdown=None):
        if not xor(URL, shutdown):
            raise "Tsk, tsk, need to set either URL or shutdown (not both)"
        self.URL = URL
        self.shutdown = shutdown

class Retriever(threading.Thread):
    def __init__(self, inputQueue, outputQueue):
        threading.Thread.__init__(self)   # initialize the Thread base class
        self.inputQueue = inputQueue
        self.outputQueue = outputQueue

    def run(self):
        while 1:
            token = self.inputQueue.get()
            if token.shutdown:
                break                     # shutdown token: stop this worker
            self.URL = token.URL
            self.getPage()
            self.parse()
            self.outputQueue.put(self.getLinks())

    def getPage(self):
        print "Retrieving:", self.URL
        self.page = urllib.urlopen(self.URL)
        self.body = self.page.read()

    def getLinks(self):
        # Handle relative links
        links = []
        for link in self.parser.anchorlist:
            links.append( urlparse.urljoin(self.URL, link) )
        return links

    def parse(self):
        # We're using the parser just to get the HREFs
        # We should also use it to e.g. respect <META NOFOLLOW>
        w = formatter.DumbWriter(StringIO())
        f = formatter.AbstractFormatter(w)
        self.parser = htmllib.HTMLParser(f)
        self.parser.feed(self.body)
        self.parser.close()

class RetrievePool:
    def __init__(self, numThreads):
        self.retrievePool = []
        self.inputQueue = Queue.Queue()
        self.outputQueue = Queue.Queue()
        for i in range(numThreads):
            retriever = Retriever(self.inputQueue, self.outputQueue)
            retriever.start()
            self.retrievePool.append(retriever)

    def put(self, URL):
        self.inputQueue.put(Token(URL=URL))

    def get(self):
        return self.outputQueue.get()

    def shutdown(self):
        # Send one shutdown token per worker, then wait for them all
        for i in self.retrievePool:
            self.inputQueue.put(Token(shutdown=1))
        for thread in self.retrievePool:
            thread.join()
        self.retrievePool = []

class Spider:
    def __init__(self, startURL, maxThreads):
        self.URLs = []
        self.queue = [startURL]
        self.URLdict = {startURL: 1}
        self.include = startURL
        self.numPagesQueued = 0
        self.retriever = RetrievePool(maxThreads)

    def checkInclude(self, URL):
        return string.find(URL, self.include) == 0

    def run(self):
        self.startPages()
        while self.numPagesQueued > 0:
            self.queueLinks()
            self.startPages()
        self.retriever.shutdown()
        self.URLs = self.URLdict.keys()

    def startPages(self):
        while self.queue:
            URL = self.queue.pop()
            self.retriever.put(URL)
            self.numPagesQueued += 1

    def queueLinks(self):
        links = self.retriever.get()
        self.numPagesQueued -= 1
        self.processLinks(links)

    def processLinks(self, links):
        for link in links:
            print "Checking:", link
            # Make sure this is a new URL and is within the current site
            if ( not self.URLdict.has_key(link) ) and self.checkInclude(link):
                self.URLdict[link] = 1
                self.queue.append(link)

MAX_THREADS = 3   # number of worker threads

if __name__ == '__main__':
    startURL = sys.argv[1]
    spider = Spider(startURL, MAX_THREADS)
    spider.run()
    for URL in spider.URLs:
        print URL

But I am getting this traceback...


Traceback (most recent call last):
  File "C:\Python26\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 325, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\Python26\Scripts\foo3.py", line 155, in <module>
    startURL = sys.argv[1]
IndexError: list index out of range


RE: Thread Pool Spider Issue...

probably because you are running it from the PythonWin editor and are not supplying a URL as an argument?
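To confirm that, you can guard the entry point so the script prints a usage message instead of raising IndexError when it is run without an argument. A minimal sketch; `get_start_url` is a made-up helper name, not part of the original script:

```python
import sys

def get_start_url(argv):
    # Return the start URL from the argument list, or None if it is missing.
    if len(argv) < 2:
        sys.stderr.write("Usage: python foo3.py <start-URL>\n")
        return None
    return argv[1]
```

Inside the script you would then replace `startURL = sys.argv[1]` with a call to `get_start_url(sys.argv)` and exit early when it returns None.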

RE: Thread Pool Spider Issue...


I went to command prompt:


Python foo3.py http://www.thewebsiteIamlookingup.com

and I still get a traceback pointing to the following:
line 39, in class Token
line 40, in Token -- it reports URL = None, even though I am supplying the URL at the prompt...
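Separately, note that Token raises a plain string; if that raise ever fires under Python 2.6, you get "TypeError: exceptions must be old-style classes or derived from BaseException" rather than the intended message. A sketch of the same check with a real exception (written to run on both Python 2 and 3; the inline bool comparison stands in for the module's xor helper):

```python
class Token:
    """Carries either a URL to fetch or a shutdown flag -- never both."""
    def __init__(self, URL=None, shutdown=None):
        # Exactly one of URL / shutdown must be truthy
        if bool(URL) == bool(shutdown):
            raise ValueError("need to set either URL or shutdown (not both)")
        self.URL = URL
        self.shutdown = shutdown
```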
