Need print out to go to *.txt file instead of screen


(OP)
Need help with printing the results to a *.txt file.

Would I need to use the writelines() method?
or
Would I need to use the f.write(string) method?

Any help would be appreciated.

CODE

#================================#
#File Name: Crawler.py                 
#Description: Spider with html parser; title and keywords
#Creator: unknown
#================================#

import sys
import re
import urllib2
import urlparse
tocrawl = set([sys.argv[1]])
crawled = set([])
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

while 1:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        raise StopIteration
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except:
        continue
    msg = response.read()
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            print title
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        print keywordlist
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in (links.pop(0) for _ in xrange(len(links))):
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
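
On the write() versus writelines() question above: either method can do the job, since both take strings and neither appends a newline. The following is only a minimal sketch. The helper names save_results and save_results_bulk and the file name crawl_results.txt are made up for illustration, and the code should run unchanged on Python 2.6+ and Python 3.

```python
import os
import tempfile


def save_results(path, lines):
    """Write each result on its own line using write()."""
    with open(path, "w") as f:
        for line in lines:
            # write() takes a single string and does not add a newline
            f.write(line + "\n")


def save_results_bulk(path, lines):
    """Produce the same file using writelines()."""
    with open(path, "w") as f:
        # writelines() takes an iterable of strings and adds no newlines either
        f.writelines(line + "\n" for line in lines)


# Hypothetical crawl output, written to a file in the temp directory
results = ["http://example.com/", "Example Domain"]
path = os.path.join(tempfile.gettempdir(), "crawl_results.txt")
save_results(path, results)
```

The with statement closes the file even if an exception occurs while writing, so no explicit close() call is needed.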

RE: Need print out to go to *.txt file instead of screen

Hi

I would prefer to be able to choose whether to write to file or the standard output.

CODE

import sys
import getopt

outfile='-'

opt,arg=getopt.getopt(sys.argv[1:],'o:')

for key,val in opt:
  if key=='-o':
    outfile=val

tocrawl=arg

if outfile=='-':
  out=sys.stdout
else:
  out=open(outfile,'w')

out.write("I'm just writing.\n")
out.write("I don't care where.\n")

# your crawling would come here

if outfile!='-':
  out.close()

Sample usage:
  Crawler.py http://tek-tips.com/ # print to standard output
  Crawler.py -o - http://tek-tips.com/ # print to standard output
  Crawler.py -o writehere.txt http://tek-tips.com/ # print to file writehere.txt

Note that I suggest doing some proper option parsing instead of tocrawl = set([sys.argv[1]]). My code is just a sample, kept simple.
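
As one possible way to do the option parsing mentioned above, here is a minimal sketch that swaps getopt for the standard argparse module. This is a substitution, not what the posted code uses. The function names parse_args and open_output and the long option --output are assumptions made for illustration.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the -o option and the start URLs from an argument list."""
    parser = argparse.ArgumentParser(
        description="Crawl pages and report title and keywords.")
    parser.add_argument("-o", "--output", default="-",
                        help="output file, or - for standard output")
    parser.add_argument("urls", nargs="+", help="one or more start URLs")
    return parser.parse_args(argv)


def open_output(name):
    """Return sys.stdout for '-', otherwise a file opened for writing."""
    return sys.stdout if name == "-" else open(name, "w")


# Demonstration with an explicit argument list; a real script would
# pass sys.argv[1:] instead.
args = parse_args(["http://tek-tips.com/"])
out = open_output(args.output)
try:
    out.write("start URLs: %s\n" % ", ".join(args.urls))
finally:
    if out is not sys.stdout:
        out.close()
```

Compared with getopt, argparse generates the usage message itself and reports an error when no URL is given.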

Note that your regular expressions are a bit naive. You are supposing that
  • tags contain no line wraps
  • tags, attributes and values are written all lowercase
  • attribute values are always surrounded with quotes
  • attribute values do not contain quotes
  • all documents are XHTML
  • title has no attributes
  • meta's first attribute is name and the second is content
  • meta has no other attributes besides name and content
  • a's first attribute is href
Every one of these assumptions can be violated by a valid document. Moreover, the vast majority of HTML documents are invalid, so they offer even more ways to fail.

Better search for a suitable module to parse HTML.
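
One candidate for such a module is the parser in Python's standard library. The sketch below is written for Python 3, where the module is html.parser (in Python 2 it is named HTMLParser). The class name PageInfoParser and the sample markup are invented for illustration. It is not a complete crawler; it only shows that uppercase tags, reordered attributes, unquoted values and line wraps inside tags stop being a problem.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collects the title, meta keywords and link targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values may be None.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only collected while inside <title>...</title>.
        if self._in_title:
            self.title += data


# Markup that the regular expressions in this thread would miss
sample = """<html><head><TITLE lang="en">Demo page</TITLE>
<meta content="python, crawler" name="Keywords">
</head><body><a class="x"
 href=/about>About</a></body></html>"""

parser = PageInfoParser()
parser.feed(sample)
print(parser.title, parser.keywords, parser.links)
```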
 

Feherke.
http://free.rootshell.be/~feherke/

RE: Need print out to go to *.txt file instead of screen

(OP)
Feherke:

I am running into StopIteration, line 45

If I go with this option in the command prompt:
Crawler.py -o writehere.txt http://tek-tips.com/

Is the code below what you are suggesting?

CODE

##feherke begining code  
import sys
import getopt

outfile='-'

opt,arg=getopt.getopt(sys.argv[1:],'o:')

for key,val in opt:
  if key=='-o':
    outfile=val

tocrawl=arg

if outfile=='-':
  out=sys.stdout
else:
  out=open(outfile,'w')

out.write("I'm just writing.\n")
out.write("I don't care where.\n")

###My crawl as it was

#================================#
#File Name: Crawler.py                 
#Description: Spider with html parser; title and keywords
#Creator: unknown
#================================#

import sys
import re
import urllib2
import urlparse
tocrawl = set([sys.argv[1]])
crawled = set([])
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

while 1:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        raise StopIteration
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except:
        continue
    msg = response.read()
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            print title
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        print keywordlist
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in (links.pop(0) for _ in xrange(len(links))):
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)
##feherke end code                        
if outfile!='-':
  out.close()

 

RE: Need print out to go to *.txt file instead of screen

Hi

At the end I forgot to mention that my code contains its own assignment to tocrawl. That was necessary because the use of getopt changed the situation a bit.

Additionally, tocrawl is now a list, not a set. ( Why a set anyway ? ) So I also changed crawled to a list.

This works for me :

CODE

#================================#
#File Name: Crawler.py
#Description: Spider with html parser; title and keywords
#Creator: unknown
#================================#

import sys
import getopt
import re
import urllib2
import urlparse

outfile='-'

opt,arg=getopt.getopt(sys.argv[1:],'o:')

for key,val in opt:
    if key=='-o':
        outfile=val

tocrawl=arg

if outfile=='-':
    out=sys.stdout
else:
    out=open(outfile,'w')

crawled = []
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

while tocrawl:
    try:
        crawling = tocrawl.pop()
        out.write(crawling+"\n")
    except KeyError:
        raise StopIteration
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except:
        continue
    msg = response.read()
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos+7)
        if endPos != -1:
            title = msg[startPos+7:endPos]
            out.write(title+"\n")
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0]
        keywordlist = keywordlist.split(", ")
        out.write(str(keywordlist)+"\n")  # keywordlist is a list, so convert it before concatenating
    links = linkregex.findall(msg)
    crawled.append(crawling)
    for link in (links.pop(0) for _ in xrange(len(links))):
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.append(link)

if outfile!='-':
    out.close()
Some more notes for your TODO list :
  • check the protocol to not try to follow ftp://, mailto:, javascript: and similar URLs
  • check for base href tag and use it when composing URL from link hrefs
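
Both TODO items can be sketched with the standard library's URL helpers. The code below is written for Python 3, where they live in urllib.parse (in Python 2 the same functions are in the urlparse module). The function name resolve_link, the ALLOWED_SCHEMES tuple and the base_href parameter are assumptions made for illustration. Extracting the base tag's href from the page is left to whatever HTML parsing is in use.

```python
from urllib.parse import urljoin, urlparse

# Only these protocols are followed; ftp://, mailto:, javascript: and
# similar links are skipped.
ALLOWED_SCHEMES = ("http", "https")


def resolve_link(page_url, href, base_href=None):
    """Return an absolute http(s) URL for href, or None if it should be skipped.

    page_url  -- URL of the page the link was found on
    href      -- raw value of the link's href attribute
    base_href -- value of the page's <base href="...">, if it has one
    """
    # A base href, when present, replaces the page URL as the reference point.
    base = urljoin(page_url, base_href) if base_href else page_url
    absolute = urljoin(base, href.strip())
    if urlparse(absolute).scheme not in ALLOWED_SCHEMES:
        return None
    return absolute


page = "http://tek-tips.com/forum/thread.cfm"
for href in ["/about", "page2.html", "mailto:someone@example.com"]:
    print(href, "->", resolve_link(page, href))
```

urljoin also covers the three startswith() branches of the crawler loop, so the manual string concatenation would no longer be needed.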

Feherke.
http://free.rootshell.be/~feherke/

RE: Need print out to go to *.txt file instead of screen

(OP)
Excellent, Feherke. I appreciate your knowledge and the fact that you took the time to educate me a bit more...

I hope we cross paths again.

Any references you could post to address the TODO list you created for me would be mighty helpful...

Best of luck!

RE: Need print out to go to *.txt file instead of screen

Hi

I spent some time with similar tasks, and my conclusion was that using an existing tool is the best way.

For productivity, I would take a look at twill.

For fun, I would try to use a generic approach :

CODE

import re
import urllib2

scriptre=re.compile('<script\\b[\w\W]*?>.*?</script\s*>',re.I)
stylere=re.compile('<style\\b[\w\W]*?>.*?</style\s*>',re.I)
tagre=re.compile('<(\w+)[\w\W]*?>',re.I)
attrre=re.compile('(\w+)=(?:(["\'])(.*?)\\2|(\w*))',re.I)

response=urllib2.urlopen('http://tek-tips.com/')
html=response.read()
html=re.sub(stylere,'',re.sub(scriptre,'',html))

for tag in re.finditer(tagre,html):

  if tag.group(1).lower()=='meta':
    name=content=''
    for attr in re.finditer(attrre,tag.group()):
      if attr.group(1).lower()=='name':
        name=attr.group(3) or attr.group(4)
      elif attr.group(1).lower()=='content':
        content=attr.group(3) or attr.group(4)
    if name.lower()=='keywords':
      print 'keywords\t= '+content

  if tag.group(1).lower()=='a':
    for attr in re.finditer(attrre,tag.group()):
      if attr.group(1).lower()=='href':
        href=attr.group(3) or attr.group(4)
        print 'href\t= '+href

The drawback of this approach is that the tags' innerHTML cannot be obtained. Also, the attribute regular expression currently does not match minimized attributes. But for playing around, I would continue this way.

Feherke.
http://free.rootshell.be/~feherke/

RE: Need print out to go to *.txt file instead of screen

(OP)
Feherke:

Thanks again...wonderful information.  
