Convert HTML to Plain Text

(OP)
This works well in pulling down the HTML/Text from the requested page:

CODE

# -*- coding: utf-8 -*-
# Python

from urllib import urlopen
print urlopen('http://www.fsmb.org').read()

However, I need help converting the output of urlopen().read() to plain text rather than raw HTML.

Your help is appreciated.

RE: Convert HTML to Plain Text

(OP)
JustinEzequiel:

Thanks for the response.

I have downloaded html2text.py. When I run it, a GUI window pops up with the phrase PYTHONWIN. My only options are to fill it out (with what, I don't know), select OK, or select CANCEL.

Any ideas?

RE: Convert HTML to Plain Text

If you open the html2text.py file in your favorite editor, you'll see at the bottom how you can use it:

CODE


if __name__ == "__main__":
    baseurl = ''
    if sys.argv[1:]:
        arg = sys.argv[1]
        if arg.startswith('http://'):
            baseurl = arg
            j = urllib.urlopen(baseurl)
            try:
                from feedparser import _getCharacterEncoding as enc
            except ImportError:
                enc = lambda x, y: ('utf-8', 1)
            text = j.read()
            encoding = enc(j.headers, text)[0]
            if encoding == 'us-ascii': encoding = 'utf-8'
            data = text.decode(encoding)

        else:
            encoding = 'utf8'
            if len(sys.argv) > 2:
                encoding = sys.argv[2]
            data = open(arg, 'r').read().decode(encoding)
    else:
        data = sys.stdin.read().decode('utf8')
    wrapwrite(html2text(data, baseurl))
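
The block above picks its input from the command line. A minimal sketch of that dispatch, written for Python 3, may make the three cases easier to see. The helper name `pick_source` and the file name `page.html` are hypothetical and not part of html2text. With no arguments the script reads standard input, which is likely why an IDE such as PythonWin pops up a window asking for input.

```python
def pick_source(argv):
    """Mirror the branch order of the __main__ block above.

    argv is the full argument list, including the script name at argv[0].
    Returns a (kind, value, encoding) tuple; nothing is opened or downloaded.
    """
    if argv[1:]:
        arg = argv[1]
        if arg.startswith('http://'):
            # URL case: the real script downloads the page and detects its encoding
            return ('url', arg, None)
        # File case: an optional second argument overrides the default encoding
        encoding = argv[2] if len(argv) > 2 else 'utf8'
        return ('file', arg, encoding)
    # No arguments: the real script blocks on sys.stdin.read()
    return ('stdin', None, 'utf8')


print(pick_source(['html2text.py']))                         # ('stdin', None, 'utf8')
print(pick_source(['html2text.py', 'http://www.fsmb.org']))  # ('url', 'http://www.fsmb.org', None)
print(pick_source(['html2text.py', 'page.html', 'latin-1'])) # ('file', 'page.html', 'latin-1')
```

So running the file with a URL as its first argument, rather than with no arguments, should avoid the prompt.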

RE: Convert HTML to Plain Text

CODE

import sys, urllib
from StringIO import StringIO
import html2text

if __name__ == '__main__':
    url = 'http://www.fsmb.org'
    encoding = 'utf-8'
    f = urllib.urlopen(url)
    try: s = f.read()
    finally: f.close()
    ustr = s.decode(encoding)
    b = StringIO()
    old = sys.stdout
    try:
        sys.stdout = b
        html2text.wrapwrite(html2text.html2text(ustr, url))
    finally: sys.stdout = old
    text = b.getvalue()
    b.close()
    print text
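
The post above captures what `wrapwrite` prints by temporarily swapping `sys.stdout` for a `StringIO` buffer. Here is a minimal sketch of the same technique in Python 3, using `contextlib.redirect_stdout` from the standard library. `print_report` and `capture_stdout` are hypothetical names, with `print_report` standing in for any function that prints instead of returning its result.

```python
import io
from contextlib import redirect_stdout


def print_report(lines):
    """Hypothetical stand-in for a function that writes to stdout."""
    for line in lines:
        print(line)


def capture_stdout(func, *args, **kwargs):
    """Run func and return everything it printed as one string."""
    buffer = io.StringIO()
    # redirect_stdout restores the original sys.stdout on exit, even on error
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


text = capture_stdout(print_report, ['Plain', 'text'])
print(repr(text))  # 'Plain\ntext\n'
```

Since the `__main__` block of html2text.py passes the result of `html2text(data, baseurl)` to `wrapwrite`, using the return value of `html2text.html2text(ustr, url)` directly may make the capture step unnecessary.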

RE: Convert HTML to Plain Text

(OP)
This is what I went with (this is a snippet of the whole):

CODE

if __name__ == "__main__":
    baseurl = 'http://www.fsmb.org'
    if sys.argv[1:]:
        arg = sys.argv[1]
        if arg.startswith('http://'):
            baseurl = arg
            j = urllib.urlopen(baseurl)
            try:
                from feedparser import _getCharacterEncoding as enc
            except ImportError:
                enc = lambda x, y: ('utf-8', 1)
            text = j.read()
            encoding = enc(j.headers, text)[0]
            if encoding == 'us-ascii': encoding = 'utf-8'
            data = text.decode(encoding)

        else:
            encoding = 'utf8'
            if len(sys.argv) > 2:
                encoding = sys.argv[2]
            data = open(arg, 'r').read().decode(encoding)
    else:
        data = sys.stdin.read().decode('utf8')
    wrapwrite(html2text(data, baseurl))

I am still being prompted to input some sort of data through a pop-up window. Any ideas? Should I be tweaking other elements of this code?

RE: Convert HTML to Plain Text

(OP)
Justin:

Touchdown...I was a day late and a dollar short. Your code worked wonderfully:

CODE

import sys, urllib
from StringIO import StringIO
import html2text

if __name__ == '__main__':
    url = 'http://www.fsmb.org'
    encoding = 'utf-8'
    f = urllib.urlopen(url)
    try: s = f.read()
    finally: f.close()
    ustr = s.decode(encoding)
    b = StringIO()
    old = sys.stdout
    try:
        sys.stdout = b
        html2text.wrapwrite(html2text.html2text(ustr, url))
    finally: sys.stdout = old
    text = b.getvalue()
    b.close()
    print text

Thank you.

RE: Convert HTML to Plain Text

Try my previous post. Do not modify the html2text.py file; instead, incorporate my post into your own code.
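
For readers on Python 3 without html2text installed, a rough standard-library sketch of the same idea follows. It only strips tags and skips `<script>` and `<style>` content, so it does not reproduce the formatted output that html2text gives. The names `TextExtractor` and `html_to_text` and the sample markup are illustrative, not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return '\n'.join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ('<html><head><style>p {color: red}</style></head>'
          '<body><h1>Title</h1><p>Hello &amp; welcome</p></body></html>')
print(html_to_text(sample))
```

Each text node lands on its own line, so inline markup such as `<b>` splits a sentence across lines. For real pages, html2text as used above is the better fit.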
