converting HTML markup code to text


(OP)
Greetings all,

I'm a Python newbie, so my apologies if this is a silly question.

I was trying to convert HTML markup code to human-readable text. The sample line is taken from the iTunes music library, which is an .xml file.

CODE

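import urllib

# Sample <string> element taken from the iTunes library .xml file
xmlString = "<string>file://localhost/Volumes/DataCenter/iTunes/iTunes%20Music/George%20Michael/Ladies%20&%20Gentlemen_%20The%20Best%20Of%20George%20Michael/Father%20Figure.m4a</string>"

# The closing </string> tag also contains a "/", so the album is the third field from the end
String = xmlString.split("/")
iX = urllib.unquote(String[-3])  # decodes only the %XX escapes
print "Album name: " + iX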

I was expecting to see "Ladies & Gentlemen_ The Best Of George Michael"; instead it returns "Ladies & Gentlemen_ The Best Of George Michael", i.e. it's converting the %XX escapes but not the & and similar codes. Does anyone know what I'm doing wrong?

Thanks in advance for your help. Cheers!!!

RE: converting HTML markup code to text

Hi

Quote (santanudas):

was expecting to see
"Ladies & Gentlemen_ The Best Of George Michael" in stead it returns
"Ladies & Gentlemen_ The Best Of George Michael"
As you can see, those two strings appear identical.

( Note that the TGML parser currently has a bug: it transforms certain character entities. It is tricky : the message appears correctly in the preview, but is altered when it is posted. )

Please post again without previewing.
 

Feherke.
http://free.rootshell.be/~feherke/

RE: converting HTML markup code to text

(OP)
Silly me, I should have realized that the HTML code would be converted into normal characters by the browser anyway.

So, this is the sample line, taken from the .xml file:

CODE

<key>Location</key><string>file://localhost/Volumes/DataCenter/nMedia/mMusic/iTunes/iTunes%20Music/George%20Michael/Ladies%20&+#+38_;%20Gentlemen_%20The%20Best%20Of%20George%20Michael/Jesus%20To%20A%20Child.m4a</string>
(omit the + signs in the red text, between & and ;)
After the conversion, I was expecting to see "Ladies & Gentlemen_ The Best Of George Michael" but I got "Ladies &_#_38_; Gentlemen_ The Best Of George Michael" (again, ignore the plus signs) instead. Did I put it the right way this time? Cheers!!!

RE: converting HTML markup code to text

(OP)
I see, I still didn't get it right. In short: using urllib.unquote(), "&#38;" is not being converted to "&". What am I missing?
(Fingers crossed!!! Hopefully this time it will come up correctly.)

RE: converting HTML markup code to text

(OP)
Can anyone help, please? Is it too tough to do?
Cheers!!!  

RE: converting HTML markup code to text

Hi

That is because urllib.unquote() handles only URL encoding ( those %XX things ). But your string also has character entities ( those &#XX; things ), which have to be handled separately.

Personally I would use the unescape() function from Fredrik Lundh's article, Unescape HTML Entities. Just add the import and def as shown there, then change this line :

CODE --> ( fragment )

iX = unescape(urllib.unquote(String[-3]))
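
A minimal sketch of the two-step decoding, assuming Python 3 rather than the Python 2 used in this thread: there urllib.unquote() lives at urllib.parse.unquote(), and the standard library's html.unescape() stands in for the unescape() function from the article. The helper name decode_path_component and the &#38; written into the sample string are assumptions made for illustration.

```python
from html import unescape          # Python 3 stand-in for the article's unescape()
from urllib.parse import unquote   # Python 3 home of urllib.unquote()


def decode_path_component(component):
    """Decode character entities (&#XX;) and URL escapes (%XX) in one path field."""
    return unquote(unescape(component))


# Sample line as it is assumed to appear in the .xml file, with &#38; for "&"
location = ("<string>file://localhost/Volumes/DataCenter/iTunes/iTunes%20Music/"
            "George%20Michael/Ladies%20&#38;%20Gentlemen_%20The%20Best%20Of%20"
            "George%20Michael/Father%20Figure.m4a</string>")

# The closing </string> tag contains a "/", so the album is the third field from the end
album = decode_path_component(location.split("/")[-3])
print("Album name: " + album)
```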

Feherke.
http://free.rootshell.be/~feherke/

RE: converting HTML markup code to text

(OP)
Hi there,
Thanks for the link. The unescape() function did solve the problem for the "&#XX;" entities, but it is creating a problem for strings like Rai%CC%88 (Raï) or Beyonc%CC%81 (Beyoncé). This is what I get:

CODE

Traceback (most recent call last):
  File "./metadata.py", line 150, in <module>
    artist_dir="%s/%s/%s" % (media_dir, genre, artist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

Any solution to this issue? Cheers!!!

RE: converting HTML markup code to text

(OP)
Just to mention that I already have # -*- coding: ISO-8859-15 -*- added at the beginning of the script but this isn't working. Cheers!!!
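
For what it's worth, the coding comment only tells the interpreter how the script's own source text is encoded; it does not change how data read at run time is decoded. The traceback itself is the kind Python 2 raises when a non-ASCII byte string gets mixed with a unicode string in one expression. A hedged sketch of one way around it, assuming Python 3 (where unquote() returns text and decodes the %XX bytes as UTF-8 by default); the helper name decode_name and the values "/media" and "World" are placeholders for media_dir and genre, not taken from the script:

```python
import unicodedata
from urllib.parse import unquote


def decode_name(component):
    """Percent-decode a path field as UTF-8 and compose combining marks."""
    # "Rai%CC%88" is a plain "i" followed by a combining diaeresis (U+0308)
    text = unquote(component)
    # NFC normalization merges base letter + combining mark into one character
    return unicodedata.normalize("NFC", text)


artist = decode_name("Rai%CC%88")

# Placeholder values standing in for media_dir and genre from the script
artist_dir = "%s/%s/%s" % ("/media", "World", artist)

# ascii() keeps the output printable on terminals that cannot display the accent
print(ascii(artist_dir))
```

Because every piece is text by the time the path is formatted, no implicit ASCII decoding takes place.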

RE: converting HTML markup code to text

Hi

No idea what happens there. Anyway, reversing the function calls seems to solve something here ( not sure if this is your problem too ) :

CODE --> ( fragment )

iX = urllib.unquote(unescape(String[-3]))
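
Whether the order of the two calls matters depends on the input. A small sketch, assuming Python 3 (html.unescape() and urllib.parse.unquote() in place of the Python 2 functions used above) and a made-up file name AT%26amp%3BT, shows where the two orders diverge:

```python
from html import unescape
from urllib.parse import unquote

# For an album name like the one in this thread, both orders give the same text
sample = "Ladies%20&#38;%20Gentlemen"
entities_first = unquote(unescape(sample))
percent_first = unescape(unquote(sample))

# Hypothetical name whose percent-encoded part itself spells out "&amp;"
tricky = "AT%26amp%3BT"
tricky_entities_first = unquote(unescape(tricky))  # literal "&amp;" survives
tricky_percent_first = unescape(unquote(tricky))   # literal is decoded a second time

print(entities_first == percent_first)
print(tricky_entities_first, tricky_percent_first)
```

Since the entities come from the XML layer wrapped around the URL, undoing them first looks like the safer order for this kind of file, though for ordinary names either order gives the same result.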

Feherke.
http://free.rootshell.be/~feherke/

RE: converting HTML markup code to text

(OP)
Hi feherke,
I already tried reversing the function calls as you said; needless to say, that didn't fix the problem here. Cheers!!!

RE: converting HTML markup code to text

Hi

Sorry, I have no idea. Character encoding is my weak point, regardless of the language and/or environment.

Feherke.
http://free.rootshell.be/~feherke/

RE: converting HTML markup code to text

(OP)
No problem feherke, at least you tried to help. Many thanks for that. Cheers!!!
