converting HTML markup code to text

santanudas · Jan 29, 2010

Greetings all,

I'm a Python newbie, so my apologies if my question is a silly one.

I was trying to convert HTML markup code to human-readable text. The sample line is taken from the iTunes music library, which is an .xml file.

Code:

import urllib  # Python 2; in Python 3 the function is urllib.parse.unquote

xmlString = "<string>file://localhost/Volumes/DataCenter/iTunes/iTunes%20Music/George%20Michael/Ladies%20&%20Gentlemen_%20The%20Best%20Of%20George%20Michael/Father%20Figure.m4a</string>"

# The closing </string> tag also contains a "/", so the album folder is third from the end.
String = xmlString.split("/")
iX = urllib.unquote(String[-3])
print "Album name: " + iX

I was expecting to see "Ladies & Gentlemen_ The Best Of George Michael"; instead it returns "Ladies & Gentlemen_ The Best Of George Michael", i.e. it's converting the %* thing but not the & and stuff like that. Does anyone know what I'm doing wrong?

Thanks in advance for your help. Cheers!!!

feherke · Jan 29, 2010

Hi

santanudas said:
was expecting to see
"Ladies & Gentlemen_ The Best Of George Michael" instead it returns
"Ladies & Gentlemen_ The Best Of George Michael"

As you can see, those two strings appear identical.

( Note that the TGML parser currently has a bug: it transforms certain character entities. It is tricky: the message appears correctly in the preview, but is altered afterwards. )

Please post again without previewing.

Feherke.

http://free.rootshell.be/~feherke/

santanudas · Jan 29, 2010

Silly me, I should have realized the HTML code would be converted into normal characters by the browser anyway.

So, this is the sample line, taken from the .xml file:

Code:

<key>Location</key><string>file://localhost/Volumes/DataCenter/nMedia/mMusic/iTunes/iTunes%20Music/George%20Michael/Ladies%20&+#+38_;%20Gentlemen_%20The%20Best%20Of%20George%20Michael/Jesus%20To%20A%20Child.m4a</string>

(omit the + signs in the entity, i.e. the characters that follow the & after "Ladies%20")

After the conversion, I was expecting to see "Ladies & Gentlemen_ The Best Of George Michael" but I got "Ladies &_#_38_; Gentlemen_ The Best Of George Michael" (again, ignore the plus signs) instead. Did I put it the right way this time? Cheers!!!

santanudas · Jan 29, 2010

I see I still didn't get it right. In short: using urllib.unquote(), "&#38;" is not being converted to "&". What am I missing?
(Fingers crossed!!! Hopefully this time it will come up correctly.)

santanudas · Jan 30, 2010

Can anyone help, please? Is it too tough to do?
Cheers!!!

feherke · Jan 31, 2010

Hi

That is because urllib.unquote() handles only URL encoding ( those %XX things ). But your string also has character entities ( those &#XX; things ), which have to be handled separately.

Personally I would use the unescape() function from Fredrik Lundh's article, Unescape HTML Entities. Just add the import and def as shown there, then change this line:

Code:

iX = unescape(urllib.unquote(String[-3]))
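
For readers on Python 3, the same two-step decoding can be sketched with the standard library alone. This is only a minimal sketch, not the code from the article: html.unescape stands in for the unescape() function mentioned above, urllib.parse.unquote replaces urllib.unquote, and the helper name album_from_location is made up for illustration.

```python
import html
from urllib.parse import unquote


def album_from_location(xml_line):
    """Return the decoded album folder name from an iTunes Location line.

    Hypothetical helper: it assumes the line ends with </string> and that
    the album folder is the second-to-last path component.
    """
    # The "/" in the closing </string> tag adds one extra piece to the split,
    # so the album folder sits third from the end, as in the original code.
    parts = xml_line.split("/")
    # unquote() handles the %XX escapes, html.unescape() the &#NN; entities.
    return html.unescape(unquote(parts[-3]))


line = ("<string>file://localhost/Volumes/DataCenter/iTunes/iTunes%20Music/"
        "George%20Michael/Ladies%20&#38;%20Gentlemen_%20The%20Best%20Of%20"
        "George%20Michael/Father%20Figure.m4a</string>")
print("Album name: " + album_from_location(line))
```

Splitting on "/" is kept here only to stay close to the original post; it would break if a folder name were stored differently in the file.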

Feherke.

santanudas · Feb 1, 2010

Hi there,
Thanks for the link. The "unescape" did solve the problem for the "&#XX;" entities but it creates a problem for strings like Rai%CC%88 (Raï) or Beyonc%CC%81 (Beyoncé). This is what I get:

Code:

Traceback (most recent call last):
  File "./metadata.py", line 150, in <module>
    artist_dir="%s/%s/%s" % (media_dir, genre, artist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

Any solution to this issue? Cheers!!!

santanudas · Feb 1, 2010

Just to mention that I already have # -*- coding: ISO-8859-15 -*- at the beginning of the script, but it doesn't help. Cheers!!!

feherke · Feb 1, 2010

Hi

No idea what happens there. Anyway, reversing the function calls seems to solve something here ( not sure if this is your problem too ) :

Code:

iX = urllib.unquote(unescape(String[-3]))
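
The order of the two calls can matter. Since the URL is stored inside XML, the entity escaping is the outer layer, so undoing it first and the percent-encoding second is arguably the safer order. The sketch below (Python 3, with html.unescape standing in for unescape(), and an invented folder name) shows a case where the two orders disagree.

```python
import html
from urllib.parse import unquote

# Invented example: a folder literally named "Q&#38;A", percent-encoded in a URL.
# "&" becomes %26, "#" becomes %23 and ";" becomes %3B.
segment = "Q%26%2338%3BA"

# Entities first, then percent-escapes: the literal folder name survives.
entities_first = unquote(html.unescape(segment))

# Percent-escapes first: the decoded text now looks like an entity and
# gets converted a second time.
percent_first = html.unescape(unquote(segment))

print(entities_first)
print(percent_first)
```

For the album name in this thread both orders give the same text, which fits the report that swapping the calls does not change the Unicode error.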

Feherke.

santanudas · Feb 1, 2010

Hi feherke,
I already tried reversing the function calls as you said; needless to say, that didn't fix the problem here. Cheers!!!

feherke · Feb 1, 2010

Hi

Sorry, I have no idea. Character encoding is my weak point, regardless of the language and/or environment.

Feherke.

santanudas · Feb 2, 2010

No problem feherke, at least you tried to help. Many thanks for that. Cheers!!!
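
The thread ends without a fix for the UnicodeDecodeError. One plausible cause, which nothing in the thread confirms, is that in Python 2 urllib.unquote() returns a byte string, while an unescape() that builds characters with unichr() returns a unicode object, so a "%s" format mixing the two has to decode the bytes implicitly as ASCII. The coding line at the top of a script only tells Python how to read the source file itself; it does not change how data read at run time is decoded. A minimal Python 3 sketch of the decoding step follows, where the helper name decode_name and the "Music" prefix are made up for illustration.

```python
import html
import unicodedata
from urllib.parse import unquote


def decode_name(segment):
    """Decode one path segment to text (hypothetical helper)."""
    # unquote() treats the %XX bytes as UTF-8 by default and returns str,
    # so no byte strings are left to trip the ASCII codec later on.
    text = html.unescape(unquote(segment))
    # %CC%88 is a combining diaeresis (U+0308); NFC folds "i" + U+0308
    # into the single character U+00EF.
    return unicodedata.normalize("NFC", text)


artist = decode_name("Rai%CC%88")
print("%s/%s" % ("Music", artist))
```

In Python 2 the comparable step, assuming the input is a byte string, would be to call .decode("utf-8") on the result of urllib.unquote() before mixing it with unicode strings.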
