Help: XML parsing in Japanese, with text replace filter

serathigeos · Apr 18, 2006

Hello,

I need to extract information from "rows" in an XML file.
<row id="LABEL0">
<tag a="Japanese" b="0" c="0">Japanese Text0</tag>
<tag a="English A" b="0" c="0">English Text A0</tag>
<tag a="English B" b="0" c="0">English Text B0</tag>
</row>

English Text A contains Japanese special characters that need to be converted. From the command line I could type:
perl -pi~ -e 's/old/new/g' %s
The only problem is that Cygwin doesn't like non-ASCII characters.

I need to output the row id, followed by the filtered English Text A.
LABEL0
English Text A0
LABEL1
English Text A1

To complicate things, rows can also indicate new output files:
<row id="file:newFile.txt">

The overall format is:
OUTPUT FILE
DATA
DATA
...
DATA
OUTPUT FILE
...

Do I want to be using Perl for all of these tasks?
Do I want to be using Perl at all?
What is the easiest way to extract data from an XML file that contains Japanese text (SJIS)?
What is the easiest way to handle text replace (SJIS)?
Are there any pitfalls I should look out for?
Is there a better place for me to ask these questions?

-Brendan

TrojanWarBlade · Apr 20, 2006

Perl looks to me like a good tool to use for this job.
If you want to avoid non-ASCII characters, you can specify their hex equivalents in your regex using \xHH constructs.
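As a rough sketch of the hex-escape idea: the thread does not say which characters need converting, so the mapping below (full-width Latin letters and digits folded to plain ASCII) and the helper name filter_text are purely hypothetical. The example decodes the Shift_JIS bytes with the core Encode module and then matches on code points with \x{HHHH}. That keeps the script itself pure ASCII and avoids one pitfall of raw \xHH byte patterns on SJIS data, where a two-byte pattern can accidentally match across the boundary between two characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical filter: fold full-width ASCII variants (U+FF01..U+FF5E)
# to plain ASCII and the ideographic space (U+3000) to an ordinary space.
sub filter_text {
    my ($sjis_bytes) = @_;
    my $text = decode('shiftjis', $sjis_bytes);    # bytes -> characters
    # Full-width forms sit exactly 0xFEE0 above their ASCII counterparts.
    $text =~ s/([\x{FF01}-\x{FF5E}])/chr(ord($1) - 0xFEE0)/ge;
    $text =~ s/\x{3000}/ /g;
    return encode('shiftjis', $text);              # characters -> bytes
}

# Full-width "ABC", an ideographic space, and a full-width "1",
# written as Shift_JIS byte escapes so this file stays ASCII-only.
my $in = "\x82\x60\x82\x61\x82\x62\x81\x40\x82\x50";
print filter_text($in), "\n";    # prints "ABC 1"
```

The same substitutions could be applied to each extracted string before it is written out, instead of running a separate perl -pi pass over the file.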
There are a number of XML modules that you could try, but I seem to be doing reasonably well with XML::Mini at the moment, so you might like to try that.
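For illustration only, here is a sketch of the extraction step. It uses XML::LibXML rather than XML::Mini, simply because its XPath interface is compact; that swap is an assumption of this example and not something recommended in the thread. It also assumes the rows sit under a single root element (called rows here, a made-up name), that a "file:" row is an empty element, and that a real SJIS input file carries an encoding="Shift_JIS" declaration so the parser can decode it. The helper name extract_rows is hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample input; the <rows> wrapper and the empty "file:" row are assumptions.
my $xml = <<'XML';
<rows>
  <row id="file:newFile.txt"/>
  <row id="LABEL0">
    <tag a="Japanese" b="0" c="0">Japanese Text0</tag>
    <tag a="English A" b="0" c="0">English Text A0</tag>
    <tag a="English B" b="0" c="0">English Text B0</tag>
  </row>
</rows>
XML

# Returns a list of [kind, value] pairs:
#   ['file', name]        for rows whose id starts with "file:"
#   ['data', "id\ntext"]  for ordinary rows (id, then the "English A" text)
sub extract_rows {
    my ($string) = @_;
    my $doc = XML::LibXML->load_xml(string => $string);
    my @out;
    for my $row ($doc->findnodes('//row')) {
        my $id = $row->getAttribute('id') // '';
        if ($id =~ /^file:(.+)$/) {
            push @out, ['file', $1];
            next;
        }
        # findvalue returns the text of the matching <tag>, or '' if none.
        my $text = $row->findvalue('tag[@a="English A"]');
        push @out, ['data', "$id\n$text"];
    }
    return @out;
}

for my $item (extract_rows($xml)) {
    my ($kind, $value) = @$item;
    print $kind eq 'file' ? "-- switch output to $value\n" : "$value\n";
}
```

In a real script the 'file' entries would close the current output handle and open the named file, and each 'data' value would be passed through the text filter before printing.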

Trojan.
