Help: XML parsing in Japanese, with text replace filter

Status
Not open for further replies.

serathigeos

Programmer
Apr 18, 2006
2
JP
Hello,

I need to extract information from "rows" in an XML file.
<row id="LABEL0">
<tag a="Japanese" b="0" c="0">Japanese Text0</tag>
<tag a="English A" b="0" c="0">English Text A0</tag>
<tag a="English B" b="0" c="0">English Text B0</tag>
</row>

English Text A contains Japanese special characters that need to be converted. From the command line I could type:
perl -pi~ -e 's/old/new/g' %s
The only problem is that Cygwin doesn't accept non-ASCII characters on the command line.

I need to output the row id, followed by the filtered English Text A.
LABEL0
English Text A0
LABEL1
English Text A1

To complicate things, rows can also indicate new output files:
<row id="file:newFile.txt">

The overall format is:
OUTPUT FILE
DATA
DATA
...
DATA
OUTPUT FILE
...

Do I want to be using perl for all of these tasks?
Do I want to be using perl at all?
What is the easiest way to extract data from an XML file that contains Japanese text (SJIS)?
What is the easiest way to handle text replace (SJIS)?
Are there any pitfalls I should look out for?
Is there a better place for me to ask these questions?

-Brendan
 
Perl looks to me like a good tool to use for this job.
If you want to avoid non-ASCII characters, you can specify their hex equivalents in your regex using \xHH constructs.
There are a number of XML modules that you could try. I am doing reasonably well with XML::Mini at the moment, so you might like to start with that.


Trojan.
 