
having a problem searching

Status
Not open for further replies.

Jiggerman

Programmer
Sep 5, 2002
62
GB
Hey folks,

I was wondering whether folks could give me a little advice on splitting and searching.

I'm trying to parse XML files. Basically, all I want to do is strip away all the tags. Pretty simple, I thought, but I'm having major problems with it.

I thought that going through the XML file line by line and removing the tags with a substitution operator would work, so
Code:
$line = s/<.+>//;
but that ended up removing just about everything, so next I thought I'll try and split it up by the tags so
Code:
@linesplit = split(<*>, $line);
or
Code:
@linesplit = split(<.+>, $line);
But these little snippets split every character into the array.

I'm still having no joy, so I assume I'm doing something wrong with the syntax of the search.

I know that this is something super simple, so I know you guys will solve the problem in 5 minutes and then laugh at me for the five minutes after that.

Thanks a lot
 
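Instead of:

$line = s/<.+>//;

do:

$line =~ s/<.+?>//;

The first uses a 'greedy' quantifier that grabs the longest string possible. The second uses a non-greedy quantifier that only takes the shortest match. Note the binding operator =~ as well: with a plain =, the substitution runs against $_ and $line just receives its return value.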
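
To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample line is made up for illustration and is not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original post.
my $line = '<name>Widget</name><price>9.99</price>';

# Greedy: .+ runs from the first '<' to the last '>' on the line.
(my $greedy = $line) =~ s/<.+>//;

# Non-greedy: .+? stops at the first '>' it can reach.
(my $lazy = $line) =~ s/<.+?>//;

print "greedy: '$greedy'\n";
print "lazy:   '$lazy'\n";
```

On this input the greedy version leaves nothing behind, while the non-greedy version removes only the first tag.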

Why not just use one of the many many many CPAN modules to manage XML nicely?

 
I'm sorry to say that that doesn't seem to work either.

I didn't do any research into the CPAN modules, because I was hoping to be able to evaluate the tags that I wanted to and ignore the rest. I imagine that a CPAN module might work though. I'll give it a try. Thanks.
 
The problem could be that you don't have a global modifier on the regex:
$line =~ s/<.+?>//g;
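
As a rough illustration of what the global modifier changes, the sketch below runs the same substitution with and without /g. The sample line is hypothetical, and the negated character class at the end is a common alternative rather than something suggested in this thread.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original post.
my $line = '<item><name>Widget</name> <price>9.99</price></item>';

# Without /g only the first tag is removed.
(my $once = $line) =~ s/<.+?>//;

# With /g every tag is removed; =~ binds the substitution to the variable.
(my $all = $line) =~ s/<.+?>//g;

# A negated character class is a common alternative to the non-greedy dot.
(my $alt = $line) =~ s/<[^>]+>//g;

print "$once\n$all\n$alt\n";
```

Only the /g versions leave the bare text "Widget 9.99".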

This is what I've done in the past..
Code:
open(INP, "input.xml") or die "Cannot open input.xml: $!";

undef $/;                  # undefine end of record so the read slurps the whole file
$inp = <INP>;
$/ = "\n";                 # redefine it

$inp =~ s/[\x0d\x0a]//g;   # strip any renegade cr/lf's

$inp =~ s/<.+?>//g;        # remove every tag, non-greedily

close INP;

Getting the whole file into a string with no linefeeds or carriage returns makes it easier to kill the tags that span more than one line, such as comments (by default . does not match a newline, so a line-by-line regex never sees them whole):

<!--
if (window.top != window) {
window.top.location.replace (window.location.href);
}
-->
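
A variation on the same idea, offered as a sketch only: instead of deleting every CR/LF first, the /s modifier lets . match newlines. The input string here is hypothetical, and an in-memory filehandle stands in for input.xml so the snippet runs on its own. A comment that itself contains a > would still be cut short, which is one reason a real XML parser is safer.

```perl
use strict;
use warnings;

# Hypothetical input; an in-memory filehandle stands in for input.xml.
my $xml = "<doc>\n<!--\na comment\nspanning lines\n-->\n<title>Hello</title>\n</doc>\n";
open(my $fh, '<', \$xml) or die "Cannot open in-memory file: $!";

my $text = do {
    local $/;    # slurp mode, restored automatically at the end of the block
    <$fh>;
};
close $fh;

# /s lets '.' match newlines, so tags and comments that span lines
# are removed without first deleting every CR/LF.
$text =~ s/<.+?>//gs;

print $text;
```

Using local $/ inside a do block avoids having to remember to reset the record separator by hand.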
 
If it's just XML you're parsing, CPAN is the way to travel

--Paul
 