having a problem searching

Jiggerman · Oct 23, 2003

Hey folks,

I was wondering whether folk could give me a little advice on splitting and searching.

I'm trying to parse XML files. Basically, all I want to do is strip away all the tags. Pretty simple, I thought, but I'm having major problems with it.

I thought that searching through the XML file line by line and removing the tags with a substitution operator would work, so

Code:

$line = s/<.+>//;

but that ended up removing just about everything, so next I thought I'll try and split it up by the tags so

Code:

@linesplit = split(<*>, $line);

or

Code:

@linesplit = split(<.+>, $line);

But these little snippets split every character into the array.

I'm still having no joy in the situation. I assume I'm doing something wrong with the syntax of the search.

I know that this is something super simple, so I know you guys will solve the problem in five minutes and then laugh at me for the five minutes after that.

Thanks a lot

siberian · Oct 23, 2003

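Instead of:

$line = s/<.+>//;

do:

$line =~ s/<.+?>//;

The first uses a 'greedy' quantifier that grabs the longest string possible. The second uses a non-greedy quantifier that matches the shortest string possible. Also note the =~ binding operator: with a plain =, the substitution runs on $_ and $line only receives its return value.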
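To illustrate the greedy versus non-greedy difference, here is a minimal sketch. The sample line and the variable names are made up for the example and are not from the original poster's file.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original XML file.
my $sample = '<name>Widget</name>';

(my $greedy = $sample) =~ s/<.+>//;     # .+ runs on to the last '>' in the line
(my $lazy   = $sample) =~ s/<.+?>//;    # .+? stops at the first '>'
(my $all    = $sample) =~ s/<.+?>//g;   # /g repeats the match for every tag

print "greedy: '$greedy'\n";   # the whole line is removed
print "lazy:   '$lazy'\n";     # only the opening tag is removed
print "all:    '$all'\n";      # both tags are removed, the text is kept
```

The greedy form matches from the first `<` to the last `>`, which is why it removes just about everything on a line that both opens and closes a tag.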

Why not just use one of the many many many CPAN modules to manage XML nicely?

Jiggerman · Oct 25, 2003

I'm sorry to say that that doesn't seem to work either.

I didn't do any research into the CPAN modules, because I was hoping to be able to evaluate the tags that I wanted to and ignore the rest. I imagine that a CPAN module might work though. I'll give it a try. Thanks.

chazoid · Oct 25, 2003

The problem could be that you don't have a global modifier on the regex (note also that the binding operator is =~, not =):
$line =~ s/<.+?>//g;

This is what I've done in the past.

Code:

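open(INP, "<", "input.xml") or die "Cannot open input.xml: $!";

undef $/;       # undefine the record separator so the whole file is read at once
$inp = <INP>;
$/ = "\n";      # restore it

$inp =~ s/[\x0d\x0a]//g;   # strip any stray CR/LF characters

$inp =~ s/<.+?>//g;        # remove every tag (non-greedy, global)

close INP;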

Getting the whole file into a string with no linefeeds or carriage returns makes it easier to kill the tags that span more than one line, such as comments.
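As a rough sketch of why this helps with multi-line tags, the example below runs the same two substitutions on a small, made-up document. The sample string and variable names are assumptions for illustration, and an in-memory filehandle stands in for input.xml so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical input standing in for input.xml; the comment spans two lines.
my $xml = "<item>\n<!-- a comment\nover two lines -->\n<name>Widget</name>\n</item>\n";

# Read from an in-memory filehandle so no file on disk is needed.
open(my $fh, '<', \$xml) or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$fh> };   # local $/ slurps, then restores itself
close $fh;

$text =~ s/[\x0d\x0a]//g;   # drop CR/LF so no tag spans a line break
$text =~ s/<.+?>//g;        # remove every tag, including the comment

print "$text\n";
```

Using `local $/` inside a `do` block is one way to avoid resetting `$/` by hand. A comment that itself contains a `>` would still be cut short by this pattern, which is one more argument for a real XML parser.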

PaulTEG · Oct 25, 2003

If it's just XML you're parsing, CPAN is the way to travel

--Paul
