parsing html

Status
Not open for further replies.

dinger2121

Programmer
Sep 11, 2007
439
US
Hello,
I am trying to parse this bit of HTML -

<span class="hpPageText" >LEGAL</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >151</span></td>

using this line -

m!\<span class="hpPageText" >$field\</span>\</td>(\S*)\<td headers="col2_1" style="width:13%; text-align:right" >(\S*)\<span class="hpPageText" >(.+?)</span>/</td>!is)


The script is not currently finding anything. Can anyone see where I might be off?

Thanks
 
Parsing HTML with regexps is rarely a good idea, especially if you're not entirely comfortable with them (you're preceding the '<' character with a backslash even though it has no special meaning within a regexp, and you're using \S, which matches non-whitespace, instead of \s to match whitespace).
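
For reference, here is a rough sketch of what the posted pattern might look like with those two problems fixed, plus the stray '/' before the final </td> removed. The value of $field is an assumption (the post never shows it), and this is still a regexp, so it stays fragile against markup changes.

```perl
use strict;
use warnings;

# The fragment from the question, unchanged.
my $html = <<'HTML';
<span class="hpPageText" >LEGAL</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >151</span></td>
HTML

my $field = 'LEGAL';   # assumed label; not shown in the original post

# /x lets the pattern be spread out; \s* (lowercase) matches optional
# whitespace between tags, and '<' needs no backslash.
my ($value) = $html =~ m{
    <span\s+class="hpPageText"\s*> \Q$field\E </span> \s* </td>
    \s* <td\s+headers="col2_1"[^>]*>
    \s* <span\s+class="hpPageText"\s*> (.+?) </span> \s* </td>
}isx;

print defined $value ? "$value\n" : "no match\n";   # prints 151
```

Using [^>]* for the rest of the <td> tag means a change to the style attribute would not break the match, but any reordering of attributes still would.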

Have a look on CPAN for HTML::TokeParser or HTML::TokeParser::Simple for parsing HTML. Those will be far more robust to minor future changes in the HTML (what if they change the width to 14% instead?); a change like that would likely break your regexp.
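
A minimal sketch of the HTML::TokeParser route, assuming the module is installed from CPAN and that the value you want always sits in the next hpPageText span after the label. The label 'LEGAL' in $field is an assumption, and the width is set to 14% here on purpose to show that the style attribute no longer matters.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Fragment from the question, with the width changed to 14% as a test.
my $html = <<'HTML';
<span class="hpPageText" >LEGAL</span></td>
<td headers="col2_1" style="width:14%; text-align:right" >
<span class="hpPageText" >151</span></td>
HTML

my $field = 'LEGAL';   # assumed label
my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser";

my ($value, $seen_label);
# get_tag('span') skips ahead to each <span> start tag in turn.
while (my $tag = $p->get_tag('span')) {
    next unless ($tag->[1]{class} // '') eq 'hpPageText';
    my $text = $p->get_trimmed_text('/span');   # text up to </span>
    if ($seen_label) {          # first matching span after the label
        $value = $text;
        last;
    }
    $seen_label = 1 if $text eq $field;
}

print defined $value ? "$value\n" : "not found\n";
```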
 
Thank you. I will look at HTML::TokeParser::Simple.
I am new to Perl, just trying some things out.

Thanks again
 
I would like to quickly explain what I am trying to accomplish, in the hope that someone can confirm that HTML::TokeParser::Simple is the right tool.

I have a page that has multiple sections like the following -

<td headers="col1_1" style="width:21%" >
<span class="PageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="PageText" >4,889</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="PageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
<span class="PageText" ></span></td>
<td headers="col5_1" style="width:13%; text-align:right" >
<span class="PageText" ></span></td>
<td headers="col6_1" style="width:13%; text-align:right" >
<span class="PageText" >4889.0</span></td>
</tr>

I need to extract the number (in this case the 4,889) from each table row where the first cell (in this case LETTER) equals one of two values. I will then write that number to a text file.
Can anyone suggest a better method to accomplish this?

Thanks again
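
One possible way to do this with HTML::TokeParser::Simple is sketched below. It assumes the module is installed from CPAN, that the label is always the first <td> of a row and the number the second, and that every row ends with </tr> as in the sample. The labels in %wanted are placeholders (only LETTER appears in the post), and the sample is cut down to three cells.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Shortened version of the row from the question.
my $html = <<'HTML';
<td headers="col1_1" style="width:21%" >
<span class="PageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="PageText" >4,889</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="PageText" >1.0</span></td>
</tr>
HTML

my %wanted = map { $_ => 1 } qw(LETTER LEGAL);   # assumed labels

my $p = HTML::TokeParser::Simple->new(string => $html);

my (%count, $label, $text);
my $cell = 0;                        # position of the current <td> in its row
while (my $token = $p->get_token) {
    if ($token->is_start_tag('td')) {
        $cell++;
        $text = '';                  # start collecting this cell's text
    }
    elsif ($token->is_text && defined $text) {
        $text .= $token->as_is;
    }
    elsif ($token->is_end_tag('td') && defined $text) {
        $text =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
        if ($cell == 1) {
            $label = $text;
        }
        elsif ($cell == 2 && defined $label && $wanted{$label}) {
            $count{$label} = $text;
        }
        undef $text;
    }
    elsif ($token->is_end_tag('tr')) {
        $cell = 0;                   # next <td> starts a new row
        undef $label;
    }
}

# Printed to STDOUT here; to write a text file, print to a handle
# opened with open(my $fh, '>', $filename) instead.
print "$_\t$count{$_}\n" for sort keys %count;
```

The value is kept as the string found in the page, so 4,889 still has its comma; strip it with tr/,//d if a plain number is needed.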
 