RegEx help needed, I give up with Parsers 2

Status
Not open for further replies.

1DMF

Programmer
Jan 18, 2005
8,795
GB
Hi,

I'm wasting too much time trying to understand HTML parsers when all I want to do is something really simple.

So can someone help me with the regex, please?

I would like to strip everything up to and including
<h2>Basic details for:</h2>
from an HTML string,

and then everything from (and including) the following, up to the end of the file:
<!-- MIFID Changes -->

That will leave me a small enough piece of HTML code I can manually get what I want from.

Many thanks.
1DMF.

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"
 
Here you go :)
Code:
my $p = new HTML::TokeParser::Simple( \$page_source );

my $wanted = 0;
while( my $t = $p->get_token ) {
   # start with a <h2> tag followed by "Basic details for:"
   $wanted = 1 if ( $t->is_start_tag( 'h2' ) && $p->peek eq 'Basic details for:$

   # stop at <!-- MIFID Changes --> comment
   $wanted = 0 if ( $t->is_comment && $t->as_is eq '<!-- MIFID Changes -->' );

   # print if $wanted is true
   print $t->as_is if $wanted;
}
 
thanks ishnid, only it produces
syntax error at my.cgi line 100, near "-->"
(Might be a runaway multi-line '' string starting on line 97)
Can't find string terminator "'" anywhere before EOF

so I altered the line to this
# start with a <h2> tag followed by "Basic details for:"
$wanted = 1 if ( $t->is_start_tag( 'h2' ) && $p->peek eq 'Basic details for:');
and now i get this
Can't call method "get_token" on an undefined value

what have we done wrong?


 
Sorry, that line got truncated when I copied and pasted. Can I see the line of code where you create the $p variable (i.e. new HTML::TokeParser::Simple)?
 
this is the full code i have, you'll see i added another condition to help break the loop before EOF but I get the same error..
Code:
my $p = new HTML::TokeParser::Simple( $cont );
    my $brk = 0;
    my $wanted = 0;



    while( my $t = $p->get_token && $brk == 0) {

        if( $t->is_end_tag( 'body' ) ){$brk=1;}

        # start with a <h2> tag followed by "Basic details for:"
        $wanted = 1 if ( $t->is_start_tag( 'h2' ) && $p->peek eq 'Basic details for:');

        # stop at <!-- MIFID Changes --> comment
        $wanted = 0 if ( $t->is_comment && $t->as_is eq '<!-- MIFID Changes -->' );

        # print if $wanted is true
        print $t->as_is if $wanted;
    }

 
If you pass in a plain scalar, the module will interpret it as the filename to read from. If $cont contains the actual HTML code itself, you need to pass the module a reference to it, i.e.
Code:
my $p = new HTML::TokeParser::Simple( $filename );
my $p = new HTML::TokeParser::Simple( \$scalar_containing_code );
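
To see why the plain-scalar form can leave $p undefined, here is a rough core-Perl sketch of the same convention. Note that open_source is a made-up helper for illustration only, not part of HTML::TokeParser::Simple: a scalar reference is read as in-memory data, a plain scalar is treated as a filename, and a failed open gives back undef.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's API): mimic the
# filename-vs-reference convention using Perl's own open().
sub open_source {
    my ($src) = @_;
    # A scalar ref opens an in-memory filehandle; a plain scalar is a path on disk.
    open( my $fh, '<', $src ) or return;    # undef on failure, e.g. no such file
    return $fh;
}

my $cont = "<h2>Basic details for:</h2>";

my $from_ref  = open_source( \$cont );    # reads the HTML held in $cont
my $from_name = open_source($cont);       # looks for a file named after the HTML

print defined $from_ref  ? "reference: opened\n"    : "reference: undef\n";
print defined $from_name ? "plain scalar: opened\n" : "plain scalar: undef\n";
```

Calling a method on the undef result is what produces a "Can't call method ... on an undefined value" error.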
 
Here's the regex solution, although I advise you to get ishnid's parser solution to work instead:

Code:
use strict;

my $html = do {local $/; <DATA>};

if ($html =~ m{\Q<h2>Basic details for:</h2>\E(.*?)\Q<!-- MIFID Changes -->\E}s) {
	print "Main Content = '$1'\n";
} else{
	warn "Unable to match content\n";
}

1;

__DATA__
<html>
<body>
<h2>Basic details for:</h2>
Main content here.
no really.
<!-- MIFID Changes -->
</body>
</html>

Pragmas (perl 5.8.8) used:
  strict - Perl pragma to restrict unsafe constructs

- Miller
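
For anyone who wants the "strip" wording of the original question rather than a capture, the same idea can be sketched with two substitutions. This is only a minimal sketch; strip_outside is a hypothetical helper name, and it assumes both markers appear exactly as written in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: trim a page down to the part between the two markers.
# Returns undef if either marker is missing.
sub strip_outside {
    my ($html) = @_;
    # Remove everything up to and including the first matching heading.
    $html =~ s{\A.*?\Q<h2>Basic details for:</h2>\E}{}s or return;
    # Remove the comment and everything after it.
    $html =~ s{\Q<!-- MIFID Changes -->\E.*\z}{}s or return;
    return $html;
}

my $page = join '',
    "<html><body>\n",
    "<h2>Basic details for:</h2>\n",
    "Main content here.\n",
    "<!-- MIFID Changes -->\n",
    "</body></html>\n";

my $kept = strip_outside($page);
print defined $kept ? $kept : "markers not found\n";
```

The /s modifier lets . match newlines, so both substitutions work across a multi-line page.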
 
spot on ishnid, though i'm getting a weird result....

it prints the first heading 'basic details for' but misses out the actual data up to the comment

Then it prints another heading for the next page fetched, but the data up to the comment is actually the previous webpage data that should have appeared?

 
Operator precedence? Try this replacement line:
Code:
    while( ( my $t = $p->get_token ) && $brk == 0) {
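
A small core-Perl sketch of the precedence point, in case it helps. Here next_token is a hypothetical stand-in for $p->get_token, so nothing below needs the parser module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $p->get_token: hands back the next item, undef when done.
my @queue = ( '<h2>', 'Basic details for:', '</h2>' );
sub next_token { return shift @queue }

my $brk = 0;

# Without parentheses, && binds tighter than =, so $t receives the result of
# (next_token() && $brk == 0), i.e. the value of the comparison, not the token.
my $t = next_token() && $brk == 0;
print "unparenthesised: $t\n";

# With parentheses, $u receives the token and the && is applied afterwards.
my $ok = ( my $u = next_token() ) && $brk == 0;
print "parenthesised:   $u\n";
```

In the unparenthesised form the loop variable holds a plain true/false value, so any method call on it would fail.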
 
nope, 'cos when you mentioned the error in not passing a scalar reference I removed my additional code for $brk and went back to the plain
Code:
while( my $t = $p->get_token ) {

though viewing the source, it seems the data being returned is correct, but the display when rendered is wrong, so i'm gonna play a bit more building a string of what i want and see if I can get the right order.

 
I'm still lost ishnid, I don't get why, when I use the same methods, I get the tags instead of the data between them. This code...

Code:
$p = new HTML::TokeParser::Simple( \$data );

    my $txt;

    while( my $t = $p->get_token) {

        if( $t->is_start_tag( 'p' ) ){
            $txt .= $t->as_is . ",";
        }

        if( $t->is_start_tag( 'td' ) ){
            $txt .= $t->as_is . ",";
        }

    }

print $txt;

Produces ....
<p>,<td>,<td>,<br />

Then I realised, if current row = start tag, it's the next INDEX that has the data, so am using peek and now yielding some results, will play more, and let you know how I fare :)

 
Each token is either a HTML tag or some text that appears between tags. In the code you posted above, you're checking if the current token ($t) is a 'p' tag. Then you add that tag to $txt (i.e. "<p>"). If you wanted to get the text in the tag after it, you could do something like this:
Code:
        if( $t->is_start_tag( 'p' ) ){
            $txt .= $p->get_token->as_is . ",";
        }
Of course you may want to check that the next token returned is actually text before adding it to $txt. A typical way of grabbing everything between two tags would be to use some sort of flag to indicate that you've encountered things you wish to hold onto (similar to what I've done above), e.g.:
Code:
my $flag = 0;
while( my $t = $p->get_token) {
   # stop if we encounter </p>
   $flag = 0 if ( $t->is_end_tag( 'p' ) );

   # add to the $txt variable if it's not a tag and
   # $flag is turned on
   $txt .= $t->as_is . "," if ( $flag && $t->is_text );

   # start if we encounter <p>
   $flag = 1 if ( $t->is_start_tag( 'p' ) );
}
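
For a version of the flag pattern that runs without the module installed, here is a minimal sketch in which the flag is switched on at <p> and off at </p>. The tokenising is a crude regex split used purely as a stand-in (an assumption for illustration, not how HTML::TokeParser::Simple works), and text_inside_p is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser: split HTML into tag and text pieces.
# This is NOT HTML::TokeParser::Simple, just enough to show the flag logic.
sub text_inside_p {
    my ($html) = @_;
    my @found;
    my $flag = 0;
    for my $piece ( grep { length } split /(<[^>]*>)/, $html ) {
        if ( $piece =~ m{\A</p\s*>\z}i ) {
            $flag = 0;                  # stop at </p>
        }
        elsif ( $piece =~ m{\A<p\b[^>]*>\z}i ) {
            $flag = 1;                  # start at <p>
        }
        elsif ( $flag && $piece !~ /\A</ ) {
            push @found, $piece;        # keep text pieces only, skip nested tags
        }
    }
    return @found;
}

print join( ",", text_inside_p('<td>skip</td><p>keep <b>this</b> too</p><p>and this</p>') ), "\n";
```

The order of the checks matters in the same way as in the loop above: the end tag is tested before anything is collected, so the closing tag itself never ends up in the result.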
 
Here is a slightly generalized version using HTML::Parser. As you can see, it's fairly obvious why so many alternative interfaces such as HTML::TokeParser have been derived from this module, as the handler interface (while powerful) is rather non-standard and rarely the most logical.

Code:
use HTML::Parser;

use strict;

my @start_cond = (
	{event => 'start', tagname => 'h2'},
	{event => 'text', text => 'Basic details for:'},
	{event => 'end', tagname => 'h2'},
);

my @end_cond = (
	{event => 'comment', text => '<!-- MIFID Changes -->'},
);

my $html = do {local $/; <DATA>};

my $p = HTML::Parser->new(
	default_h => [\&start_h, "self,event,text,tagname"],
);
$p->parse($html);

print "$p->{_text}";


######
# Supporting Functions

sub match_sequential_condition {
	my ($condsref, $self, $event, $text, $tagname) = @_;

	my $condition = $condsref->[$self->{_condition_index} ||= 0];
	my %token = (event => $event, text => $text, tagname => $tagname);

	my $is_match = 1;
	TEST:
	while (my ($key, $val) = each %$condition) {
		if (! exists $token{$key} || $token{$key} ne $val) {
			$is_match = 0;
			last TEST;
		}
	}

	if ($is_match) {
		$self->{_condition_index}++;
		$self->{_condition_index} %= @$condsref;
		$is_match = 0 if $self->{_condition_index}; # Still more tests
	} else {
		$self->{_condition_index} = 0;
	}

	return $is_match;
}


sub start_h {
	my ($self, $event, $text, $tagname) = @_;

	if (match_sequential_condition(\@start_cond, @_)) {
		$self->handler(default => \&end_h, "self,event,text,tagname");
		$self->{_text} = '';
	}
}

sub end_h {
	my ($self, $event, $text, $tagname) = @_;

	return $self->eof() if match_sequential_condition(\@end_cond, @_);

	$self->{_text} .= $text;
}

1;

__DATA__
<html>
<body>
<h2>Basic details for:</h2>
Main content here.
no really.
<!-- MIFID Changes -->
</body>
</html>

Pragmas (perl 5.8.8) used:
  strict - Perl pragma to restrict unsafe constructs
Other Modules used:
  HTML::Parser

- Miller

PS,
This took 20 minutes to throw together. This is because I have to relearn the interface every time I use this module. Definitely a sign it's not very intuitive.
 
Thanks Miller but whoooaaaaa!!!!

Man it was hard enough getting my head round the tokeParser, without you laying that on me - lol

but seriously I can see why these other wrapper modules were made, I struggle at the best of times with some CPAN modules, without this crazy mumma!

But don't fear, I got it cracked in the end with Ishnid's kind help (you really are a gent Mr Ishnid, my sincere thanks for your help as always!).

my final code is a bit clunky , but it works..
Code:
foreach my $firm (@firms){

    my @sid = split(/\?/,$firm);
    my $page = "$URL" . $sid[1];
    my $mech = WWW::Mechanize->new();
    $mech->get( $page );
    
    my $cont = $mech->content( base_href => [undef] );

    my $p = new HTML::TokeParser::Simple( \$cont );
    my $wanted = 0;
    my $data;

    while( my $t = $p->get_token) {

        # start with a <h2> tag followed by "Basic details for:"
        $wanted = 1 if ( $t->is_start_tag( 'h2' ) );

        # stop at <!-- MIFID Changes --> comment
        if ( $t->is_comment && $t->as_is eq '<!-- MIFID Changes -->' ){$wanted = 0; print "<hr />";}

        # print if $wanted is true
        $data .=  $t->as_is if $wanted;
    }


    # list-context match: $comms is undef when there is no match (avoids the unreliable "my ... if" form)
    my ($comms) = $data =~ m{\Q<th>Phone:<br/>Fax:<br/>Email:<br/>Website:<br/></th>\E(.*?)\Q</td>\E}s;

    my @comms = split(/<br\/>/,$comms);

    $comms[0] =~ s/<td>//gi;

    for(@comms){
        $_ =~ s/^\s+//;
	    $_ =~ s/\s+$//;
    }

    # list-context match: $addy is undef when there is no match (avoids the unreliable "my ... if" form)
    my ($addy) = $data =~ m{\Q<th>Address:</th>\E(.*?)\Q</td>\E}s;

    my @addy = split(/<br\/>/,$addy);

    $addy[0] =~ s/<td>//gi;

    for(@addy){
        $_ =~ s/^\s+//;
	    $_ =~ s/\s+$//;
    }

    $p = new HTML::TokeParser::Simple( \$data );


    my ($status,$edate,$agent,$ins,$add,$note) = (0) x 6;    # initialise every flag, not just the first

    while( my $t = $p->get_token) {

        if( $t->is_start_tag( 'p' ) ){
            my $peek = $p->peek;
            my @txt = split(/-/,$peek);
            for(@txt){
                $_ =~ s/^\s+//;
	            $_ =~ s/\s+$//;
            }
            $txt .= $txt[0] . "," . $txt[1] . ",";
        }

        if( $t->is_start_tag( 'th' ) && $p->peek eq 'Current status:'){
            $status = 1;
        }

        if( $t->is_start_tag( 'th' ) && $p->peek eq 'Effective Date:'){
            $edate = 1;
        }

        if( $t->is_start_tag( 'th' ) && $p->peek eq 'Tied Agent:'){
            $agent = 1;
        }

        if( $t->is_start_tag( 'th' ) && $p->peek eq 'Undertakes Insurance Mediation:'){
            $ins = 1;
        }

        if( $t->is_start_tag( 'th' ) && $p->peek eq 'Address:'){
            my $ad = @addy - 1;
            for(my $x = 0; $x < $ad; $x++){
                $txt .= $addy[$x] . ",";
            }
        }

        if( $t->is_start_tag( 'th' ) && $p->peek eq 'Phone:'){
            my $com = @comms - 1;
            for(my $x = 0; $x < $com; $x++){
                $txt .= $comms[$x] . ",";
            }
        }                       

         if( $t->is_start_tag( 'th' ) && $p->peek eq 'Notices:'){
            $note = 1;
        }

        if( $t->is_start_tag( 'td' ) && $status ){
            $txt .= $p->peek . ",";
            $status = 0;
        }

        if( $t->is_start_tag( 'td' ) && $edate ){
            $txt .= $p->peek . ",";
            $edate = 0;
        }        

        if( $t->is_start_tag( 'td' ) && $agent ){

            if($p->peek eq '</td>'){
                $txt .= ",";
            }
            else{
                $txt .= $p->peek . ",";
            }

            $agent = 0;
        }

        if( $t->is_start_tag( 'td' ) && $ins ){
            $txt .= $p->peek . ",";
            $ins = 0;
        }

        if( $t->is_start_tag( 'td' ) && $note ){

            if($p->peek eq '</td>'){
                $txt .= ",";
            }
            else{
                $txt .= $p->peek;
            }

            $note = 0;
            $txt .= "\n"; 
        } 

    }

}
You'll notice a couple of sections where I've had to fall back on Miller's RegEx, because the pesky '<br/>' tags were breaking the parser and killing some of the logic.

I went a bit 'blank' on how to deal with that while looping the tokeparser item for each webpage.

but with a mishmash of both methods and some clunky for loops to tidy up whitespace, i got my result.
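
For what it's worth, the split-on-'<br/>' and whitespace tidy-up could probably be folded into one small helper. This is only a sketch under assumptions: cell_parts is a hypothetical name, and the sample cell contents are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a table cell on <br> variants and trim each part.
sub cell_parts {
    my ($cell) = @_;
    return () unless defined $cell;          # no match upstream, nothing to split
    $cell =~ s{</?td[^>]*>}{}gi;             # drop any <td> / </td>
    my @parts = split m{<br\s*/?>}i, $cell;  # trailing empty fields are dropped
    s/\A\s+|\s+\z//g for @parts;             # trim both ends of each part in place
    return @parts;
}

# Invented sample data, not taken from the real pages.
print join( "|", cell_parts("<td> 1 High St<br/>\n  London <br />AB1 2CD<br/>") ), "\n";
```

Returning an empty list for an undefined cell means the caller can loop over the result without checking whether the earlier match succeeded.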

So many thanks and a star to you both :)

 