LWP::UserAgent and Cookies trouble.


Numbski (MIS)
Mar 27, 2001
Okay, first, an exercise to know what we're dealing with. Open a cookies-enabled browser. Go to this URL:


This should pop up an authentication dialogue box. Use the username: numbski and password: slipup. There's nothing here to protect, it just keeps track of what I've downloaded when, no big deal. :)

This will bring back a page telling you that you're successfully logged in. As you get this page you're assigned a cookie, and a META-REFRESH tag takes you back to the main page.
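(Side note on that last step: LWP::UserAgent follows ordinary HTTP redirects, but it does not act on an HTML META-REFRESH tag, so anything a browser does via that tag has to be reproduced by hand. A rough sketch, assuming the tag uses the common content="0; url=..." form and reusing the $ua and $response variables from the script further down:)

Code:
if ($response->content =~ /<meta[^>]+http-equiv=["']?refresh["']?[^>]+url=([^"'>\s]+)/i) {
	my $refresh_url = $1;    # target of the META-REFRESH
	my $next = $ua->request(HTTP::Request->new(GET => $refresh_url));
}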

Now, go to this page:


Notice that the upper right-hand corner identifies you as 'numbski' (it remembered who you are), and at the bottom of the page there are two links: 'Yes I do own it' and 'Oops, I am sorry'. Look at the source. The 'yes' link has a randomly generated key appended to it, akin to this:

?1005558755|d1cbe3583b4477c207af7c85d0e75104

Alright, tag that on to the end of your current URL like so:


This is our final destination. It has a META-Refresh tag that will try to automatically download the file. Stop that and look at the page. There's a line like this:

download: if your download doesnt start automatically, click here

If all has gone well, the source of that link will look something like this:



Okay, I apologize, that was a mouthful. There was no getting around it though. :(
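(For reference, the link-massaging the script does further down boils down to something like this. The %7C handling is a guess at how the pipe character may come back from extract_links, not something verified against the live page:)

Code:
use URI::Escape qw(uri_unescape);

# split the extracted 'yes' link into base URL and session key, then
# reinsert the rom name before the query string
my ($base, $key) = split /\?/, $page2_url, 2;
$key = uri_unescape($key);        # turn a %7C back into a literal |
my $next_url = "$base$rom?$key";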

Now, onto the script. I have written an LWP::Simple script that goes through that exact process and is supposed to download the zipfile at the end. It worked great for about two weeks, then they added the authentication step and required a cookie. That's where my nightmares began. It looked like it would be a simple thing to fix: add the authentication, switch to LWP::UserAgent, and use cookies. But something is going horribly wrong. Instead of pulling the second download page like it's supposed to, it keeps pulling the first page over and over again. I have no idea why. If you add "|login=confirmed" to the end of the second URL, it will pull the second page correctly, but instead of the URL format you're supposed to get as shown above, you get something like this:


If you attempt to download that file, it'll download an HTML document stating 'you were linked by a bandwidth thief'.
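(Hedged guess on that one: 'bandwidth thief' pages are usually the result of a Referer check, so sending the page-2 URL as the referer on the zip request might be worth a try. A minimal sketch, reusing the variable names from the script below:)

Code:
my $zipreq = HTTP::Request->new(GET => $binary_link);
$zipreq->referer($page2_url);     # referer() is provided by HTTP::Headers
my $zipres = $ua->request($zipreq, "$rom.zip");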

Basically, I just need someone to look over my script and see if I'm screwing something up. I have debug enabled so you can see what's going on in the HTTP headers each step of the way. Helllp!
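(One quick check that might narrow things down, using the $cookies jar created in the script: dump the jar right after the login request and confirm a cookie for mame.dk actually landed in it. If the jar is empty, the first page coming back over and over would make sense.)

Code:
# debugging aid only: show what the cookie jar captured after login
print "Cookie jar contents:\n", $cookies->as_string, "\n";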

Begin code:
Code:
#!/usr/bin/perl


use Cwd;
use LWP::UserAgent;
use URI;
use URI::URL;
use HTML::Parse;
use HTML::Element;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;
use LWP::Debug '+';


system("cls");

##
#For now, I've disabled the actual login routines.  Made the login a global.

#print "Username: ";
#$username=<>;
#print "Password: ";
#$password=<>;
#chomp($username);
#chomp($password);

$username="numbski";
$password="slipup";

print "Checking Authentication...\n\n";

$ua = new LWP::UserAgent;
 
	#Hopefully creates a cookie jar that will catch mame.dk's cookie.
    	$cookies = HTTP::Cookies->new; 	# Create a cookie jar
	$ua->cookie_jar($cookies);	# Enable cookies

        #Tell the site that we're IE5.5 on Windows 2000
 	$ua->agent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)");

	#Request the login page, give it our username and password.
  	$req = new HTTP::Request GET => 'http://www.mame.dk/login.phtml';
  	$req->authorization_basic($username,$password);
   	my $response = $ua->request($req);
   	if ($response->is_success){
		$login_page = $response->content;
  		print "Login successful for $username.";
	} 
	else{
		die "Could not get login page content.";
	}


   	print "\n\nPress Enter to Continue...\n";
        $enter=<>;
        system("cls");


#Proceed to downloading the first page.  'Do you own this ROM?'
print "Downloading Page 1...\n\n";
$rom="pacman";

$page1 = new HTTP::Request GET => "http://www.mame.dk/download/rom/$rom/";


#Don't think this is needed after I'm logged in and have a cookie.
#$page1->authorization_basic($username,$password);

$res = $ua->request($page1);
if ($res->is_success){
	$page1_source = $res->content;
	print "Got page 1.\n\n";
} 
else{
	die "Could not get page content.";
}


#Find the links in this page, strip out the link to page 2.
$parsed_page1 = HTML::Parse::parse_html($page1_source);

for (@{ $parsed_page1->extract_links() }) {
	$link=$_->[0];
	$url      = new URI::URL $link;
	$full_url = $url->abs("http://www.mame.dk/download/rom/$rom");

        #Look for the URL with a question mark in it.  That's the one we need.
	if($full_url=~/\?/){
		$page2_url=$full_url;
      		chomp($page2_url);

        	#Since the structure of the URL is weird, we need to split it
         	#and add back in the rom name.
       		@url_parts=split(/\?/,$page2_url);
		#@session_id=split(/\%7C/,@url_parts[1]);

		$page2_url="$url_parts[0]$rom?$url_parts[1]";

		print "\nHere's the link I found!\n $page2_url\n";
     	
	      	&get_page2;
     	
  }
}



sub get_page2{
	$enter=<>;
	system(&quot;cls&quot;);

        #'Download pacman.zip' page....we hope anyway.
	$page2 = HTTP::Request->new ( GET => $page2_url);
	$page2->authorization_basic($username,$password);

	$res2 = $ua->request($page2);
	if ($res2->is_success){
		$page2_source = $res2->content;
		print "Page 2 Complete.\n";
		print "Have a source looksie:\n\n";
		print $page2_source;
		print "\n\nIf all looks well here, try downloading the zipfile.\n";
		print "Press ENTER to continue.\n";
     		$enter=<>;
      		&download_zip;
	} 

	else{
		die "Could not get page2 content";
	}
}



sub download_zip{

	@page2_content=split(/\n/,$page2_source);
	foreach $line(@page2_content){
		@line_parts=split(/</,$line);

  	     	#The following is a very poor parsing routine.  Will be replaced later.
   	    	#It's effective for our purposes though.

		#Begin stripping the binary link out of the correct line.
  	    	if($line_parts[2] eq "TD valign=\"top\" class=\"stdtext\">if your download doesnt start automatically, "){
			print "Found our link.\n";
			@link_parts=split(/<a href=/,$line);
			$binary_link=$link_parts[1];
			chomp($binary_link);
			$binary_link=~s*"**g;
			$binary_link=~s*>**g;
			$binary_link=~s*click here</a**g;

			#This MUST be in the form http://roms(2).mame.dk/randomchars/randomchars/cur/$rom.zip
			#I've been getting http://roms(2).mame.dk/$rom.zip, try it in IE or Netscape to see what I mean.
			print "Link is $binary_link.\n";

			print "Downloading $rom from http://roms.mame.dk\n";

			$rom_filename="$rom.zip";

			#Create our binary request.  Print out failures, if any.
  			#This should save the zipfile to the same directory as this script.

			my $zipfile = new HTTP::Request('GET', "$binary_link");
			my $response = $ua->request($zipfile, "$rom_filename");

			if($response->is_error()){
				print $response->status_line."\n\n";
			}
			else{
				print "Download of $rom.zip complete.\n\n";
			}
								                       	
              	}

	}
 	print "If you saw no text after downloading page 2, then it failed to get\n";
  	print "the correct page 2.";
}
 
Wow...buried already?

Just thought you should know that the site has switched off its need to authenticate, so I don't really know how my script will react now. If you have any insights, I'd still like to hear them.

Numbski
 
I'm not sure if this will help or not, but this is how I got around cookies in my proxy app:

if ($res->is_success)
{
...
# get the content type of the response
$type = $res->content_type;
unless (defined $type && $type ne "") {$type = "text/html";} # assign a default (in case of error)

# grab the headers from the response
$headers = $res->headers_as_string;
...
}

# strip the domain from cookies for transparency
$headers =~ s/(Set-Cookie:.*?) domain=.*?\n/$1\n/gis;

# print the passed HTTP headers, including cookies
print $headers;

# print the HTTP content-type header
print "Content-type: $type\n\n";

print "$content";

This effectively makes the CGI set cookies in your browser instead of the remote site, while the remote site sets cookies in your CGI's cookie jar. They should always be the same.
Sincerely,

Tom Anderson
CEO, Order amid Chaos, Inc.
 
Looking over what I just wrote, the $type= stuff should be outside of the "is_success" block.
Sincerely,

Tom Anderson
CEO, Order amid Chaos, Inc.
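(A minimal sketch of that rearrangement, reusing the variable names from the snippet above:)

Code:
my $type = "text/html";    # default first, in case the request failed
if ($res->is_success) {
	$type    = $res->content_type if $res->content_type;
	$headers = $res->headers_as_string;
}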
 
Here's another possible solution from some message archive I came across:

"I've found that using a proxy for http and https requests eliminates
this problem. This is a good workaround for me, but it might suggest
that there's something LWP is missing in the way of handling socket
connections that other software (Apache, Lynx, Netscape...) does.

One other (somewhat unrelated) thing I found that might help someone
else out is that other browsers change the method of a redirected POST
to GET. This was causing me problems where I couldn't get past the
login on a secure site - I kept getting redirected in a circle back to
the login page.

Here's some code that worked for me :

# User agent subclass to allow redirects on POSTs
@MyAgent::ISA = qw(LWP::UserAgent);
sub MyAgent::redirect_ok {
    my ($self, $request) = @_;
    if ($request->method eq "POST") {
        $request->method("GET");
    }
    return 1;
}

--
--------------------------------------------------------------
Fair Winds,
Chris Dunn, Software Engineer, MRM, Inc.

Email: chris.dunn@mrmnc.com
Phone: (919) 544-6500 Ext 228
Pager: (919) 506-0819

--------------------------------------------------------------"

Sincerely,

Tom Anderson
CEO, Order amid Chaos, Inc.
 