
Searching the web for large PDF files and downloading them


MoshiachNow (IS-IT--Management)
Hi,

For a project I'm involved in, I need to find PDF files on the web that are larger than 20,000 pages, then download them.

Where should I start doing this with Perl?
I have already done some checking and I can find URLs containing a specific string.
But I actually need to achieve the more complicated goal described above.
Will appreciate ideas.
Thanks

Long live king Moshiach !
 
OK,

I have managed this with the following code:

# Search the web for PDF files and keep only those longer than 20,000 pages
use warnings;
use strict;

use LWP::Simple qw(getstore is_success);
use WWW::Search;
use CAM::PDF;

my ($NOofPages, $status, $fileName, $url, $pdf);
my $TEMP  = $ENV{'TEMP'};
my $query = '.pdf';

my $search = new WWW::Search('...');            # search engine backend (name was stripped from the post)
$search->http_proxy(['http','ftp'] => '...');   # proxy URL (also stripped from the post)
$search->native_query($query);
chdir($TEMP);

while (my $result = $search->next_result()) {
    next unless ($result->url =~ /\.pdf$/);
    $url = $result->url;
    ($fileName) = $url =~ m!.*\/(.*\.pdf)!;
    next if ($fileName =~ /of612\.pdf/);
    print "Getting url=$url=, filename=$fileName=, destination=$TEMP\\$fileName\n";

    # getstore() returns an HTTP status code, not the content
    $status = getstore($url, "$TEMP\\$fileName");
    warn "Couldn't get it!" unless is_success($status);

    $pdf = CAM::PDF->new("$TEMP\\$fileName");    # open the file to establish the number of pages
    next unless ($pdf);
    $NOofPages = $pdf->numPages();
    print "\tNo of pages = $NOofPages\n";
    if ($NOofPages < 20000) {
        system("del /q \"$TEMP\\$fileName\"");
    } else {
        system("xcopy /y \"$TEMP\\$fileName\" D:\\DOcuments\\BigPdf >NUL");
        system("del /q \"$TEMP\\$fileName\"");
    }
}

========================
What I would like to do is check the number of pages in the PDF BEFORE downloading it, so that I download only those that are larger than 20,000 pages.
Will appreciate ideas.
thanks

Long live king Moshiach !
 
I don't think you can. About the closest you can get is to do an HTTP HEAD request on the file to see how big it is before you download it. Set some arbitrary threshold value and only attempt to download anything over that size.
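For instance, a rough sketch of that idea (the URL and the 50 MB threshold below are only illustrative assumptions, not values from the thread):

# Rough sketch: use an HTTP HEAD request to read Content-Length before downloading
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://example.com/some.pdf';   # hypothetical URL

my $response = $ua->head($url);
if ($response->is_success) {
    my $size = $response->header('Content-Length');    # may be undef if the server doesn't send it
    if (defined $size && $size > 50_000_000) {          # arbitrary 50 MB threshold
        print "Worth downloading: $url ($size bytes)\n";
        # ...then fetch it with getstore() as in the script above...
    }
} else {
    warn "HEAD request failed for $url: ", $response->status_line, "\n";
}

Note that the server has to send a Content-Length header for this to work, and file size is only a rough proxy for page count.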

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
try removing http from the proxy arguments:

['http','ftp']

['ftp']
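
In the posted script that change would look something like this (the proxy URL itself was stripped from the original post, so it stays a placeholder):

# Route only ftp requests through the proxy; http goes direct
$search->http_proxy(['ftp'] => '...');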

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I don't think that will work. That just specifies which protocols go through the proxy, and I think he will use HTTP to do the searching.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
You could just ignore any URLs that don't have ftp.

Any reason you're ignoring all the HTTP PDF files?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 