
Searching the web for large PDF files and downloading them


MoshiachNow (IS-IT--Management)
Hi,

For a project I'm involved in, I need to find PDF files on the web that are larger than 20,000 pages, then download them.

Where should I start doing this with Perl?
I have already done some checking and I can find URLs containing a specific string.
But I actually need to achieve the more complicated goal described above.
Will appreciate ideas.
Thanks

Long live king Moshiach !
 
OK,

I have managed this with the following code:

# Search the web for PDF files and keep only those longer than 20,000 pages
use warnings;
use strict;

use LWP::Simple qw(getstore is_success);
use WWW::Search;
use CAM::PDF;

my ($NOofPages, $status, $fileName, $url, $pdf);
my $TEMP  = $ENV{'TEMP'};
my $query = '.pdf';

my $search = new WWW::Search('...');            # search engine backend (name was stripped from the post)
$search->http_proxy(['http','ftp'] => '...');   # proxy URL (also stripped from the post)
$search->native_query($query);
chdir($TEMP);

while (my $result = $search->next_result()) {
    next unless ($result->url =~ /\.pdf$/);
    $url = $result->url;
    ($fileName) = $url =~ m!.*\/(.*\.pdf)!;
    next if ($fileName =~ /of612\.pdf/);
    print "Getting url=$url=, filename=$fileName=, destination=$TEMP\\$fileName\n";

    # getstore() returns an HTTP status code, not the content
    $status = getstore($url, "$TEMP\\$fileName");
    warn "Couldn't get it!" unless is_success($status);

    $pdf = CAM::PDF->new("$TEMP\\$fileName");    # open the file to establish the number of pages
    next unless ($pdf);
    $NOofPages = $pdf->numPages();
    print "\tNo of pages = $NOofPages\n";
    if ($NOofPages < 20000) {
        system("del /q \"$TEMP\\$fileName\"");
    } else {
        system("xcopy /y \"$TEMP\\$fileName\" D:\\DOcuments\\BigPdf >NUL");
        system("del /q \"$TEMP\\$fileName\"");
    }
}

========================
What I would like to do is check the number of pages in the PDF BEFORE downloading it, so that I download only those that are larger than 20,000 pages.
Will appreciate ideas.
thanks

Long live king Moshiach !
 
I don't think you can. About the closest you can get is to do an HTTP HEAD request on the file to see how big it is before you download it. Set some arbitrary threshold value and only attempt to download anything over that size.
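For instance, a rough sketch of that idea (the URL and the 50 MB threshold below are only illustrative assumptions, not values from the thread):

# Rough sketch: use an HTTP HEAD request to read Content-Length before downloading
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://example.com/some.pdf';   # hypothetical URL

my $response = $ua->head($url);
if ($response->is_success) {
    my $size = $response->header('Content-Length');    # may be undef if the server doesn't send it
    if (defined $size && $size > 50_000_000) {          # arbitrary 50 MB threshold
        print "Worth downloading: $url ($size bytes)\n";
        # ...then fetch it with getstore() as in the script above...
    }
} else {
    warn "HEAD request failed for $url: ", $response->status_line, "\n";
}

Note that the server has to send a Content-Length header for this to work, and file size is only a rough proxy for page count.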

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
try removing http from the proxy arguments:

['http','ftp']

['ftp']
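
In the posted script that change would look something like this (the proxy URL itself was stripped from the original post, so it stays a placeholder):

# Route only ftp requests through the proxy; http goes direct
$search->http_proxy(['ftp'] => '...');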

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I don't think that will work. That just specifies which protocols go through the proxy, and I think he will use HTTP to do the searching.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
You could just ignore any URLs that don't have ftp.

Any reason you're ignoring all the HTTP PDF files?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 