collecting files from web ?

Status
Not open for further replies.

MoshiachNow

IS-IT--Management
Joined
Feb 6, 2002
Messages
1,851
Location
IL
HI,

I have a task for which I must collect complicated (bigger than 500KB) PostScript (xx.ps) and PDF sample files.
I want to write some code that will, as Google does, search the web for these types of files, get the file info, and, if it matches my criteria, download the file to my computer.

Which Perl/CPAN modules or functions should I use for this task?
Thanks
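
A minimal sketch of the "check the file info, then download" part of this task, assuming LWP::UserAgent (from libwww-perl) is installed. The function names and the 500 KB constant are illustrative, not from any existing script. The size comes from a HEAD request's Content-Length header, which some servers do not send; this sketch simply skips such files.

```perl
use strict;
use warnings;

# Assumed criterion from the post: files larger than 500 KB.
my $MIN_BYTES = 500 * 1024;

# True if the URL path ends in .ps, .ps.Z, .ps.gz or .pdf (case-insensitive).
sub looks_like_sample {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*//s;    # ignore query string and fragment
    return $path =~ /\.(?:ps(?:\.(?:Z|gz))?|pdf)$/i ? 1 : 0;
}

# True if the reported size is known and exceeds the threshold.
sub big_enough {
    my ($bytes) = @_;
    return (defined $bytes && $bytes > $MIN_BYTES) ? 1 : 0;
}

# Ask the server for the size with a HEAD request, then download the file
# if it qualifies. Returns the saved file name, or undef if the file was
# skipped or a request failed.
sub fetch_if_wanted {
    my ($url, $dir) = @_;
    return undef unless looks_like_sample($url);

    require LWP::UserAgent;              # loaded on first use
    my $ua = LWP::UserAgent->new(timeout => 30);

    my $head = $ua->head($url);
    return undef unless $head->is_success;
    return undef unless big_enough($head->header('Content-Length'));

    (my $name = $url) =~ s/[?#].*//s;    # strip query string and fragment
    $name =~ s{.*/}{};                   # keep only the last path segment
    my $file = "$dir/$name";

    # ':content_file' streams the body straight to disk.
    my $res = $ua->get($url, ':content_file' => $file);
    return $res->is_success ? $file : undef;
}
```

A call such as `fetch_if_wanted($url, 'samples')` would then save a qualifying file under an existing `samples` directory (a hypothetical name).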



Long live king Moshiach !
 
no but there are other modules to search Google... you'll need to get a Google key first, which they use to stop people from abusing their site by running millions of searches a second or something...

I'd suggest getting a Google key and playing nice with them as opposed to using LWP and making a Google search parser, because they might possibly ban you from using their search engine if you hit them too frequently (i.e. more hits than a key would've allowed)

-------------
Cuvou.com | The NEW Kirsle.net
 
He didn't say he wants to hit Google; he wants a Google-like search script.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
ohh, like a web crawler.

I believe there's an LWP::Robot module that does just that (or it might be LWP::Spyder, or something like that; look around for it).
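
For reference, the libwww-perl distribution ships a module called LWP::RobotUA, which may be the one meant here. It is not a crawler by itself; it is a user agent that fetches and obeys robots.txt and waits between requests to the same host. The sketch below only shows how one might set it up; the agent name, e-mail address and helper names are placeholders. Note that its delay is given in minutes, not seconds.

```perl
use strict;
use warnings;

# LWP::RobotUA takes its delay in MINUTES, so convert from seconds.
sub seconds_to_minutes { return $_[0] / 60 }

# Build a user agent that honours robots.txt and pauses between requests
# to the same server. Agent name and e-mail address are placeholders.
sub make_robot {
    my (%opt) = @_;
    require LWP::RobotUA;                # loaded on first use
    my $ua = LWP::RobotUA->new(
        agent => $opt{agent} || 'ps-collector/0.1',
        from  => $opt{from}  || 'you@example.com',
    );
    my $delay_s = defined $opt{delay_s} ? $opt{delay_s} : 0.5;
    $ua->delay(seconds_to_minutes($delay_s));
    return $ua;
}
```

The returned object can be used like a normal LWP::UserAgent, for example `make_robot(delay_s => 2)->get($url)`.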

-------------
Cuvou.com | The NEW Kirsle.net
 
I think my needs are more like Google's functionality.
What I do now, manually, is search Google for the string ".ps.Z" (compressed PostScript files).
Then I go page by page and download these files one by one, then process them.
I do think I need to access some existing search engine.
I was thinking of introducing a half-second delay between searches, so as not to saturate the engine or the network.
I would also limit the total number of found files to, say, 1000.
I will appreciate further advice on searching and downloading the found files.
Thanks
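
The delay and the cap described above could be sketched as a small loop. This is only an illustration: `collect_files` and its options are made-up names, and `$fetch` stands for whatever code actually downloads one URL and returns a true value on success. Time::HiRes is a core module and is used here because the built-in sleep only takes whole seconds.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);    # core module; allows fractional-second sleeps

# Visit candidate URLs one at a time, pausing between requests and stopping
# once enough files have been collected. $fetch is any code ref that returns
# a true value (e.g. a saved file name) when a URL yielded a usable file.
sub collect_files {
    my ($urls, $fetch, %opt) = @_;
    my $delay = defined $opt{delay} ? $opt{delay} : 0.5;    # seconds between requests
    my $max   = defined $opt{max}   ? $opt{max}   : 1000;   # cap on collected files

    my @saved;
    my $first = 1;
    for my $url (@$urls) {
        last if @saved >= $max;
        sleep($delay) if !$first && $delay > 0;   # no pause before the first request
        $first = 0;
        my $file = $fetch->($url);
        push @saved, $file if $file;
    }
    return @saved;
}
```

Called as `collect_files(\@urls, \&download_one)`, with `download_one` being a hypothetical downloader, it uses the half-second delay and the 1000-file limit by default.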


Long live king Moshiach !
 
Scraping Google pages like that is probably a violation of their terms of service. It definitely contradicts the directives in their robots.txt file.

If you're using Google to find your pages, you should use one of the available APIs.
 
HI,

I now have the "Google key".
I guess I need further advice here ...
I will introduce delays when hitting the web, to avoid saturating it.
Thanks

Long live king Moshiach !
 
HI,

Has anybody got an example of how to use Google for this type of goal? I did not get far with the Google examples ...
(I have the Google key now.)
Thanks

Long live king Moshiach !
 
Here's something I wrote a long time ago which searched Google... it was a command for one of my AIM chat bots... I won't convert it into more generalized code, but all the functionality is here, so it's just an example...

Code:
#      .   .             <CKS Juggernaut>
#     .:...::     Command Name // !google
#    .::   ::.     Description // A better Google search.
# ..:;;. ' .;;:..        Usage // !google <query>
#    .  '''  .     Permissions // Public
#     :;,:,;:         Listener // All Listeners
#     :     :        Copyright // 2004 Chaos AI Technology

sub google {
	my ($self,$client,$msg,$listener) = @_;

	# They need a search query.
	if (length $msg == 0) {
		return "You must provide a query to search for.\n\n"
			. "!google <lt>search string<gt>";
	}

	my $key = $chaos->{_system}->{config}->{googlekey};
	if (length $key == 0) {
		# A valid key is required for this command.
		if (isMaster($client,$listener)) {
			return "This command requires you to obtain a Google Search Key. You "
				. "can get one at http://www.google.com/apis/ . Install the "
				. "Google key by opening startup.cfg and adding the variable "
				. "\"Google Key\" and insert the new key as its value, i.e.\n\n"
				. "Google Key=your new google key";
		}
		else {
			return "My botmaster has not obtained a Google key, and this command "
				. "cannot be used without a valid key.";
		}
	}

	use SOAP::Lite;
	my $google = SOAP::Lite->service ('file:./lib/GoogleSearch.wsdl');
	my $query = $msg;
	my $result = $google->doGoogleSearch($key, $query, 0, 5, 'false', '', 'false', '', 'latin1', 'latin1');

	my $reply;
	foreach my $element (@{$result->{resultElements}}) {
		$reply .= "$element->{title}\n"
			. "$element->{URL}\n\n";
	}

	return "Google Search Results\n\n" . $reply;
}

{
	Category => 'General Utilities',
	Description => 'Google Search',
	Usage => '!google <search string>',
	Listener => 'All',
};

You'll have to find GoogleSearch.wsdl on your own, but searching Google will probably turn up something (I think I may have gotten it from the Google API pages themselves, I don't remember).

-------------
Cuvou.com | The NEW Kirsle.net
 
Kirsle ,

Message from: Google SOAP Search API (Beta)
Google Code Home > Google SOAP Search API :

"As of December 5, 2006, we are no longer issuing new API keys for the SOAP Search API. Developers with existing SOAP Search API keys will not be affected."


I will appreciate further ideas.

Long live king Moshiach !
 
"As of December 5, 2006, we are no longer issuing new API keys for the SOAP Search API. Developers with existing SOAP Search API keys will not be affected."

Did you try running code like I had above with SOAP::Lite? You have a Google key now, so try using that.

If you need that GoogleSearch.wsdl, I'll temporarily link it here for now:
-------------
Cuvou.com | The NEW Kirsle.net
 
Thanks, Kirsle.
I have altered the code a bit and entered my key.
It only prints "Google Search Results" now ...
Where is "isMaster" coming from, anyhow?
=================================
use SOAP::Lite;


my ($self,$client,$msg,$listener) = @_;
my $msg="postscript";

# They need a search query.
if (length $msg == 0) {
    return "You must provide a query to search for.\n\n" . "!google <lt>search string<gt>";
}

my $key = "ABQIAAAALDB_-D9we8THVaL329c5axT2yXp_ZAY8_ufC3CFXhHIE1NvwkxTRvnN4TQO4H89WKIzqxg59T7prng";
if (length $key == 0) {
    # A valid key is required for this command.
    if (isMaster($client,$listener)) {
        print "This command requires you to obtain a Google Search Key. You "
            . "can get one at . Install the "
            . "Google key by opening startup.cfg and adding the variable "
            . "\"Google Key\" and insert the new key as its value, i.e.\n\n"
            . "Google Key=your new google key";
    } else {
        print "My botmaster has not obtained a Google key, and this command "
            . "cannot be used without a valid key.";
    }
}

my $google = SOAP::Lite->service ('file:GoogleSearch.wsdl');
my $query = $msg;
my $result = $google->doGoogleSearch($key, $query, 0, 5, 'false', '', 'false', '', 'latin1', 'latin1');

my $reply;
foreach my $element (@{$result->{resultElements}}) {
    $reply .= "$element->{title}\n"
        . "$element->{URL}\n\n";
}

print "Google Search Results\n\n" . $reply;




Long live king Moshiach !
 
Are you some kind of total n00b at Perl or something? Learn to pick out the relevant bits of code.

Code:
use SOAP::Lite;

my $key = "my-google-key-goes-here";

my $google = SOAP::Lite->service ('file:GoogleSearch.wsdl');
my $result = $google->doGoogleSearch ($key, "postscript", 0, 5, 'false', '', 'false', '', 'latin1', 'latin1');

my $reply;
foreach my $element (@{$result->{resultElements}}) {
   $reply .= "$element->{title}\n"
      . "$element->{URL}\n\n";
}
print "Google results:\n\n$reply";
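
When a script like this prints only its heading, it helps to tell "the call returned nothing" apart from "the search found nothing". A small sketch of a more defensive result printer follows; `format_results` is a made-up helper, not part of SOAP::Lite, and it assumes the result has the same `resultElements` / `title` / `URL` layout used above. Scalar::Util (a core module) is used so that blessed hashes are accepted too.

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);    # core module; sees through blessed references

# Turn a doGoogleSearch-style result into printable text, reporting an
# empty or failed response instead of silently printing nothing.
sub format_results {
    my ($result) = @_;
    return "No response (check the key, the WSDL path and the network).\n"
        unless (reftype($result) || '') eq 'HASH';

    my $elements = $result->{resultElements};
    return "The search returned no results.\n"
        unless (reftype($elements) || '') eq 'ARRAY' && @$elements;

    # One title line and one URL line per hit, separated by a blank line.
    return join '', map { "$_->{title}\n$_->{URL}\n\n" } @$elements;
}
```

In the script above, the loop and final print could then be replaced by `print "Google results:\n\n", format_results($result);`.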

-------------
Cuvou.com | My personal homepage
Project Fearless | My web blog
 
Kirsle,

I have some experience with Perl, but not enough yet.
I guess we all have our days and appreciate patience from much more experienced guys ... One can never tell how much pressure the other guy is under right now at his job ...

I ran just the above code and got no results.
I checked with my MIS guys: my program accesses port 80, and the firewall is not blocking it.
Thanks

Long live king Moshiach !
 