
screen scraping - web content retrieval

Status
Not open for further replies.

basball

Technical User
Dec 18, 2002
192
US
I need some direction. I'm looking for an app that will let me take a snapshot of a web page and save the content, including images, to a MySQL database. My script will call the grabber app and supply it with user-defined URLs, which will in turn be archived in the MySQL database.

I've heard this referred to as screen scraping or content retrieval.

thank you.
 
Have a look at the socket functions in PHP. You use fsockopen($url, $port) to open a connection, fgets() or fputs() to read from or write to the connection, and fclose() to close the connection.

Bastien

Cat, the other other white meat
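In rough outline, the fsockopen/fgets/fclose flow Bastien describes looks like this (sketched in Python since the socket steps are the same in any language; the `fetch` helper and the canned response below are illustrative, not part of any library):

```python
import socket

def fetch(host, path="/", port=80):
    """Open a connection, send a GET request, and read until the server closes."""
    conn = socket.create_connection((host, port), timeout=10)  # like fsockopen()
    try:
        request = ("GET %s HTTP/1.0\r\n"
                   "Host: %s\r\n"
                   "Connection: close\r\n\r\n" % (path, host))
        conn.sendall(request.encode("ascii"))     # like fputs()
        chunks = []
        while True:
            data = conn.recv(4096)                # like fgets() in a loop
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        conn.close()                              # like fclose()

# The raw response splits into headers and body at the first blank line:
raw = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
headers, _, body = raw.partition(b"\r\n\r\n")
```

The body is what you would archive; the headers tell you the content type of what you fetched.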
 
The problem is retrieving the images. How is that accomplished?

I'm looking for a PHP app that will take a snapshot of a page and dump the elements as well as the images into a MySQL database. I have another PHP script that will act as the front end, allowing the user to enter the URL, which will then be picked up by the grabber or spider script, which dumps the contents into the MySQL database. I also want to be able to reconstruct the contents afterwards. Thank you.
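For the MySQL side of that, one possible layout (a sketch only; every table and column name here is hypothetical) is a pages table plus an assets table keyed to it, so a page and its images can be reconstructed together:

```sql
-- Sketch only; names are illustrative.
CREATE TABLE pages (
  page_id  INT AUTO_INCREMENT PRIMARY KEY,
  url      VARCHAR(255) NOT NULL,
  html     MEDIUMTEXT   NOT NULL,
  grabbed  DATETIME     NOT NULL
);

CREATE TABLE assets (
  asset_id INT AUTO_INCREMENT PRIMARY KEY,
  page_id  INT NOT NULL,
  url      VARCHAR(255) NOT NULL,   -- original URL, used to rewrite links on reconstruction
  content  MEDIUMBLOB   NOT NULL,   -- raw image/stylesheet bytes
  FOREIGN KEY (page_id) REFERENCES pages(page_id)
);
```

Purging an archive then just means deleting old rows from both tables by date.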
 
The trick to retrieving associated content (images, stylesheets, script inclusions) is parsing the HTML and separately fetching the resources referenced in the various tags.


One place you might look is SourceForge



Want the best answers? Ask the best questions!

TANSTAAFL!!
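As a sketch of that idea (Python's stdlib parser used for illustration; the `AssetCollector` class is my own, not a library): walk the tags, pull out the `src`/`href` attributes, resolve each against the page URL, and then fetch each resource separately.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, stylesheets, and scripts from a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and "src" in attrs:
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.assets.append(urljoin(self.base_url, attrs["href"]))

page = '<html><img src="/logo.gif"><link rel="stylesheet" href="style.css"></html>'
collector = AssetCollector("http://example.com/page.html")
collector.feed(page)
# collector.assets now holds the absolute URLs to fetch and store alongside the page
```

Each URL in `collector.assets` would then be downloaded and saved with the page so the whole thing can be reconstructed later.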
 
A note to screen scrapers:
Please respect the ownership rights of site owners. Either do it under fair use circumstances or get permission to use the content that's not yours.
 
Splitting the data based on tags, as sleipnir214 suggested, will take some regex to identify the required tags in the data.



Bastien

Cat, the other other white meat
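For instance, a single pattern can pull the src attributes out of img tags (shown in Python here; roughly the same pattern would go into PHP's preg_match_all, which is an assumption about how you'd port it):

```python
import re

html = '<p><img src="a.png" alt="logo"><IMG SRC=\'b.jpg\'></p>'

# Match the src attribute of <img> tags, tolerating either quote style and any case.
img_src = re.compile(r'<img[^>]+src\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
sources = img_src.findall(html)
```

Note that regexes over HTML are fragile (unquoted attributes, stray `>` characters inside attributes), so a real HTML parser is the safer route when one is available.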
 
The data I'm 'caching' will actually be purged every few months, so in effect I'm archiving. I don't know if screen scraping is the proper term.
 
basball,

The difficulty with what you're talking about is that to take a screenshot of a page, the page first needs to be rendered. Different browsers will do this differently (even the more modern versions of IE and NN), and rendering pulls in many other files, such as images and Flash movies. PHP on its own can't handle rendering all of these.

To do this yourself, you'd probably need to write some custom software in C/C#, Java or something else, which would either reimplement many of the features of a web browser or plug into the rendering engine of an existing browser.

Alternatively, there are third-party services that offer what you want, but you may need to pay for it. There are some free options and some paid ones.

If you have found a way of achieving this without going that route, please respond, as I too am trying to do this.

Thanks.
 
I have a Windows application that can rip entire sites, or anything that is linked on the site.

I don't remember the name offhand, and I'm at work.
In that program you can specify how deep it should follow links, whether it should go offsite, etc., and also which files it should store.
 
