reciprocal link checking

Status
Not open for further replies.

Jeremy21b

Technical User
Jul 7, 2002
127
CA
In a PHP script I saw a command to check a particular webpage for a certain link. The problem is that my website is written in ASP. Does ASP have a similar command or a way to do this? Basically, we do link exchanges with hundreds of websites. Periodically I need to check for return links. Currently I have to check each one manually, which can be quite cumbersome. Essentially it would be parsing the HTML of a certain webpage to see if href=" exists in the code. I know the InStr function, so I just need to know how to parse an external webpage.
 
use the XMLHTTP object

couple of basic spider functions

Code:
Function GetPageContent(strURL)
  ' Fetch strURL synchronously and return the response body as text
  Response.Buffer = True
  Dim xml
  Set xml = Server.CreateObject("Microsoft.XMLHTTP")

  xml.Open "GET", strURL, False
  xml.setRequestHeader "User-Agent", "Googlebot/2.1+(+http://www.googlebot.com/bot.html)"
  xml.Send
  GetPageContent = xml.responseText
  Set xml = Nothing
End Function

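Function GetPageResponse(strURL)
  ' Fetch strURL synchronously and return all response headers as text
  Response.Buffer = True
  Dim xml
  Set xml = Server.CreateObject("Microsoft.XMLHTTP")

  xml.Open "GET", strURL, False
  xml.setRequestHeader "User-Agent", "Googlebot/2.1+(+http://www.googlebot.com/bot.html)"
  xml.Send
  GetPageResponse = xml.getAllResponseHeaders()
  Set xml = Nothing
End Function

' Assumed helper: returns the numeric HTTP status code (200, 404, ...),
' which is what the usage below compares against
Function GetPageStatus(strURL)
  Dim xml
  Set xml = Server.CreateObject("Microsoft.XMLHTTP")

  xml.Open "GET", strURL, False
  xml.setRequestHeader "User-Agent", "Googlebot/2.1+(+http://www.googlebot.com/bot.html)"
  xml.Send
  GetPageStatus = xml.status
  Set xml = Nothing
End Function

usage
Code:
Dim strPageCode
If GetPageStatus(url) <> 404 Then
  strPageCode = GetPageContent(url)
End If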
With a bit of coding work and a database of links you can set up a loop and process them automatically.

BTW the user-agent is set to the same as Google because at the moment I use this for UA cloaking detection.
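
A plain InStr match on your domain also succeeds when a page merely mentions the address in its text. As a rough sketch of a stricter check, assuming the return link is an absolute http or https href, a regular expression can look for the domain inside an href attribute. The name PageHasLink and the domain mydomain.com are made up for illustration, and the pattern is deliberately loose; it is not a real HTML parser.

```vbscript
' Hypothetical helper: True when html contains an href that points at domain.
' Assumes an absolute link such as href="http://www.mydomain.com/..."
Function PageHasLink(html, domain)
   Dim re
   Set re = New RegExp
   re.IgnoreCase = True
   ' href, optional whitespace around "=", optional quote, scheme, optional "www.",
   ' then the domain with its dots escaped so they match literally
   re.Pattern = "href\s*=\s*[""']?https?://(www\.)?" & Replace(domain, ".", "\.")
   PageHasLink = re.Test(html)
   Set re = Nothing
End Function
```

It could be used wherever the page text is already available, for example If PageHasLink(GetPageContent(url), "mydomain.com") Then ...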


Chris.

Indifference will be the downfall of mankind, but who cares?
A website that proves the cobblers kids adage.
Nightclub counting systems

So long, and thanks for all the fish.
 
Hi Chris.

Thanks for the code. It looks like exactly what I need. I'll try to set it up sometime this week. So for User-Agent, do I use the website that I'm checking for a link on?
 
No. The user-agent should identify where the request is coming from.
As you are using it for link checking, the best thing to do would be to give it a name and add the URL of a page on your website that explains its presence on the site. So you could use
Code:
    xml.setRequestHeader "User-Agent","LinkChecker+(+http://www.mydomain.com/linkcheck.asp)"
That will give webmasters who check their logs and block rogue crawlers a chance to check it out and not block it.
To check the page set the value of URL before calling the functions
Code:
Dim url, strPageCode
url = "http://www.yahoo.com/links.asp" ' for example
strPageCode = GetPageContent(url)
The " is essential and, just for completeness, you may want to URLEncode any query-string values in url before passing it to the routines.


Chris.

 
Hi Chris.

I finally had some time to try to get that code working. It works exactly as it should. Thank you very much. Now I can catch those bastards who delete my return link.

--Jeremy
 
One thing I would suggest would be to set this up as a .vbs script file that reads the entries from a table in the database, then updates a field that signifies either the last date checked or a boolean for whether they had your link or not (or both).

Then it would be nothing to build a simple viewing page that let you view all the links or all the links that didn't link back.

The reason I say use a vbs file is because then you could schedule it to run using Windows Scheduler, maybe have it run Monday at 1am or something. That automates the process and gets around any timeout issues you would have by executing the checks from an ASP page.


I think I would suggest the table be set up something like:
LinkTable
site_id - autoincrementing counter
site_address - text/varchar
last_change - Date/Time
site_available - boolean
site_had_link - boolean

Note: I decided that rather than just list the last run (which would be obvious if you have it scheduled), the date field will now hold the last time the status changed. So if a site fails 4 weeks in a row, you would see that the link has been missing for 4 weeks, or something to that effect.

Obviously if you already have the links in the database you could trim this table down to just link to that existing link table, but anyways, working with this as an example:
Code:
' ReciprocalCheck.vbs

'----- Define Globals
'Three constants that define the possible results of a check
Const PASS = 0
Const FAIL = 1
Const UNAVAIL = 2

Const YOUR_ADDRESS = "your-address-here.com"

'----- Get Data From DB
Dim sql_links, rs_links, conn
Set conn = CreateObject("ADODB.Connection")
conn.Open "your connection string"

sql_links = "SELECT site_id, site_address FROM YourTable"
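Set rs_links = conn.Execute(sql_links)

'dump data into array so we can close connection while evaluating links
'(GetRows returns a plain array, so no Set here; it raises an error if the table is empty)
Dim arr_links
arr_links = rs_links.GetRows()
rs_links.Close
Set rs_links = Nothing
conn.Close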

'----- Run Link Checks
'build three variables to hold ids of those that pass, those that don't pass, and those that aren't available
Dim str_pass, str_unavail, str_fail

'counter used to loop through array and results of call to check function
Dim link_ctr, result

'loop through array running checks on each link
For link_ctr = 0 to UBound(arr_links,2)
   result = CheckLink(arr_links(1,link_ctr))
   Select Case result
      Case PASS
         str_pass = str_pass & arr_links(0,link_ctr) & ","
      Case FAIL
         str_fail = str_fail & arr_links(0,link_ctr) & ","
      Case UNAVAIL
         str_unavail = str_unavail & arr_links(0,link_ctr) & ","
      Case Else
         'this is an error case that should never occur; maybe have it write to an error log or something
   End Select
Next

'now we have three lists of IDs. Trim off the final comma from each one if it has entries:
If Len(str_pass) > 0 Then str_pass = Left(str_pass, Len(str_pass) - 1)
If Len(str_fail) > 0 Then str_fail = Left(str_fail, Len(str_fail) - 1)
If Len(str_unavail) > 0 Then str_unavail = Left(str_unavail, Len(str_unavail) - 1)

'----- Update Database
'Now re-open the connection and execute 3 update statements
conn.Open "your connection string"

'So basically you only have to commit 3 updates now, and each one only updates
'   the records in its list if their status (availability + had_link) has
'   changed from their last entry. An empty list is skipped, since IN () is not valid SQL.
If Len(str_pass) > 0 Then conn.Execute "UPDATE YourTable SET last_change = Now(), site_available = True, site_had_link = True WHERE site_id IN (" & str_pass & ") And (site_had_link = False OR site_available = False)"

If Len(str_unavail) > 0 Then conn.Execute "UPDATE YourTable SET last_change = Now(), site_available = False, site_had_link = True WHERE site_id IN (" & str_unavail & ") And (site_had_link = False OR site_available = True)"

If Len(str_fail) > 0 Then conn.Execute "UPDATE YourTable SET last_change = Now(), site_available = True, site_had_link = False WHERE site_id IN (" & str_fail & ") And (site_had_link = True OR site_available = False)"

'Now Clean up
conn.Close()
Set conn = Nothing

'----- Function: CheckLink
'    Returns UNAVAIL, PASS, or FAIL if the linking page is unavailable, 
'    has the link, or does not have the link
Function CheckLink(address)
   Dim objXMLHTTP, content
   Set objXMLHTTP = CreateObject("Microsoft.XMLHTTP")
  
   objXMLHTTP.Open "GET", address, False
   objXMLHTTP.setRequestHeader "User-Agent","Googlebot/2.1+(+http://www.googlebot.com/bot.html)"
   objXMLHTTP.Send

   If objXMLHTTP.status <> 200 Then
      CheckLink = UNAVAIL
   Else
      content = objXMLHTTP.ResponseText
      If InStr(content,YOUR_ADDRESS) > 0 Then
         CheckLink = PASS
      Else
         CheckLink = FAIL
      End If
   End If

   Set objXMLHTTP = Nothing
End Function

I hadn't planned on writing all of the code, but oh well :p Not sure it is capable of running the first time; I wrote it on the fly (and it needs connection strings, a database, and a table).

Basically by placing this in a .vbs file you get several advantages:
1) No timeout issues like you would with an ASP page trying to run many many checks
2) The capability to schedule it

You could then build a viewing page that simply allowed you to view everything that had failed or everything that had been unavailable.

If you really wanted to know without having to remember to check a website, you could add some CDO code to this and either send yourself an email with all of the offending site addresses, or add some more fields to your table and send an email to the people that took your link down :)
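
As a minimal sketch of the "email yourself the offending addresses" idea, assuming CDO for Windows (the CDO.Message object) is available on the machine that runs the script: BuildReport and SendReport are hypothetical names, and the SMTP host, port, and email addresses below are placeholders, not values from this thread.

```vbscript
' Hypothetical helper: turns an array of site addresses into the body of a report
Function BuildReport(failedAddresses)
   BuildReport = "Sites missing the return link (" & (UBound(failedAddresses) + 1) & "):" & _
                 vbCrLf & Join(failedAddresses, vbCrLf)
End Function

' Hypothetical helper: sends bodyText through an SMTP server using CDO.Message
Sub SendReport(bodyText)
   Const CDO_SCHEMA = "http://schemas.microsoft.com/cdo/configuration/"
   Dim msg
   Set msg = CreateObject("CDO.Message")

   With msg.Configuration.Fields
      .Item(CDO_SCHEMA & "sendusing") = 2                    ' 2 = send over the network via SMTP
      .Item(CDO_SCHEMA & "smtpserver") = "smtp.example.com"  ' placeholder host
      .Item(CDO_SCHEMA & "smtpserverport") = 25              ' placeholder port
      .Update
   End With

   msg.From = "linkchecker@example.com"   ' placeholder sender
   msg.To = "you@example.com"             ' placeholder recipient
   msg.Subject = "Reciprocal link report"
   msg.TextBody = bodyText
   msg.Send
   Set msg = Nothing
End Sub
```

After the link checks, something like SendReport BuildReport(Array("http://site-one.example/links.htm")) would mail the list; in practice the array would be filled from the addresses that came back as FAIL.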

Anyways, hope this proves helpful,

-T

01000111 01101111 01110100 00100000 01000011 01101111 01100110 01100110 01100101 01100101 00111111
Help, the rampaging, spear-waving, rabid network gnomes are after me!
 
Star for the effort Tarwn! ;)

www.sitesd.com
ASP WEB DEVELOPMENT
 
Yes I think a vbs script would be much better. I have nearly 2000 links to check which would be quite time consuming through an ASP file. It would time out if I tried to check more than 20-30 urls at once. I haven't made a vbs file since school a few years ago, but I'm sure I could figure it out. If not, I'll just be back here for help :)

Thanks a lot.
 
Tarwn...I love you. The code you wrote was pretty much perfect. It did need some small changes, but I think I have it all under control.

Ok there is one issue that I've gotta figure out...the site_id varies from 1 to 4 digits. So if id '1234' is in one of the status lists (ie str_pass), it would also update sites with ids like 12, 123, 234, 2, etc. It seems like I might have to do an update for each entry. Anybody see a way around this?

Now I'm all excited about the other possibilities of this type of script. I think the next project is a script to automatically check order delivery dates through UPS's website.
 
nevermind about the ID's question....I'll just create 4 separate scripts...one for all 1 digit ids, one for all 2 digit, etc....It's a lazy way to solve a problem, but it'll work.
 
That doesn't seem right; it should be treating each comma-delimited value as a separate entity in the IN statement...what db are you using?
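
To illustrate the difference, here is a small VBScript sketch (IdInList is a hypothetical helper, not part of the script above). SQL's IN compares each listed value as a whole, which is what the Split-based check imitates, whereas InStr looks for a substring anywhere in the list.

```vbscript
' Hypothetical helper: True only when id equals one whole entry of the comma-delimited list
Function IdInList(id, csv)
   Dim item
   IdInList = False
   For Each item In Split(csv, ",")
      If Trim(item) = CStr(id) Then
         IdInList = True
         Exit Function
      End If
   Next
End Function

Dim ids, foundAsSubstring, foundAsValue
ids = "1234,56"
foundAsSubstring = (InStr(ids, "12") > 0)   ' True: "12" occurs inside "1234"
foundAsValue = IdInList(12, ids)            ' False: 12 is not one of the listed values
```

So an UPDATE with WHERE site_id IN (1234,56) should leave sites 12, 123, and 234 alone.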



 
Ok I never tested it to see if that would happen. I've just never used the IN statement. I assumed it was the same as instr.

I originally thought the code was working fine, but it is being inconsistent now. I run the code and it just doesn't update the database. Then when it would update, it would not properly recognize return links...some said pass when they weren't there and others said fail when they were there.

Is there a simple way to do error checking with a VBS script? If it were asp, I would just do some response.write's to check variables through the code's process. I could output to a text file, but is there an easier way to display something...like an alert or something?
 
Depends on how you execute it. If you execute it by double-clicking on it, then you can use MsgBox to pop up a message box:
MsgBox "Hi There"

If you are executing from the command line with cscript, you can print to stdout or stderr:
WScript.StdOut.Write "Hi There"

And of course you can also create a log file using FileSystemObject, just as you would from ASP:
Code:
Dim fso, fil
Set fso = CreateObject("Scripting.FileSystemObject")
Set fil = fso.CreateTextFile("MyLog.txt")

fil.WriteLine "Hi There"

fil.Close
Set fil = Nothing
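
The three options can be folded into one helper. This is only a sketch: DebugPrint is a hypothetical name, and it assumes the script runs under Windows Script Host. It writes to stdout when the host is cscript.exe and otherwise appends to MyLog.txt, so a scheduled or double-clicked run does not block on a message box.

```vbscript
' Hypothetical helper: print under cscript, otherwise append a timestamped line to a log file
Sub DebugPrint(msg)
   Dim fso, fil
   If InStr(1, WScript.FullName, "cscript.exe", vbTextCompare) > 0 Then
      ' WScript.StdOut is only usable when the host is cscript.exe
      WScript.StdOut.WriteLine msg
   Else
      Set fso = CreateObject("Scripting.FileSystemObject")
      ' 8 = ForAppending, True = create the file if it does not exist yet
      Set fil = fso.OpenTextFile("MyLog.txt", 8, True)
      fil.WriteLine Now & " " & msg
      fil.Close
      Set fil = Nothing
      Set fso = Nothing
   End If
End Sub
```

A call such as DebugPrint "checking " & address inside the loop shows which site the script was on when it stopped.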


Also, just in case it was overlooked, the db will only be updated when the new state is different from the old state. So if the last check was a pass and the current check was a pass, then that record won't be updated. Basically it operates on the assumption that everything gets checked every time, so knowing how long something has been in a certain state is more useful than knowing that the program actually checked every single record when it ran (which it should do anyway).

 
Well, I actually took out the code to only update if the state has changed. Basically I won't keep up any links that are not linked back. So I would check them all and remove the ones that aren't linked back. It just seems weird that it is not properly updating. I even tried changing the code to only check a certain id and it wouldn't update. It's almost as if it's running a cached version or else not running at all. My only other guess is that our dumb webhost blocked our IP from doing this kind of database update. Well, I'll try doing some error checking to figure out what the issue is. It's probably something like a minor syntax error.

Thanks a lot for all the help.
 
OK I found the bug. I was checking to see if the reciprocal link was null before checking for a return link. I'm not sure why the code thinks it is null, but it works after taking that clause out.
 
It's all working fine now...almost. It does have a problem if the website it is checking is down or the page has a redirect. The code either hangs or it returns an error on the line where it retrieves the page code. I have no clue how to fix this.
 
That's odd...it shouldn't even bother with the responseText unless the status is 200 (OK); anything else (like a redirect...err...301 or 302 maybe?) should come back as a bad link (at which point you should update your db entry to reflect their new address).

Website down or redirect should both end up in the fail area...argh, bad script go to your room. I'll have to play with it and see if I have any problems with it.
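
One possible cause, offered as a guess rather than a diagnosis: when the host cannot be reached, Send itself raises a run-time error, so the status check is never executed. Below is a hedged sketch of a more defensive version. It assumes MSXML's ServerXMLHTTP object is installed, uses its setTimeouts method so a dead site cannot hang the script, and traps errors around the request. CheckLinkSafe is a hypothetical name, the timeout values are arbitrary, and the constants are repeated only so the sketch stands alone.

```vbscript
Const PASS = 0
Const FAIL = 1
Const UNAVAIL = 2
Const YOUR_ADDRESS = "your-address-here.com"

' Hypothetical variant of CheckLink: any request error or non-200 status counts as UNAVAIL
Function CheckLinkSafe(address)
   Dim http, status, body, failed
   Set http = CreateObject("MSXML2.ServerXMLHTTP")
   ' Timeouts in milliseconds: resolve, connect, send, receive (set before Open)
   http.setTimeouts 5000, 5000, 10000, 15000

   ' Trap errors only around the request, then inspect what happened
   On Error Resume Next
   http.Open "GET", address, False
   http.setRequestHeader "User-Agent", "LinkChecker+(+http://www.mydomain.com/linkcheck.asp)"
   http.Send
   status = http.Status
   body = http.responseText
   failed = (Err.Number <> 0)
   Err.Clear
   On Error GoTo 0

   If failed Or status <> 200 Then
      CheckLinkSafe = UNAVAIL
   ElseIf InStr(1, body, YOUR_ADDRESS, vbTextCompare) > 0 Then
      ' vbTextCompare makes the match case-insensitive, unlike the original InStr call
      CheckLinkSafe = PASS
   Else
      CheckLinkSafe = FAIL
   End If

   Set http = Nothing
End Function
```

The error trapping is kept away from the If tests on purpose: under On Error Resume Next, an error raised while evaluating an If condition makes VBScript continue with the first statement inside the block, which could report a PASS that never happened.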

-T

 