
How to parse web page HTML source code

Status
Not open for further replies.

yeungsprite

Programmer
Nov 28, 2002
25
US
Hello,

I am having trouble creating a JavaScript function to parse the HTML source code of a webpage in the Firefox browser. I have the following code, which tries to use the built-in calls 'window.document.body.innerHTML' and 'HTMLsource.indexOf('bug.gif', startIndex)', but something is wrong with it. Can anyone suggest any tips or other calls that would achieve the same result?

Thanks!

function CountBugs()
{
var startIndex, bugCount, CurrIndex, HTMLsource;
bugCount=0;
startIndex=1;

// DEBUG - does not store HTML code
HTMLsource=window.document.body.innerHTML;

while (1)
{
//starting at startIndex, look for index of string 'bug.gif'
CurrIndex=HTMLsource.indexOf('bug.gif', startIndex);

if (CurrIndex < 0 || startIndex > HTMLsource.length - 5)
{
alert("There are " + bugCount + " bugs on this page");
return bugCount;
}
//update CurrIndex to index of bug gif , but make sure you go to the next character to avoid an infinite loop
startIndex=CurrIndex + 1;
bugCount++;
}

}
 
Do this and think about why.
[tt]
function CountBugs()
{
var startIndex, bugCount, CurrIndex, HTMLsource;
bugCount=0;
//startIndex=1;
startIndex=0;

// DEBUG - does not store HTML code
HTMLsource=window.document.body.innerHTML;

var ssig="bug.gif";

while (1)
{
CurrIndex=HTMLsource.indexOf(ssig, startIndex);
if (CurrIndex != -1) {
bugCount++;
startIndex=CurrIndex+ssig.length;
} else {
alert("There are " + bugCount + " bugs on this page");
return bugCount;
}
}
}
[/tt]
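
For anyone reading along, here is a minimal sketch of the same counting loop as a standalone function, so it can be tried without a page. As far as I can tell the two changes that matter are starting at index 0 (strings are zero-indexed, so starting at 1 would miss a match at the very beginning) and stepping past the whole match instead of using a separate length check. The name countOccurrences and its parameters are made up for this example; they are not from the posts above.

```javascript
// Count non-overlapping occurrences of `needle` in `haystack`.
// Function and parameter names are hypothetical, chosen for this sketch.
function countOccurrences(haystack, needle) {
  if (needle.length === 0) {
    return 0; // indexOf("") always matches, so bail out to avoid an endless loop
  }
  var count = 0;
  var startIndex = 0; // zero-indexed: starting at 1 would skip a match at position 0
  var currIndex = haystack.indexOf(needle, startIndex);
  while (currIndex !== -1) {
    count++;
    // Jump past the whole match before searching again.
    startIndex = currIndex + needle.length;
    currIndex = haystack.indexOf(needle, startIndex);
  }
  return count;
}
```

In the thread's setting this would be called as countOccurrences(HTMLsource, "bug.gif") once HTMLsource actually holds the page markup.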
 
Thanks for the tip tsuji. I am still getting the following error message:

'Error: window.document.body has no properties'

The HTMLsource variable is unable to store the HTML code and I am confused as to why. Any ideas?

Thanks!
 
If you get an error like that, you have to tell the forum what you are testing against. If you set CountBugs as the onload handler of a plain HTML page, you would have a simple setup showing that it works in that configuration.
 
Get rid of window when accessing the body of the page:
Code:
HTMLsource=[!][s]window.[/s][/!]document.body.innerHTML;
Additionally, a few things to point out. innerHTML returns the markup inside the tags of whatever element it is called on, in this case the body of the page. If you want to include the body tags themselves, you can use outerHTML (IE only, I believe). Also, when you say you're pulling the source code of the page, you're really only pulling what's inside the body. To pull the whole source of the page you would have to go up one step and take the innerHTML or outerHTML of the <html> tag, not the body tag. Here's the code to do this:
Code:
HTMLsource = document.body.[!]parentNode[/!].outerHTML

-or-

HTMLsource = document.body.[!]parentNode[/!].innerHTML

-kaht
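
A rough sketch of the "go up one step" idea, with a fallback for browsers that lack outerHTML. It assumes an HTML document, where document.documentElement is the same <html> node as document.body.parentNode. The function name getFullSource and the parameter doc are invented for this example, and neither branch includes the doctype line.

```javascript
// `doc` stands in for a DOM document; getFullSource is a made-up name for this sketch.
function getFullSource(doc) {
  // The <html> element; in an HTML page this is the same node as doc.body.parentNode.
  var root = doc.documentElement;
  if (typeof root.outerHTML === "string") {
    return root.outerHTML;
  }
  // Fallback when outerHTML is missing: rebuild the outer tags by hand.
  // Any attributes on <html> are lost in this branch.
  var tag = root.tagName.toLowerCase();
  return "<" + tag + ">" + root.innerHTML + "</" + tag + ">";
}
```

In a page this would be called as getFullSource(document).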

How much you wanna make a bet I can throw a football over them mountains?
 
Thanks kaht and tsuji. I have tried both of your suggestions without success. kaht, you mentioned that the outerHTML call only works in IE? I think the innerHTML call has the same issue, as I am getting the same error message:

'Error: document.body has no properties'

I am using Firefox currently, and I am testing against regular webpages (both including and not including the search term).

 
The problem is somewhere else in your code then, because this works just fine for me in IE and Firefox. Copy/paste it into a new .html file and test for yourself:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
   <script type="text/javascript">
   
   function showCode() {
      alert(document.body.innerHTML);
   }
   
   </script>
</head>
<body>
   <table>
      <tr>
         <td>this</td>
         <td>is</td>
      </tr>
      <tr>
         <td>my</td>
         <td>table</td>
      </tr>
   </table>
   <div>
      <ul>
         <li>one</li>
         <li>two</li>
         <li>three</li>
      </ul>
   </div>
   <p>
      Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Etiam gravida pellentesque nisi. 
      Nulla interdum magna ac ipsum aliquam fermentum. Donec pede tellus, nonummy non, accumsan eu, 
      malesuada id, sapien. Morbi quis ligula. In ultrices enim nec risus. Mauris nisi risus, ornare 
      non, vehicula at, dictum id, nisl. Mauris condimentum. Suspendisse potenti. Sed interdum ante 
      quis odio. Curabitur lacus lectus, molestie vel, scelerisque id, interdum in, sem. Nunc erat. 
      Donec leo. Praesent eleifend. Nulla porttitor risus eget dolor. Cras vehicula. Mauris pharetra 
      hendrerit ipsum. In sollicitudin lorem ac velit. In eleifend ante quis leo. Nunc placerat 
      euismod nibh. Vestibulum sit amet metus quis nunc tempor ullamcorper.
   </p>
   <input type="button" value="show code" onclick="showCode()" />
</body>
</html>

-kaht

 
Thanks for the test code kaht; it works correctly and confirms your reply. The problem for me is that the .js file holding the function is being called from a .xul file (I am making the call from a Firefox extension I am creating), not an HTML file. For some reason the document.body call is not recognized the way it would be from an HTML file.

In the xul file the .js import line is:

<script type="application/x-javascript" src="chrome://testtoolbar/content/testtoolbar.js"/>

and the function is being called using:

<toolbarbutton id="BugCount" label="BugCount" class="chromeclass-toolbar" tooltiptext="BugCount" collapsed="false" oncommand="CountBugs()" />

Perhaps there are some additional requirements for making the document.body call...
 
There is nothing wrong with [tt]window.document.body[/tt]. Any variation on it is just a matter of style.
 
The problem was that it should be content.document.body, not window.document.body.

Thanks for the suggestions!!
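
For later readers, a small sketch of why that fix works, assuming a legacy XUL-based Firefox extension like the one described above. Script loaded from a .xul file runs in the browser's chrome window, so window.document is the XUL document, which has no body; content refers to the window of the page in the selected tab. The helper names getContentDocument and countBugsIn and the parameter win are hypothetical, not part of any Firefox API.

```javascript
// `win` is the window the script runs in (hypothetical parameter name).
function getContentDocument(win) {
  // In browser chrome, win.content is the window of the page in the selected tab.
  if (win.content && win.content.document) {
    return win.content.document;
  }
  // Where no content window exists, fall back to the window's own document.
  return win.document;
}

function countBugsIn(win) {
  var doc = getContentDocument(win);
  if (!doc || !doc.body) {
    return 0; // no body, e.g. a XUL document or a page that has not loaded yet
  }
  // Splitting on the search string yields one more piece than there are matches.
  return doc.body.innerHTML.split("bug.gif").length - 1;
}
```

From the toolbar button this could be wired up as oncommand="alert(countBugsIn(window))".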
 