XML - DOM - Data retrieving from HTML page

Status
Not open for further replies.

KernelObject

Programmer
Jul 25, 2001
65
IN
Hi all,

I am a VC++ programmer.

Currently I am working on a project which needs to extract data from an HTML page.

My initial plan was to write a configurable scripting language in C++. But I found out that XML/DOM can be used in VB to extract data from a table in HTML in a jiffy.

My problem is that I have only very basic knowledge of both XML and VB. The learning curve is daunting, and after spending a couple of weeks on Google and MSDN, I don't seem to be getting anywhere productive.

Has anybody worked on a similar project? Any source available anywhere?

One example of the kind of HTML pages I am working on would be: -> dynamic page received after filling in data for FLIGHT details.

Any advice welcome, thanks in advance.

In the sweat of thy brow shall you eat your bread.
-Bible
 

I believe that what you want to do can be done with the WebBrowser control. If you use this site's advanced search on the forums for "WebBrowser", you should be able to find what you are looking for. It will take some digging through those threads, but it is in there. I also know that there is some code using the XML object floating around somewhere, so you could search on that as well.

Good Luck

 
There are a number of threads on this subject in this forum. As vb5prgrmr says it may take a while to sort through them...so try the following thread, which contains an example of pulling out the URL of the source file in a webpage's IMG tag. This should point you in the right direction (since it looks like you want to pull info from TR and/or TD tags): thread222-644041
 
Thanks strongm and vb5prgrmr.

I was able to load MSHTML and access the text using innerHTML etc.

I want to access a particular table in the HTML by the value of its name attribute... can anyone please tell me how this is done?

CODE IS AS FOLLOWS:

Code:
Private Sub Command1_Click()

Dim doc1 As New MSHTML.HTMLDocument, doc2 As New MSHTML.HTMLDocument
Dim strURL As String
    
strURL = "file:///C:/testHtml/trvlcty001.htm"
Set doc1 = doc2.createDocumentFromUrl(strURL, "null")
    
Do Until doc1.readyState = "complete"
   DoEvents
Loop
       
'modsearch is name of the table in HTML page
'NEXT STATEMENT NOT WORKING
Debug.Print doc1.All.Item(modsearch)   

'working fine
Debug.Print doc1.body.outerText 
Debug.Print doc1.body.outerHTML 

Set doc1 = Nothing
Set doc2 = Nothing

End Sub



 
OK, here is an example of how to drill down from a table tag to a tr to a td and its contents, if they contain any HTML tags...
[tt]
' Requires a project reference to the Microsoft HTML Object Library (MSHTML).
' Sleep is a Win32 API call, so it has to be declared at module level.
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Private Sub Command1_Click()

    Dim doc1 As New MSHTML.HTMLDocument, doc2 As New MSHTML.HTMLDocument
    Dim strURL As String

    Dim TbObjs As Object, TbObj As Object   ' tables
    Dim TRObjs As Object, TRObj As Object   ' rows
    Dim TDObjs As Object, TDObj As Object   ' cells
    Dim TCObjs As Object, TCObj As Object   ' cell contents (a / img)

    ' Same test file as in the original post
    strURL = "file:///C:/testHtml/trvlcty001.htm"
    Set doc1 = doc2.createDocumentFromUrl(strURL, "null")

    ' Wait for the document to finish loading without hogging the CPU
    Do Until doc1.readyState = "complete"
        DoEvents
        Sleep 1
    Loop

    Set TbObjs = doc1.All.tags("table")
    For Each TbObj In TbObjs

        Debug.Print "href=" & " innertext=" & TbObj.innerText & " outertext=" & TbObj.outerText
        Debug.Print "innerhtml=" & TbObj.innerHTML & " outerhtml=" & TbObj.outerHTML

        Set TRObjs = TbObj.All.tags("tr")
        For Each TRObj In TRObjs

            Debug.Print "href=" & " innertext=" & TRObj.innerText & " outertext=" & TRObj.outerText
            Debug.Print "innerhtml=" & TRObj.innerHTML & " outerhtml=" & TRObj.outerHTML

            Set TDObjs = TRObj.All.tags("td")
            For Each TDObj In TDObjs

                Debug.Print "href=" & " innertext=" & TDObj.innerText & " outertext=" & TDObj.outerText
                Debug.Print "innerhtml=" & TDObj.innerHTML & " outerhtml=" & TDObj.outerHTML

                ' Links inside the cell
                Set TCObjs = TDObj.All.tags("a")
                For Each TCObj In TCObjs
                    Debug.Print "href=" & TCObj.href & " innertext=" & TCObj.innerText & " outertext=" & TCObj.outerText
                    Debug.Print "innerhtml=" & TCObj.innerHTML & " outerhtml=" & TCObj.outerHTML
                Next

                ' Images inside the cell
                Set TCObjs = TDObj.All.tags("img")
                For Each TCObj In TCObjs
                    Debug.Print "href=" & TCObj.href & " innertext=" & TCObj.innerText & " outertext=" & TCObj.outerText
                    Debug.Print "innerhtml=" & TCObj.innerHTML & " outerhtml=" & TCObj.outerHTML
                Next

            Next

        Next

    Next

    'modsearch is name of the table in HTML page
    'NEXT STATEMENT NOT WORKING
    'Debug.Print doc1.All.Item(modsearch)

    'working fine
    Debug.Print doc1.body.outerText
    Debug.Print doc1.body.outerHTML

    Set doc1 = Nothing
    Set doc2 = Nothing

End Sub
[/tt]

Good Luck
 
fantastic!

I haven't checked your code yet, will do ASAP (couple of hours). The logic looks like it should work straight off though...

MY DOUBTS:
1. In the HTML page (that I am working with), every table has a name... Is there no way in which I can access a particular table directly through its name?

2. Alternatively, I saw a property called uniqueID (for each node, I guess)... can I use this instead of the 'name'?

Thanks for using ' ;-P


regards

 
Yes, you can access them via name, and yes you could use the UniqueID as well. I sadly don't have time right at the moment to knock an example together.
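For readers following along, direct access by name might look something like the sketch below. It assumes doc1 is the already loaded MSHTML.HTMLDocument from the earlier posts and that the table's id or name is "modsearch". Note that the failing statement in the original post passes modsearch without quotes, so VB appears to treat it as an (empty) variable rather than as the string "modsearch". The variable tbl is introduced here purely for illustration.

```vb
' Sketch only: assumes doc1 is a fully loaded MSHTML.HTMLDocument
' and that the page has a table whose id or name is "modsearch".
Dim tbl As Object

' Direct lookup by id.
Set tbl = doc1.getElementById("modsearch")

' Fall back to the all collection, which matches on name or id.
' The second argument (0) asks for the first match only.
If tbl Is Nothing Then
    Set tbl = doc1.All.Item("modsearch", 0)
End If

If tbl Is Nothing Then
    Debug.Print "No element called modsearch was found"
Else
    Debug.Print tbl.tagName
    Debug.Print tbl.outerHTML
End If
```

As far as I know, uniqueID is generated by the browser when the page is loaded, so it is probably less useful than the name or id for locating a table you already know by name.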
 

I don't see a problem with using the IDs (ID="Table2") or the name. Right off I couldn't tell you exactly how to do that, but it shouldn't be too hard, and I believe you are correct in your assumption about the uniqueID. BUT!!! I must warn you that if the page's code changes or the elements are moved around, that will bust your code, and if the names of the tables change, that will bust your code also.

OK, I got this to work via the enumeration...
[tt]
Set TbObjs = doc1.All.tags("table")
For Each TbObj In TbObjs
    ' Case-insensitive match on the table's id attribute
    If UCase(TbObj.id) = UCase("table2") Then
        DoEvents ' placeholder: work with the matching table here
    End If
    '...
[/tt]
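The enumeration above could be wrapped up along the following lines. This is a rough sketch rather than tested code: FindTable is a hypothetical helper name, and the check against the name attribute through getAttribute is an assumption added because the original question mentions table names rather than ids. Once the table is found, its rows and cells collections give the cell text without a further tags() search.

```vb
' Hypothetical helper: returns the first TABLE whose id (or name attribute)
' matches key, ignoring case, or Nothing if there is no such table.
Private Function FindTable(ByVal doc As MSHTML.HTMLDocument, _
                           ByVal key As String) As Object
    Dim tb As Object
    For Each tb In doc.All.tags("table")
        ' "" & ... turns a Null (attribute missing) into an empty string
        If StrComp("" & tb.id, key, vbTextCompare) = 0 _
           Or StrComp("" & tb.getAttribute("name"), key, vbTextCompare) = 0 Then
            Set FindTable = tb
            Exit Function
        End If
    Next
End Function

' Example use: print the text of every cell in the matching table.
Private Sub DumpTable(ByVal doc As MSHTML.HTMLDocument)
    Dim tbl As Object
    Dim r As Long, c As Long

    Set tbl = FindTable(doc, "table2")
    If tbl Is Nothing Then Exit Sub

    For r = 0 To tbl.rows.length - 1
        For c = 0 To tbl.rows.Item(r).cells.length - 1
            Debug.Print r, c, tbl.rows.Item(r).cells.Item(c).innerText
        Next c
    Next r
End Sub
```

Calling DumpTable doc1 after the readyState loop would then print one line per cell. The earlier warning still applies: if the page renames the table, the lookup simply returns Nothing.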


Good Luck

 