RegEx HTML Parsing

Status
Not open for further replies.

RLB2

Programmer
Nov 29, 2002
14
US
Hi, I have spent many hours on Tek-Tips and other websites learning about regex. Thanks!

I have basically 2 types of data that I want to parse out of an html (txt) file.

string 1: <title>1965 Corvette for sale</title>
String 2: <meta name="Keywords" content="Corvette, 1965, Cool Car">

Expected Result 1: 1965 Corvette for sale
Expected Result 2: Corvette, 1965, Cool Car

Actual Result 1: <title>1965 Corvette for sale
Actual Result 2: <meta name="Keywords" content="Corvette, 1965, Cool Car

Regexp used on 1: <title>[^<]+(?=</title>)
Regexp used on 2: <meta name="Keywords" content="[^<]+(?=">)

So some strings have closing tags and some don't. As you can see, the beginning of the match is always kept in the result.

PROBLEM: The only :) problem I am having is that I can't get rid of the beginning of the match (i.e. <title> above).
I have tried many variations on the regexp itself, to no avail. Any help is much appreciated:

CODE:
Set fso = New FileSystemObject
Set tsMyFile = fso.OpenTextFile(PUBTxtInputFile, ForReading)
Do Until tsMyFile.AtEndOfStream
    Set re = New RegExp
    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = "<title>[^<]+(?=</title>)"
        For Each myMatch In .Execute(tsMyFile.ReadLine)
            PUBScrapedText = myMatch.Value
        Next
    End With
    DoEvents
Loop

'PUBScrapedText returns the output (i.e. <title>1965 Corvette for sale), which I save in a table in the db.
 
I would use the tools that are made for Web document parsing, such as the "Microsoft HTML Object Library." If you reference this library in your project (along with "Microsoft Internet Controls" for the InternetExplorer object), you can let IE do the parsing for you. If you're already familiar with the DOM, it will be simple:
Code:
Sub ParseHTML(ByVal strFile As String)
On Error GoTo ErrHandler

  Dim ie As New InternetExplorer
  Dim webpage As HTMLDocument
  Dim item As Object  'webpage.all holds every element type, not just <html>
       
  ie.Navigate strFile
   
  'wait until the page has finished loading
  Do While ie.Busy Or ie.ReadyState <> READYSTATE_COMPLETE
    DoEvents
  Loop
    
  Set webpage = ie.Document
  
  Debug.Print webpage.Title
  
  For Each item In webpage.all
    If item.nodeName = "META" Then
      Debug.Print item.Content
    End If
  Next item
  
ExitHere:
  On Error Resume Next
  ie.Quit
  Set webpage = Nothing
  Set ie = Nothing
  Exit Sub
ErrHandler:
  Debug.Print Err, Err.Description
  Resume ExitHere
End Sub
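
The loop above prints the content of every META tag, while the original question only asks for the Keywords one. A minimal sketch of narrowing it down, assuming the document has already been loaded as in the Sub above; GetMetaContent is a hypothetical helper name, not part of the HTML Object Library:

```vba
' GetMetaContent is a hypothetical helper. doc is assumed to be a loaded
' HTMLDocument, for example the webpage variable from ParseHTML above.
Function GetMetaContent(ByVal doc As Object, ByVal metaName As String) As String
    Dim el As Object

    ' Walk only the <meta> elements instead of every element on the page
    For Each el In doc.getElementsByTagName("meta")
        ' Compare case-insensitively, since pages write Keywords, keywords, etc.
        If StrComp(el.Name, metaName, vbTextCompare) = 0 Then
            GetMetaContent = el.Content
            Exit Function
        End If
    Next el
    ' Falls through with an empty string when no matching tag is found
End Function
```

Inside ParseHTML it could be called as Debug.Print GetMetaContent(webpage, "Keywords").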

VBSlammer

Unemployed in Houston, Texas
 
I think, if this is a line by line approach, something like this can be used:

[tt]Set fso = New FileSystemObject
Set tsMyFile = fso.OpenTextFile(PUBTxtInputFile, ForReading)
Set re = New RegExp
With re
    .Global = True
    .MultiLine = True
    .IgnoreCase = True
    .Pattern = "<title>(.*)</title>"
    Do Until tsMyFile.AtEndOfStream
        strIn = tsMyFile.ReadLine
        If .Test(strIn) Then
            PUBScrapedText = .Replace(strIn, "$1")
        End If
    Loop
End With[/tt]

Similarly, the next pattern could be something like this:

[tt]"<meta name=""Keywords"" content=""(.*)"">"[/tt]
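
For what it's worth, the reason the original lookahead pattern keeps the opening tag is that (?=...) only excludes what follows the match. The VBScript RegExp engine has no lookbehind to exclude what precedes it, so a capture group is the usual way out. A rough sketch of reading the group directly through SubMatches instead of using Replace; ExtractFirstGroup and Demo are hypothetical names, and the code assumes a reference to "Microsoft VBScript Regular Expressions 5.5":

```vba
' ExtractFirstGroup is a hypothetical helper, not from the posts above.
' Returns the text captured by the first (...) group, or "" if no match.
Function ExtractFirstGroup(ByVal strIn As String, ByVal strPattern As String) As String
    Dim re As RegExp
    Dim matches As MatchCollection

    Set re = New RegExp
    re.Global = False        ' the first match on the line is enough
    re.IgnoreCase = True
    re.Pattern = strPattern

    Set matches = re.Execute(strIn)
    If matches.Count > 0 Then
        ' SubMatches(0) holds only the captured text, without the tags
        ExtractFirstGroup = matches(0).SubMatches(0)
    End If
End Function

Sub Demo()
    ' Prints: 1965 Corvette for sale
    Debug.Print ExtractFirstGroup("<title>1965 Corvette for sale</title>", _
                                  "<title>([^<]+)</title>")
    ' Prints: Corvette, 1965, Cool Car
    Debug.Print ExtractFirstGroup("<meta name=""Keywords"" content=""Corvette, 1965, Cool Car"">", _
                                  "<meta name=""Keywords"" content=""([^""]*)""")
End Sub
```

Unlike Replace with "$1", this returns only the captured text even when the line contains other markup before or after the tag.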

Roy-Vidar
 
I played with the Regex approach a bit and had good results with this:
Code:
Sub RegexHTML(ByVal strFile As String)
On Error GoTo ErrHandler

  Dim fso As FileSystemObject
  Dim tsMyFile As TextStream
  Dim re As RegExp
  Dim strIn As String

  Set fso = New FileSystemObject
  Set tsMyFile = fso.OpenTextFile(strFile, ForReading)
  Set re = New RegExp
  
  With re
    
    'set attributes
    .Global = True
    .MultiLine = True
    .IgnoreCase = True
  
    Do Until tsMyFile.AtEndOfStream
      
      'read a line
      strIn = tsMyFile.ReadLine
    
      'get the title
      .Pattern = "<title>(.*)</title>"
      If .Test(strIn) = True Then
        'replace following line with database storage code
        Debug.Print "TITLE: " & .Replace(strIn, "$1")
      End If
      
      'get the meta content
      .Pattern = "<meta name=""Keywords"" content=""(.*)"">"
      If .Test(strIn) = True Then
        'replace following line with database storage code
        Debug.Print "META Content: " & .Replace(strIn, "$1")
      End If

    Loop
  End With
  
ExitHere:
  On Error Resume Next
  tsMyFile.Close
  Set tsMyFile = Nothing
  Set fso = Nothing
  Set re = Nothing
  Exit Sub
ErrHandler:
  Debug.Print Err, Err.Description
  Resume ExitHere
End Sub
I still prefer manipulating the DOM using HTML objects, since it is similar to working with XML, which all programmers will have to deal with eventually. It's here to stay.

VBSlammer
 