RegEx HTML Parsing

RLB2 · Jan 3, 2005

Hi, I have spent many hours on Tek-Tips and other websites learning about regex. Thanks!

I have basically 2 types of data that I want to parse out of an html (txt) file.

String 1: <title>1965 Corvette for sale</title>
String 2: <meta name="Keywords" content="Corvette, 1965, Cool Car">

Expected Result 1: 1965 Corvette for sale
Expected Result 2: Corvette, 1965, Cool Car

Actual Result 1: <title>1965 Corvette for sale
Actual Result 2: <meta name="Keywords" content="Corvette, 1965, Cool Car

Regexp used on 1: <title>[^<]+(?=</title>)
Regexp used on 2: <meta name="Keywords" content="[^<]+(?=">)

So some strings end with a closing tag and some do not. As you can see, the beginning of the match is always kept.

PROBLEM: The only problem I am having is that I can't get rid of the beginning of the match (i.e. <title> above). I have tried many variations on the regexp itself to no avail. Any help is much appreciated:

CODE:
Set fso = New FileSystemObject
Set tsMyFile = fso.OpenTextFile(PUBTxtInputFile, ForReading)
Do Until tsMyFile.AtEndOfStream
  Set re = New RegExp
  With New RegExp
    .Global = True
    .MultiLine = True
    .IgnoreCase = True
    .Pattern = "<title>[^<]+(?=</title>)"
    For Each myMatch In .Execute(tsMyFile.ReadLine)
      PUBScrapedText = myMatch.Value
    Next
  End With
  DoEvents
Loop

'PUBScrapedText returns the output (ie <title>1965 Corvette for sale) that I save in a table in the db.

VBslammer · Jan 3, 2005

I would use the tools that are made for Web document parsing, such as the "Microsoft HTML Object Library." If you reference this library in your project, then you can use IE to do the parsing for you. If you're already familiar with DOM, it will be simple:

Code:

Sub ParseHTML(ByVal strFile As String)
On Error GoTo ErrHandler

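  ' InternetExplorer also needs a reference to "Microsoft Internet Controls"
  Dim ie As New InternetExplorer
  Dim webpage As HTMLDocument
  Dim item As Object  ' late-bound: webpage.all returns elements of many types
       
  ie.Navigate strFile
   
  ' wait until the page has finished loading (4 = READYSTATE_COMPLETE)
  Do While ie.Busy Or ie.ReadyState <> 4
    DoEvents
  Loop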
    
  Set webpage = ie.Document
  
  Debug.Print webpage.Title
  
  For Each item In webpage.all
    If item.nodeName = "META" Then
      Debug.Print item.Content
    End If
  Next item
  
ExitHere:
  On Error Resume Next
  ie.Quit
  Set webpage = Nothing
  Set ie = Nothing
  Exit Sub
ErrHandler:
  Debug.Print Err, Err.Description
  Resume ExitHere
End Sub
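
If only one particular META tag is wanted, the loop over webpage.all can be narrowed. The following is a minimal sketch, not a tested drop-in: it assumes doc is an already loaded document such as ie.Document from the routine above, and GetMetaContent is a made-up helper name used only for illustration.

```vb
' Returns the content attribute of the first <meta> tag whose name matches
' metaName (case-insensitive), or "" if no such tag is found.
' Assumption: doc is an already loaded MSHTML document, e.g. ie.Document.
' GetMetaContent is a hypothetical helper name, not part of any library.
Function GetMetaContent(ByVal doc As Object, ByVal metaName As String) As String
  Dim metas As Object
  Dim el As Object

  ' ask the DOM for the <meta> elements only, instead of walking every element
  Set metas = doc.getElementsByTagName("meta")

  For Each el In metas
    ' & "" guards against a meta tag that has no name attribute
    If StrComp(el.Name & "", metaName, vbTextCompare) = 0 Then
      GetMetaContent = el.Content & ""
      Exit Function
    End If
  Next el
End Function
```

Inside ParseHTML it could be called as Debug.Print GetMetaContent(webpage, "Keywords").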

VBSlammer

Unemployed in Houston, Texas

RoyVidar · Jan 3, 2005

I think, if this is a line by line approach, something like this can be used:

[tt]Set fso = New FileSystemObject
Set tsMyFile = fso.OpenTextFile(PUBTxtInputFile, ForReading)
Set re = New RegExp
With re
  .Global = True
  .MultiLine = True
  .IgnoreCase = True
  .Pattern = "<title>(.*)</title>"
  Do Until tsMyFile.AtEndOfStream
    strIn = tsMyFile.ReadLine
    If .Test(strIn) Then
      PUBScrapedText = .Replace(strIn, "$1")
    End If
  Loop
End With[/tt]

Similarly, the next pattern could be something like this:

[tt]"<meta name=""Keywords"" content=""(.*)"">"[/tt]
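
A line-by-line pattern only matches when the whole tag sits on a single line. If a tag might wrap across lines, one option is to read the file in one go and use [\s\S] in place of the dot, because the dot in the VBScript regex engine does not match a line break. This is a rough sketch under assumptions: it needs references to "Microsoft Scripting Runtime" and "Microsoft VBScript Regular Expressions 5.5", and TitleFromFile is a made-up name for illustration.

```vb
' Returns the text between <title> and </title>, even if it spans lines,
' or "" if the file has no title.
' TitleFromFile is a hypothetical helper name, not part of any library.
Function TitleFromFile(ByVal strFile As String) As String
  Dim fso As FileSystemObject
  Dim ts As TextStream
  Dim re As RegExp
  Dim matches As MatchCollection
  Dim strAll As String

  Set fso = New FileSystemObject
  Set ts = fso.OpenTextFile(strFile, ForReading)
  ' ReadAll raises an error on an empty file, so check first
  If Not ts.AtEndOfStream Then strAll = ts.ReadAll
  ts.Close

  Set re = New RegExp
  re.IgnoreCase = True
  re.Global = False   ' only the first title is wanted
  ' [\s\S]*? matches any characters, line breaks included, as few as possible
  re.Pattern = "<title>([\s\S]*?)</title>"

  Set matches = re.Execute(strAll)
  If matches.Count > 0 Then
    ' SubMatches(0) is the text captured by the parentheses
    TitleFromFile = Trim$(matches(0).SubMatches(0))
  End If
End Function
```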

Roy-Vidar

VBslammer · Jan 4, 2005

I played with the Regex approach a bit and had good results with this:

Code:

Sub RegexHTML(ByVal strFile As String)
On Error GoTo ErrHandler

  Dim fso As FileSystemObject
  Dim tsMyFile As TextStream
  Dim re As RegExp
  Dim strIn As String

  Set fso = New FileSystemObject
  Set tsMyFile = fso.OpenTextFile(strFile, ForReading)
  Set re = New RegExp
  
  With re
    
    [green]'set attributes[/green]
    .Global = True
    .MultiLine = True
    .IgnoreCase = True
  
    Do Until tsMyFile.AtEndOfStream
      
      [green]'read a line[/green]
      strIn = tsMyFile.ReadLine
    
      [green]'get the title[/green]
      .Pattern = "<title>(.*)</title>"
      If .Test(strIn) = True Then
        [green]'replace following line with database storage code[/green]
        Debug.Print "TITLE: " & .Replace(strIn, "$1")
      End If
      
      [green]'get the meta content[/green]
      .Pattern = "<meta name=""Keywords"" content=""(.*)"">"
      If .Test(strIn) = True Then
        [green]'replace following line with database storage code[/green]
        Debug.Print "META Content: " & .Replace(strIn, "$1")
      End If

    Loop
  End With
  
ExitHere:
  On Error Resume Next
  tsMyFile.Close
  Set tsMyFile = Nothing
  Set fso = Nothing
  Set re = Nothing
  Exit Sub
ErrHandler:
  Debug.Print Err, Err.Description
  Resume ExitHere
End Sub

I still prefer manipulating the DOM using HTML objects, since it is similar to working with XML, which all programmers will have to deal with eventually. It's here to stay.
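
For reference, the original lookahead patterns keep the leading <title> because a match's Value holds everything the pattern consumed; the lookahead only keeps the closing tag out of it. The VBScript regex engine has no lookbehind, so a capture group is the usual way to drop the prefix. Besides .Replace with "$1", the captured text can be read directly from SubMatches. A minimal sketch, assuming a reference to "Microsoft VBScript Regular Expressions 5.5"; FirstSubMatch and DemoFirstSubMatch are made-up names for illustration.

```vb
' Returns the first capture group of the first match, or "" if nothing matches.
' The pattern must contain at least one pair of parentheses.
' FirstSubMatch is a hypothetical helper name, not part of any library.
Function FirstSubMatch(ByVal strIn As String, ByVal strPattern As String) As String
  Dim re As RegExp
  Dim matches As MatchCollection

  Set re = New RegExp
  re.IgnoreCase = True
  re.Global = False   ' only the first match is needed
  re.Pattern = strPattern

  Set matches = re.Execute(strIn)
  If matches.Count > 0 Then
    ' SubMatches(0) holds only the text inside the first parentheses
    FirstSubMatch = matches(0).SubMatches(0)
  End If
End Function

Sub DemoFirstSubMatch()
  ' prints: 1965 Corvette for sale
  Debug.Print FirstSubMatch("<title>1965 Corvette for sale</title>", _
                            "<title>([^<]+)</title>")
  ' prints: Corvette, 1965, Cool Car
  Debug.Print FirstSubMatch("<meta name=""Keywords"" content=""Corvette, 1965, Cool Car"">", _
                            "<meta name=""Keywords"" content=""([^""]*)""")
End Sub
```

Unlike .Replace(strIn, "$1"), which returns the whole line with the matched part swapped for the group, this returns only the captured text, so anything else on the same line is left out.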

VBSlammer

Unemployed in Houston, Texas
