RegEx Help 1

Status
Not open for further replies.

DreXor

Programmer
Jun 17, 2003
2,224
US
I have a billing information extractor that works from a general text file. The problem is that occasionally there are 2 lines of address, and sometimes there are 2 batches of address information.
THE DATA IS FIXED WIDTH. i tried originally to use this to my benefit, but it turned into a multi-array nightmare.

current issue(s) :
1.some records have 2 address lines
2.billing vs shipping address (shipping always on RIGHT)
3.combination of either or both 1 & 2

** note, tax line (text) may not always be present, but the line holder is always there.
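
Since the data is fixed width and the shipping block is always on the right, one option is to cut each line at a fixed column before any pattern matching, so the left (billing) and right (shipping) blocks can be handled separately. This is a minimal sketch in Python rather than the VBScript used later in this thread. `RIGHT_COL = 56` and the helper name `split_blocks` are assumptions for illustration; the real column has to be measured from the actual file.

```python
RIGHT_COL = 56  # hypothetical: column where the right-hand (shipping) block starts

def split_blocks(line, right_col=RIGHT_COL):
    """Return (left, right) text of one fixed-width line, stripped of padding."""
    padded = line.rstrip("\r\n").ljust(right_col)
    return padded[:right_col].strip(), padded[right_col:].strip()

lines = [
    "      JIM JONES".ljust(56) + "JAMES JONES             EMIXED",
    "      30 MAPLE LAWN DR".ljust(56) + "30 MAPLE LAWN DRIVE",
    " " * 56 + "STEVE SMITH             DLOCAL",
]
for line in lines:
    print(split_blocks(line))
```

A record with only one address simply comes back with an empty left half, so the "one block or two" cases need no separate handling at this stage.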

here's the pattern i'm using ( works wonderfully for regular items; line breaks have been added here for ease of viewing and need to be removed before use ):
Code:
(\r\n[^\r\n\d]*\d\s+(\b\d+?.?\d{2}\b))
(?:[\r\n])
(?:\s*TAX\s+)?(\b\d*.\d{2}\b)?
(?:[\r\n]\s*)
(?:[\r\n]\s*)
(?:[\r\n]\s*)
(?:[\r\n]\s+(\d+?.?\d{2}))
([\s|\S]*?)
(\d{6})
(?:[\r\n]\s*)
([^\s\r\n]+( [^\s\r\n]+)*)(?:[^\r\n]*)
(?:[\r\n]\s*)
([^\r\n]*)
(?:[\r\n]\s*)
([^\r\n]*)\s([a-z]{2})\s(\d{5}(-\d{4})?)
currently this works fine for the middle sample, with and without tax value. yet fails on the other 2 samples.
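
One thing worth checking in the pattern above: the `.` in `\d+?.?\d{2}` is unescaped, so it matches any character rather than only a decimal point, and because it is also optional the price groups can match a plain run of digits. The small check below uses Python's `re` purely for illustration (the thread's code uses the VBScript RegExp object) and compares the loose form with an escaped `\d+\.\d{2}`.

```python
import re

loose = re.compile(r"\d+?.?\d{2}")    # "." is unescaped: any character, and optional
strict = re.compile(r"\d+\.\d{2}")    # a literal decimal point is required

for text in ["45.00", "123456", "45x00"]:
    print(text, bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
```

The loose form accepts all three strings, including the six-digit id, while the strict form accepts only `45.00`.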


items needed are in bold in the sample text: ( formatting gets chewed up unfortunately in code brackets, might need to view source )
Code:
MOUNT FIRST NOTICES
                                                                         1ST NOTICE
     123456        06 24 04                          07 15 04
                                         52
                                         WEEKS
    LOCAL NEWS          LOCAL NEWS    1   [HIGHLIGHT]45.00[/HIGHLIGHT]





                                           [HIGHLIGHT]45.00[/HIGHLIGHT]
         1ST NOTICE


                                                        [HIGHLIGHT]123456[/HIGHLIGHT]
      JIM JONES                                         [HIGHLIGHT]JAMES JONES[/HIGHLIGHT]             EMIXED
      30 MAPLE LAWN DR                                  [HIGHLIGHT]30 MAPLE LAWN DRIVE[/HIGHLIGHT]
      SANTE FE VT 12345-1234                            [HIGHLIGHT]SANTE  FE VT 12345[/HIGHLIGHT]


                                                                         2ND NOTICE
     654321        06 24 04                          05 27 04
                                         52
                                         WEEKS
    LOCAL EXPLORER      LOCAL EXP.    1   [HIGHLIGHT]30.00[/HIGHLIGHT]

                           TAX              [HIGHLIGHT]2.19[/HIGHLIGHT]



                                           [HIGHLIGHT]32.19[/HIGHLIGHT]
         PLEASE DISREGARD THIS REMINDER NOTICE IF YOU HAVE
         ALREADY MAILED PAYMENT.   THANK YOU.

                                                        [HIGHLIGHT]654321[/HIGHLIGHT]
                                                        [HIGHLIGHT]STEVE SMITH[/HIGHLIGHT]             DLOCAL
                                                        [HIGHLIGHT]17215 TOWER RD[/HIGHLIGHT]
                                                        [HIGHLIGHT]PARIS DE 12345[/HIGHLIGHT]


                                                                         1ST NOTICE
     645123        06 24 04                          07 15 04
                                         52
                                         WEEKS
    LOCAL EXPLORER      LOCAL EXP.    1   [HIGHLIGHT]30.00[/HIGHLIGHT]

                           TAX              [HIGHLIGHT]2.26[/HIGHLIGHT]



                                           [HIGHLIGHT]32.26[/HIGHLIGHT]
         1ST NOTICE


                                                        [HIGHLIGHT]645123[/HIGHLIGHT]
                                                        [HIGHLIGHT]SUSAN DOE[/HIGHLIGHT]               DLOCAL
                                                        [HIGHLIGHT]NWCC LIBRARY BUILDING[/HIGHLIGHT]
                                                        [HIGHLIGHT]1234 MAIN ST[/HIGHLIGHT]   
                                                        [HIGHLIGHT]KANSAS CITY MO 1234[/HIGHLIGHT]

[thumbsup2]DreX
aKa - Robert
 
is there any way to give us a formatted text file (with the line breaks intact)? i am unable to make out the layout of the file.

just one record would do...


Known is handfull, Unknown is worldfull
 
actually i just tested it, copy inside the code box, paste to notepad (font terminal or other fixed width font) and the formatting lays out properly

[thumbsup2]DreX
aKa - Robert
 
i tried it, the only record i could understand was

JIM JONES JAMES JONES EMIXED

am i correct in assuming that this is one record with two different addresses under 2 persons' names?

Known is handfull, Unknown is worldfull
 
as i tried to note, a fixed width font will make a world of difference when trying to look at this data. i appreciate the effort vbkris, perhaps i could make up some type of layout map

[thumbsup2]DreX
aKa - Robert
 
yes, the addresses can vary as well as the names. for example, the son is at college and dad is paying for the subscription: the son is the active / delivery address, dad is the billing address.


[thumbsup2]DreX
aKa - Robert
 
also, for those who are like me ... regex deficient [lol], here's some code to help in testing ...

page is broken into segments...

first is the source file: non-modifiable, not designed for paste-in, to ensure your origin data is unchanging. it is filled in by typing the filename in the Using: box and clicking Select File. nice part is that you can place multiple sample files in the folder.

second level is your pattern expression. i typically break my segments of pattern into separate lines so that it's easier to view; on test the script will remove the line breaks to make regex happy.

third level is your pattern return ( $1, $2 etc )

fourth is additional header information to add to the beginning of the output

lastly is the output area. fileout default is csv, and filein default is set to work with .txt

all sections are stored in external text files for storage, and ease of copy/paste later into your actual application and are linked directly from the page for download/view.

on page load or file load the page will display the contents of the last successful run: pattern/return/header/output (matched to the selected input file). on first load it will only prefill the mid sections.
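
The test cycle described above (join the multi-line pattern, run it globally, format each match with the return string) can be sketched in a few lines. This is an illustration in Python, not the ASP page itself. `join_pattern`, `to_template` and `run_test` are hypothetical names, and the `$1`-style references are translated to Python's `\g<1>` form because Python's `re` does not use `$n`.

```python
import re

def join_pattern(pattern_text):
    """Remove the line breaks that were added to the pattern for readability."""
    return pattern_text.replace("\r\n", "").replace("\n", "")

def to_template(returns):
    """Translate "$1,$2" style references into Python's "\\g<1>,\\g<2>" form."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", returns)

def run_test(pattern_text, returns, data):
    """Return one numbered CSV line per match, like the page's output area."""
    rx = re.compile(join_pattern(pattern_text), re.IGNORECASE)
    template = to_template(returns)
    return [f"{n},{m.expand(template)}" for n, m in enumerate(rx.finditer(data), 1)]

pattern = "(\\d{6})\n\\s+\n(\\w+)"   # two capture groups, split over three lines
print(run_test(pattern, "$1,$2", "123456  JIM\n654321  SUE"))
```

Keeping the pattern, the return string and the data as three separate inputs is what makes it easy to rerun the same pattern against several sample files.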


the output file has the same name as the input but with a .csv extension. beware: if the input is a .csv, it will overwrite the input file.
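
The overwrite risk comes from deriving the output name by swapping the extension. A small guard avoids it. The sketch below is Python with a hypothetical helper name `output_name`, and the `.out.csv` fallback suffix is an assumption, not something the page does.

```python
import os

def output_name(input_name):
    """Swap the extension for .csv, but never return the input name itself."""
    base, ext = os.path.splitext(input_name)
    if ext.lower() == ".csv":
        return base + ".out.csv"   # assumed fallback so the input is not overwritten
    return base + ".csv"

print(output_name("sample1.txt"))
print(output_name("sample1.csv"))
```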

hopefully this might be of benefit to someone, and i will probably repost this as a tip thread.

Code:
<%
Response.Expires = 0
Server.ScriptTimeout=60 ' to avoid bad patterns from locking down web services too long
%>
<html>
<head>
<STYLE TYPE="text/css">
<!--
BODY {font :14px arial,verdana,sans-serif; }
FORM {margin-top: 1px; 
margin-bottom:1px}
// -->
</style>
</head>
<body>
<table><tr><td>
<%
Set FS = CreateObject("Scripting.FileSystemObject")
If Request("FileLoc") <> "" Then 
  Session("FileLoc") = Request("FileLoc")
  FileLoc = Request("FileLoc")
ElseIf Session("FileLoc") <> "" Then
  FileLoc = Session("FileLoc")
End If

'this is to strip input outside current folder, to ensure the fileloaded/displayed is only in current path
'InStrRev finds the LAST separator, so a nested path such as "a/b/c.txt" is reduced to "c.txt"
If InStr(FileLoc,"/") > 0 Then FileLoc = Mid(FileLoc,InStrRev(FileLoc,"/")+1)
If InStr(FileLoc,"\") > 0 Then FileLoc = Mid(FileLoc,InStrRev(FileLoc,"\")+1)
If FileLoc <> "" Then
 SFileLoc = Server.MapPath(FileLoc)
End If

PatternFile = Server.MapPath("Patterns.txt")
ReturnFile = Server.MapPath("Returns.txt")
HeaderFile = Server.MapPath("Headers.txt")
DLFileLoc = Left(FileLoc,InstrRev(FileLoc,".")) & "csv"

Path = Request.ServerVariables("URL")
If instr(Path,"/") > 0 Then Char = "/"
Path = Server.Mappath(Left(Path,InstrRev(Path,Char)))

' for writing out the incoming data before anything is run to store current settings
If Request.Form <> "" AND request("action") = "TEST" then
  If Request("Pattern") <> "" Then
    Set F = FS.CreateTextFile(PatternFile,true)
    F.Write Request("Pattern")
    F.Close
  End If
  If Request("Returns") <> "" Then
    Set F = FS.CreateTextFile(ReturnFile,true)
    F.Write Request("Returns")
    F.Close
  End If
  Set F = FS.CreateTextFile(HeaderFile,true)
  F.Write Request("Headers")
  F.Close
End If
%>
Using : <%=Server.HTMLEncode(FileLoc)%><br>
<form method="post">
<table width="90%" border="0">
     <tr>
          <td><input type="Text" name="fileLoc" width="45" value="<%=Server.HTMLEncode(FileLoc)%>"><input type="submit" value="Select File"></td>
     </tr>
</table>
</form>
</td></tr>
<tr><td>
<input type="hidden" name="fileLoc" value="<%=Server.HTMLEncode(FileLoc)%>">
<b>SourceView:</b><br>
<textarea name="fileContent" Cols="100" Rows="20" wrap="OFF"><%
  If SFileLoc <> "" then
    If FS.FileExists(SFileLoc) then
	Set F = FS.OpenTextFile(SFileLoc)
	Response.Write Server.HTMLEncode(F.ReadAll)
        F.Close
    End If
  End If
%></textarea><br>
<form method="post" name="theform">
<b>Pattern View:</b><a href="patterns.txt">Fetch Pattern File</a><br>
<textarea name="Pattern" cols="100" rows="15" wrap="OFF"><%
  If PatternFile <> "" then
    If FS.FileExists(PatternFile) then
	Set F = FS.OpenTextFile(PatternFile)
	Response.Write Server.HTMLEncode(F.ReadAll)
        F.Close
    End If
  End If
%></textarea><br>
<b>Return Pattern Matches:</b><a href="returns.txt">Fetch Returns File</a><br>
<input type="text" name="returns" wrap="OFF" value="<%
  If ReturnFile <> "" then
    If FS.FileExists(ReturnFile) then
	Set F = FS.OpenTextFile(ReturnFile)
	Response.Write Server.HTMLEncode(F.Readall)
        F.Close
    End If
  End If
%>" Size="125"><br>
<b>Headers?:</b><a href="Headers.txt">Fetch Headers File</a><br>
<input type="text" name="Headers" wrap="OFF" value="<%
  If HeaderFile <> "" then
    If FS.FileExists(HeaderFile) then
      Set F = FS.OpenTextFile(HeaderFile)
      If Not f.atendofstream then
        Response.Write Server.HTMLEncode(F.Readall)
      End If
      F.Close
    End If
  End If
%>" Size="125"><br>
<input type="submit" name="action" value="TEST"><br>
<a name="A" id="A"></a><b>OutPut:</b><a href="<%=Server.HTMLEncode(DLFileLoc)%>">Fetch OutPut File</a><br>
<%
If SFileLoc <> "" Then
  OSFileLoc = Left(SFileLoc,InstrRev(SFileLoc,".")) & "csv"
  If FS.FileExists(SFileLoc) AND request("Action") = "TEST" Then
    Set inFile = FS.OpenTextFile(SFileLoc,1)
    Set outFile = FS.CreateTextFile(OSFileLoc,1)
    Set hdrFile = FS.OpenTextFile(Server.Mappath("headers.txt"),1)
    Set ptnFile = FS.OpenTextFile(Server.Mappath("patterns.txt"),1)
    If Not hdrFile.AtEndOfStream Then
      outFile.Write hdrFile.ReadAll & vbcrlf
    End If
    hdrFile.Close
  
    Dim inContents(0)
    inContents(0) = inFile.ReadAll
    inFile.Close
  
    Dim objRegExp
    Set objRegExp = New RegExp
    objRegExp.Global = True		'find all matches, not just first
    objRegExp.IgnoreCase = True	'ignore character case in pattern/text
    objRegExp.pattern = Replace(Replace(ptnFile.readall,vbcrlf,""),vblf,"")
  ptnfile.close
  
    Dim matches, match
    Set matches = objRegExp.Execute(inContents(0))
  
    ' ReturnFile (not "ReturnsFile") is the variable set near the top of the page
    If FS.FileExists(ReturnFile) Then 
      Set F = FS.OpenTextFile(ReturnFile,1)
      MatchStr = Replace(Replace(F.ReadAll,vbcrlf,""),vblf,"")
      F.Close
    Else
      Matchstr = Request("returns")
    End If
  
  Counter = 0
  
    For Each match in matches
      Counter = Counter+1
      outFile.Write Counter & "," & objRegExp.Replace(match.Value,MatchStr) & vbCrLf
    Next
  
    outFile.Close
  
    Set inFile = Nothing
    Set hdrFile = Nothing
    Set matches = Nothing
    Set objRegExp = Nothing
  
  End If
  ' check for the output file itself; it does not exist until the first successful TEST run
  If FS.FileExists(OSFileLoc) Then
    Set outFile = FS.OpenTextFile(OSFileLoc,1)
    If Not outFile.atendofstream then
      response.write "<xmp>" & outFile.readall & "</xmp>"
    End If
    outFile.Close
    Set outFile = Nothing
  End If
End If
Set F = nothing
Set FS = nothing
%>
</td>
</tr>
<tr><td></form></td></tr>
</table>
<script language="javascript">
  document.getElementById('A').scrollIntoView(); 
</script>
</body>
</html>

<%
Function ChkArray(Values,Value,Delim) 'returns true/false on a comparitive set
    If ISArray(Values) Then
        ChkArrayArr = Values
    Else
        ChkArrayArr = Split(Values,Delim)
    End If
    ChkArray = False
    For ChkArrayArrCounter=0 to Ubound(ChkArrayArr)
        If StrComp(ChkArrayArr(ChkArrayArrCounter),Value,vbTextCompare)=0 Then
            ChkArray = True
        End If
    Next
End Function
%>
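
The page's guard against reading files outside the current folder is meant to keep only the file name itself, dropping anything before the last path separator. As a rough illustration of that idea in Python (the helper name `bare_filename` is hypothetical, and this is not a complete defence against hostile input):

```python
import re

def bare_filename(name):
    """Keep only the text after the last "/" or "\\" in a user-supplied name."""
    return re.split(r"[\\/]", name)[-1]

print(bare_filename("a/b/c.txt"))
print(bare_filename("..\\..\\boot.ini"))
```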

[thumbsup2]DreX
aKa - Robert
 
sorry,
couldn't download the file. i love regexps, so i want to give this a try...

Known is handfull, Unknown is worldfull
 
sorry web server is shut down at night :(

[thumbsup2]DreX
aKa - Robert
 
i'll keep the server up all weekend though, sorry again, could really use the help. :(

[thumbsup2]DreX
aKa - Robert
 
Got it, but it ain't pretty:
Regular Expression
Code:
(\d+\.\d{2})
([\n\r\s]+)
(TAX\s+)?(\d+\.\d{2})?
([\n\r\s]+)
(\d+\.\d{2})
((\s{11}(\w+\.? {0,3})+)+\s+)
(\d{6})
(\s+[\n\r]\s{6}([^\s]+\s)*\s+)
([^\s\r\n]+( [^\s\r\n]+)*)
([^\n\r]+[\n\r]\s{6}([^\s]+\s)*\s+)
((\w+(\s\w+)*)
([^\n\r]*[\n\r]\s{6}([^\s]+\s)*\s+))?
(\w+(\s\w+)*)
([^\n\r]*[\n\r]\s{6}([^\s]+\s)*\s+)
(\w+(\s\w+)*)(\s+)([A-Z]{2})(\s+)(\d{5}(-\d{4})?)

Breakdown of Regular Expression
Group 1: (\d+\.\d{2}) - Initial Price
Group 2: ([\n\r\s]+) - whitespace, linebreaks
Group 3/4: (TAX\s+)?(\d+\.\d{2})? - optional TAX line with price (group 4)
Group 5: ([\n\r\s]+) - whitespace, carriage returns, linefeeds
Group 6: (\d+\.\d{2}) - total price
Group 7/8/9: ((\s{11}(\w+\.? {0,3})+)+\s+) - covers NOTICE and that 2 line paragraph
Group 10: (\d{6}) - 6 number id
Group 11/12: (\s+[\n\r]\s{6}([^\s]+\s)*\s+) - eats whitespace and the name if there is a left block (6 space indent)
Group 13/14: ([^\s\r\n]+( [^\s\r\n]+)*) - name from right block
Group 15/16: ([^\n\r]+[\n\r]\s{6}([^\s]+\s)*\s+) - eats trailing stuff, whitespace, anything in left block if it is there, more whitespace
Group 17/18/19/20/21: ((\w+(\s\w+)*)([^\n\r]*[\n\r]\s{6}([^\s]+\s)*\s+))? - optionally gets building name (see library entry in sample), eats whitespace, left street address if it is there, whitespace
Group 22/23: (\w+(\s\w+)*) - street address from rt block
Group 24/25: ([^\n\r]*[\n\r]\s{6}([^\s]+\s)*\s+) - trailing stuff, linebreak, whitespace, optional left stuff, more whitespace
Group 26/27/28/29/30/31/32: (\w+(\s\w+)*)(\s+)([A-Z]{2})(\s+)(\d{5}(-\d{4})?) - city, spaces, state, spaces, zipcode with optional extended 4 digits

When I used the replace method on the matches I was outputting:
For Each match in matches
outFile.Write objRegExp.Replace(match.Value,"""$1"",""$4"",""$6"",""$10"",""$13"",""$18"",""$22"",""$26"",""$29"",""$31""") & vbCrLf & vbCrLf
Next

So the output is price, tax|empty, total, id, name, business|empty, street address, city, state, zip|extended zip
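
The empty tax column comes from groups 3 and 4 being optional: when there is no TAX line, group 4 simply does not take part in the match. Below is a reduced Python sketch using only the first six groups of the expression above. Python's `re` stands in for the VBScript RegExp object here, `price_fields` is a hypothetical helper, and the two sample strings are cut-down, hand-made versions of the price block.

```python
import re

price_block = re.compile(
    r"(\d+\.\d{2})([\n\r\s]+)(TAX\s+)?(\d+\.\d{2})?([\n\r\s]+)(\d+\.\d{2})"
)

def price_fields(text):
    """Return (price, tax or "", total) for the first price block found."""
    m = price_block.search(text)
    return (m.group(1), m.group(4) or "", m.group(6)) if m else None

print(price_fields("1   45.00\r\n\r\n\r\n      45.00\r\n   1ST NOTICE"))
print(price_fields("1   30.00\r\n\r\n   TAX      2.19\r\n\r\n      32.19\r\n"))
```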


Some of that could be trimmed down by using \w's in some places, but it works and i'm not messing with it :p

Note: the TGML ignore tags are handy when displaying regular expressions.
Second Note: The sample file has an error, the last zip code is missing the 5th number...that was a pain until i noticed it. It should be 12345, not 1234.
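
The missing digit matters because the last group requires exactly five digits before the optional extension. Checking the city/state/zip part of the expression on its own shows this. The snippet is Python for illustration only, `parse_last_line` is a hypothetical helper, and it anchors the sub-pattern with `fullmatch`, which the full expression in this thread does not do.

```python
import re

city_state_zip = re.compile(
    r"(\w+(\s\w+)*)(\s+)([A-Z]{2})(\s+)(\d{5}(-\d{4})?)", re.IGNORECASE
)

def parse_last_line(line):
    """Return (city, state, zip) or None if the line does not fit the pattern."""
    m = city_state_zip.fullmatch(line.strip())
    return (m.group(1), m.group(4), m.group(6)) if m else None

print(parse_last_line("SANTE FE VT 12345-1234"))
print(parse_last_line("KANSAS CITY MO 12345"))
print(parse_last_line("KANSAS CITY MO 1234"))   # four-digit zip as in the sample
```

The first two lines split into city, state and zip; the four-digit version returns None, which is why that record never matched.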

01000111 01101111 01110100 00100000 01000011 01101111 01100110 01100110 01100101 01100101 00111111
The never-completed website:
 
Argh, I have an example file that has the optional line filled with:
ATTN; SERIALS

To fix it, replace group 17-21 with: ((\w+;?(\s{1,2}\w+)*)([^\n\r]*[\n\r]\s{6}([^\s]+\s)*\s+))?

The fix is the addition of the optional semi-colon and allowing 1 or 2 spaces between words. Grr.
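
To see why the extra `;?` and `\s{1,2}` are needed, the word-run part of the group can be tried on its own. This is a Python sketch for illustration; only the inner word-run piece is tested, not the whole expression, and the third test line (two spaces) is a made-up variant.

```python
import re

before = re.compile(r"\w+(\s\w+)*")          # original word-run piece
after = re.compile(r"\w+;?(\s{1,2}\w+)*")    # with optional ";" and 1 or 2 spaces

for line in ["NWCC LIBRARY BUILDING", "ATTN; SERIALS", "ATTN;  SERIALS"]:
    print(line, bool(before.fullmatch(line)), bool(after.fullmatch(line)))
```

The original piece accepts only the first line, while the revised piece accepts all three.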

-T

01000111 01101111 01110100 00100000 01000011 01101111 01100110 01100110 01100101 01100101 00111111
The never-completed website:
 