Status
Not open for further replies.

spiveygb

Programmer
Jun 24, 2003
27
US
I have an html page that I automatically retrieve from a partner site and then save on my machine. The html page displays store names and addresses. What I need to do is to extract certain store names and addresses from this file. Does anyone have suggestions about how to do this? Would some form of regexp be the best approach? Thanks
 
I don't understand. Are you trying to retrieve data from a database to populate your form?

.....................................................................................................................
"The secret to creativity is knowing how to hide your sources."
-Albert Einstein

 
No. I am trying to extract the information directly from a plain html page. My thinking was that I could grab the html page, save it as a text file, and then search the text file for the appropriate information.
 
You'll need an HTTP GET component, such as Microsoft's XMLHTTP or serverobjects.com's ASPHTTP, to fetch the page.

Then you'll have to write a custom parser to handle the HTML that comes back.
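
A minimal sketch of the fetch step, assuming the MSXML ServerXMLHTTP component is installed on the server. The URL is a placeholder (not from this thread), and PageContent is simply the variable name used in the parsing code further down:

```vbscript
' Hypothetical URL; substitute the partner page you actually retrieve.
Const PAGE_URL = "http://www.example.com/stores.html"

Dim http, PageContent
Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")
http.open "GET", PAGE_URL, False   ' False = wait for the response (synchronous)
http.send

If http.status = 200 Then
   ' the raw HTML of the page, ready for the custom parser
   PageContent = http.responseText
Else
   Response.Write "Request failed with HTTP status " & http.status
End If
Set http = Nothing
```

This only gets the HTML into a string; picking the store names and addresses out of it is still up to your own parsing code.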
 
Is the page in alphabetical order, or any other type of order?

Along the lines of your regular expression idea, you could create the basic expression based on the format of the page and then simply add in the user's name you're searching for at runtime:
Code:
Dim str
'varUserName holds the name (or part of the name) being searched for
'[a-z ]* allows letters and spaces around it, so "smith" also finds "joe smith"
str = "(<tr><td>)([a-z ]*" & varUserName & "[a-z ]*)(</td><td>)(\d{3}-\d{3}-\d{4})"

That pattern could be applied to a document that was structured as a table that looked like this:
Code:
<tr><td>bob</td><td>123-456-7890</td></tr>
<tr><td>sue</td><td>123-456-7890</td></tr>
<tr><td>joe smith</td><td>987-654-3211</td></tr>

You could then use the Execute and Replace methods of the regular expression to collect all the matches into one delimited string, like so:
Code:
Dim aRegExp
Set aRegExp = New RegExp
aRegExp.Pattern = str
aRegExp.IgnoreCase = True
aRegExp.Global = True

Dim matches, match, final
Set matches = aRegExp.Execute(PageContent)
For Each match In matches
   final = final & aRegExp.Replace(match.Value, "##END##$2###$4")
Next

'Now we can split on ##END## to separate all the matching entries - if the UBound is not greater than zero then we don't have any entries

Dim arrFinal
arrFinal = Split(final, "##END##")
If UBound(arrFinal) > 0 Then
   'and for each of those entries in the array we can split on ### to separate the name and phone number
   Dim i, arrEntry
   'NOTE: these loop bounds are wrong; the corrected loop is posted later in this thread
   For i = 0 to UBound(arrFinal)-1
      arrEntry = Split(arrFinal(i), "###")
      Response.Write arrEntry(0) & "'s phone number is " & arrEntry(1) & "<br>"
   Next
Else
   Response.Write "Nothing matches the search criteria"
End If

This means the entire document must be searched, which could be slow. It would probably be much quicker if there were some order to the document, because then you could load the entire file into an array and adapt a faster search routine (something like a binary search) to it.

Either way will undoubtedly be slow if you're loading the file from the remote server every time someone accesses the page. You may want to change that so that it only loads the file fresh once a day. Something like this:
Code:
1) Create an FSO object
2) Check if the listing.txt file exists
3) If it does, check whether its last modified date is earlier than today's
4) If it doesn't exist or the date is earlier than today's, get a new copy and use the FSO object to save it locally to listing.txt
5) Continue on to the parsing portion, but use the listing.txt file instead of loading it remotely

What this does is give the first person who accesses the page each day the same slowdown of requesting the file remotely. Anyone who requests it later in the day doesn't have to wait for a remote copy to download; they use listing.txt, the cached version created earlier that day, which speeds up execution quite a bit.
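
The caching steps above could be sketched roughly like this. It assumes listing.txt lives in the same folder as the ASP page and that the web server account can write there. FetchRemotePage and its URL are made up for illustration and are not from this thread:

```vbscript
Const ForReading = 1

' Hypothetical helper: returns the partner page's HTML as a string
Function FetchRemotePage()
   Dim http
   Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")
   http.open "GET", "http://www.example.com/stores.html", False   ' placeholder URL
   http.send
   FetchRemotePage = http.responseText
End Function

Dim fso, cachePath, needsRefresh, ts, PageContent
Set fso = Server.CreateObject("Scripting.FileSystemObject")
cachePath = Server.MapPath("listing.txt")

' Steps 2-3: refresh when the file is missing or was last written before today
needsRefresh = True
If fso.FileExists(cachePath) Then
   If DateDiff("d", fso.GetFile(cachePath).DateLastModified, Date()) = 0 Then
      needsRefresh = False
   End If
End If

If needsRefresh Then
   ' Step 4: get a new copy and save it locally
   PageContent = FetchRemotePage()
   Set ts = fso.CreateTextFile(cachePath, True)   ' True = overwrite
   ts.Write PageContent
   ts.Close
Else
   ' Step 5: use the cached copy instead of loading it remotely
   Set ts = fso.OpenTextFile(cachePath, ForReading)
   PageContent = ""
   If Not ts.AtEndOfStream Then PageContent = ts.ReadAll
   ts.Close
End If
```

Either branch leaves the HTML in PageContent, so the parsing code does not need to know whether it came from the cache or from the remote server.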

-Tarwn

01010100 01101001 01100101 01110010 01101110 01101111 01101011 00101110 01100011 01101111 01101101
29 3K 10 3D 3L 3J 3K 10 32 35 10 3E 39 33 35 10 3K 3F 10 38 31 3M 35 10 36 3I 35 35 10 3K 39 3D 35 10 1Q 19
Do you know how hot your computer is running at home? I do
 
Tarwn-you have described what I intend to do (the data is updated every 4-6 hrs.) so I won't have to pull the html into a txt file and search it every time. I have been able to pull the page into a txt file. I needed help with the text searching which you provided. THANKS!
 
I can't quite figure out this part of your code example: "##END##$2###$4". Could you elaborate a bit?
 
The use of the $1 within the replace method refers to the first saved submatch. If you had more than one submatch, you would refer to them consecutively by using $2, $3, and so on.
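
As a small illustration of that numbering (the pattern and the sample name here are made up for the example, not taken from the code above):

```vbscript
Dim re
Set re = New RegExp
re.Pattern = "(\w+) (\w+)"   ' $1 = first word, $2 = second word

' Swap the two submatches around a comma
Response.Write re.Replace("Bob Smith", "$2, $1")   ' writes: Smith, Bob
```

Anything in the replacement string that is not a $n reference, such as the comma and space here, is written out literally.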

____________________________________________________
The most important part of your thread is the subject line.
Make it clear and about the topic so we can find it later for reference. Please!! faq333-2924

 
onpnt brought to my attention an error that I made. I switched the way the array was being put together mid coding and forgot to change the loop, so I will update that example and provide a little better explanation:
Code:
Dim aRegExp
'instantiate the regular expression object
Set aRegExp = New RegExp
'feed it our pattern that we created
aRegExp.Pattern = str
'make it ignore case, ie case insensitive now
aRegExp.IgnoreCase = True
'global means to find all matches, if this was false it would only find the first match
aRegExp.Global = True

Dim matches, match, final
'get all the matches and assign the collection to the matches variable
Set matches = aRegExp.Execute(PageContent)
For Each match In matches
   'ok, look after the code for the explanation on this, but basically it's going to be one long string of "##END##name###phone number##END##name###phone number"
   final = final & aRegExp.Replace(match.Value, "##END##$2###$4")
Next

'Now we can split on ##END## to separate all the matching entries. Because every entry starts with ##END##, element 0 of the array is always blank, so a UBound greater than 0 means we have at least one match. With no matches we split an empty string, which gives an empty array (UBound of -1), and the test below fails as it should

Dim arrFinal
'split each entry, this will give us an array of strings that look like: "name###phone number"
arrFinal = Split(final, "##END##")
If UBound(arrFinal) > 0 Then
   'and for each of those entries in the array we can split on ### to separate the name and phone number
   Dim i, arrEntry

   '------- Note Correction Here
   'we skip the 0th element because it is blank, now loop through the rest
   For i = 1 to UBound(arrFinal)
      'split each entry into an array that should contain name and phone number as index 0 and 1 respectively
      arrEntry = Split(arrFinal(i), "###")
      Response.Write arrEntry(0) & "'s phone number is " & arrEntry(1) & "<br>"
   Next
Else
   Response.Write "Nothing matches the search criteria"
End If

Sorry for any confusion that may have caused.

The replace string works like this. Anything in the regular expression's pattern that is surrounded by parentheses is treated as a group. When we do a replace we can tell it to substitute values from the match back into the text we are replacing it with.
Thus the $2 and $4 mean to insert the value of the 2nd and 4th group, respectively. In our case above, the second set of parens is the person's name and the 4th set of parens is the phone number.
The two delimiters we are adding are important because they keep the records and fields apart. If we were trying to separate out name, phone number, favorite color (pretend this is $6), and favorite ice cream flavor (pretend this is $8) we would want something like:
##END##Bob###111-111-1111###Red###pecan##END##Joe###123-456-7890###Blue###Mint Chocolate Chip

Those delimiters could be anything you want. I only chose "##END##" and "###" because I don't expect them to ever show up in a name, phone number, or similar information. Even an address should never have more than one # in it, so it should be safe to use as a delimiter.

When we split our example string from above on "##END##", we would get an array that looks like this:
Code:
array(0) => ""
array(1) => "Bob###111-111-1111###Red###pecan"
array(2) => "Joe###123-456-7890###Blue###Mint Chocolate Chip"

We ignore the first entry. The reasoning here is that the leading "##END##" always produces a blank element 0 when there is at least one match, so a UBound greater than 0 means we found something. If there were no matches we would be splitting an empty string, which gives an empty array (UBound of -1), and the same test fails cleanly. If we didn't leave the blank there and we had one match with no extra "##END##", we would have a UBound of 0 and would have to test differently to tell that case apart from no matches. This cuts the test down by one step.

So now we loop through the rest of the "records" one at a time. Each one needs to be split into its own array to retrieve the values. This is why we have the second delimiter of "###".

Another scheme for delimiters that might make sense would be to use vbTab between fields and vbCrLf between records, like so:
vbCrLf & "$2" & vbTab & "$4"

That may be easier to think about because it basically just makes a tab delimited format, ie:
Code:
Bob	111-111-1111	Red	pecan
Joe	123-456-7890	Blue	Mint Chocolate Chip
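
Here is a rough sketch of the same loop using that tab-delimited scheme. The sample PageContent is the small table from earlier in the thread, hard-coded so the snippet stands on its own, and the pattern matches every name rather than one particular user:

```vbscript
Dim re, PageContent, m, final, rows, fields, i

' Sample input only; in practice this would be the fetched page
PageContent = "<tr><td>bob</td><td>123-456-7890</td></tr>" & _
              "<tr><td>sue</td><td>123-456-7890</td></tr>"

Set re = New RegExp
re.Pattern = "(<tr><td>)([a-z ]*)(</td><td>)(\d{3}-\d{3}-\d{4})"
re.IgnoreCase = True
re.Global = True

final = ""
For Each m In re.Execute(PageContent)
   ' the leading vbCrLf plays the same role as ##END##: element 0 stays blank
   final = final & re.Replace(m.Value, vbCrLf & "$2" & vbTab & "$4")
Next

' one record per line, one field per tab
rows = Split(final, vbCrLf)
For i = 1 To UBound(rows)
   fields = Split(rows(i), vbTab)
   Response.Write fields(0) & "'s phone number is " & fields(1) & "<br>"
Next
```

A side benefit of this layout is that the string in final can be written straight to a text file and opened in a spreadsheet. It would break if a name or field ever contained a tab or line break, so that is an assumption about the data.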

Sorry for an earlier confusion,

-Tarwn

 
BTW Tarwn that was a perfect description of the process. I like! In fact I think a "How to file contents" FAQ could be made with a copy/paste

star!

 
It would need more explanation and generalities, which means more time. Considering I have to create a file upload/download/listing tool for my mother tonight, I'm not sure I have the time to do it today :p

 
I wanted to post one more note to thank both of you for your help with this. Know that if I have the opportunity I will gladly return the favor.
 
[smile]

 
Glad to be helpful, thanks for nudging me along onpnt :)

 
How would this code be updated to extract physical address information? For example:

The page has miscellaneous text everywhere, and mixed into that text is

jo bob hilton
1234 Here Ln. #5
somecity, ST 12345-4568

Blah Blah Blah blah

Frankie Miles
1234 there St
somecity, ST 14523

more junk

example output would be whitepages.com
 