
Regular expression fun :)


Zippeh

Programmer
Sep 24, 2002
56
GB
Hi there,

I'm reading from a logfile and trying to pull out some querystring data. The querystring could be as follows:

cat=xxx&doc=yyy&anyotherdata
cat=xxx&anyotherdata
doc=yyy&anyotherdata

I need to pull out the cat and doc numbers.

The regular expression I have so far works with the bottom two situations, but I can't seem to pull out the data when both cat and doc are present. Here it is:

s/^([^ ]+ ){6}(cat=(\d+)[^ ]*)?(&)?(doc=(\d+)[^ ]*)?( .*)$/$3\|$6\n/

Any ideas?

Thanks in advance.
 
Show us a live example and tell us which parts are numbers and which are letters, or better still, show us part of the real file,
because the regex you've written here looks like an alien.



``The wise man doesn't give the right answers,
he poses the right questions.''
TIMTOWTDI
 
2005-04-27 00:00:04 64.62.168.55 - GET /GWY_atborth.asp cat=3103&doc=9107&iaith=1&teitl=Agenda+%2D+4+January&rhaglen=/gwy_doc.asp&Language=1 200 0 21129 345 2875 HTTP/1.0 UKGovbot/2.0 - -
 
If the numbers are always four digits long (e.g. 3103 or 9107):
Code:
if ($_ =~ /cat=(\d{4})&doc=(\d{4})/ ){
    print "cat = $1 \n";
    print "doc = $2 \n";
}
If they can be longer or shorter, replace {4} with a range:

for 1 or more digits use {1,}
for 2 or more digits use {2,}
...
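
A small sketch of how these quantifiers behave. The sample query strings and the %result hash are made up for illustration. Note that \d{4} also matches the first four digits of a longer number, so it can silently truncate.

```perl
use strict;
use warnings;

# Hypothetical query strings with cat values of different lengths.
my @samples = ('cat=7&doc=12', 'cat=3103&doc=9107', 'cat=123456&doc=1');
my %result;

for my $qs (@samples) {
    my ($fixed) = $qs =~ /cat=(\d{4})/;    # exactly four digits
    my ($range) = $qs =~ /cat=(\d{1,})/;   # one or more digits
    my ($plus)  = $qs =~ /cat=(\d+)/;      # + is shorthand for {1,}

    $result{$qs} = [$fixed, $range, $plus];

    # $fixed is undef when fewer than four digits follow "cat=".
    printf "%-20s {4}: %-7s {1,}: %-7s +: %s\n",
        $qs, defined $fixed ? $fixed : '(none)', $range, $plus;
}
```

For the last sample, {4} captures only 1234 while {1,} and + capture the whole number, which is why an open-ended range is the safer choice when the length is unknown.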


 
I don't think that will work, as it also has to cope with these situations:

2005-04-27 00:05:28 192.168.1.12 - GET /GWY_section.asp cat=2537&language=2&rhan=pen&enw_tudalen=%2fDATRhagorol%2fcgi-bin%2frhestrarchif.cymraeg.pl&querystring=archiveid%3d12%26parentid%3d39858 200 0 8675 194 797 HTTP/1.0 - - -

and

2005-04-27 00:10:18 68.142.250.203 - GET /doc.asp doc=4545 302 0 396 251 31 HTTP/1.0 Mozilla/5.0+(compatible;+Yahoo!+Slurp;+ - -
 
You didn't tell me how many digits the numbers have. Four?

If so, try this:
Code:
open FH, '<', 'your_file.txt' or die "Cannot open your_file.txt: $!";
$cat = 'cat=(\d{4})';
$doc = 'doc=(\d{4})';
foreach (<FH>){
	if ($_ =~ /$cat/ ){
		print "cat = $1\n";
	}
	if ($_ =~ /$doc/){
		print "doc = $1 \n";
	}
}
close FH;


 
Sorry, the number can be anything. From 1 to a gazillion :)
 
From this line, which number do you want?

Code:
2005-04-27 00:10:18 68.142.250.203 - GET /doc.asp doc=4545 302 0 396 251 31 HTTP/1.0 www.gwynedd.gov.uk


 
I just want the number 4545, stored in the variable for the doc number.
 
Why try to do too much in one regex when you don't have to?
Here's my simple little solution:

Code:
#!/usr/bin/perl -w

use strict;

while(<DATA>) {
  chomp;
  print "Testing ip of $1\n" if(/\b(\d+\.\d+\.\d+\.\d+)\b/);
  print "Cat number is $1\n" if(/cat=(\d+)/);
  print "Doc number is $1\n" if(/doc=(\d+)/);
}
__DATA__
2005-04-27 00:00:04 64.62.168.55 - GET /GWY_atborth.asp cat=3103&doc=9107&iaith=1&teitl=Agenda+%2D+4+January&rhaglen=/gwy_doc.asp&Language=1 200 0 21129 345 2875 HTTP/1.0 www.gwynedd.gov.uk UKGovbot/2.0 - -
2005-04-27 00:05:28 192.168.1.12 - GET /GWY_section.asp cat=2537&language=2&rhan=pen&enw_tudalen=%2fDATRhagorol%2fcgi-bin%2frhestrarchif.cymraeg.pl&querystring=archiveid%3d12%26parentid%3d39858 200 0 8675 194 797 HTTP/1.0 www.gwynedd.gov.uk - - -
2005-04-27 00:10:18 68.142.250.203 - GET /doc.asp doc=4545 302 0 396 251 31 HTTP/1.0 www.gwynedd.gov.uk Mozilla/5.0+(compatible;+Yahoo!+Slurp;+http://help.yahoo.com/help/us/ysearch/slurp) - -

And the output is:
Testing ip of 64.62.168.55
Cat number is 3103
Doc number is 9107
Testing ip of 192.168.1.12
Cat number is 2537
Testing ip of 68.142.250.203
Doc number is 4545
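
As for why the single regex from the first post misses the doc number when both parameters are present, a likely culprit is the greedy [^ ]* right after cat=(\d+): it runs to the end of the query string and swallows "&doc=...", so the optional doc group matches nothing. The sketch below compares that pattern with a variant using a lazy [^ ]*?. The shortened sample line and the array names are assumptions for illustration, not taken from the real log.

```perl
use strict;
use warnings;

# Shortened, hypothetical line with the same layout as the log:
# six space-separated fields, then the query string.
my $line = '2005-04-27 00:00:04 64.62.168.55 - GET /page.asp cat=3103&doc=9107&x=1 200 0';

# Pattern from the first post: greedy [^ ]* after cat= consumes "&doc=9107&x=1".
my @greedy = $line =~ /^([^ ]+ ){6}(cat=(\d+)[^ ]*)?(&)?(doc=(\d+)[^ ]*)?( .*)$/;

# Same pattern with a lazy [^ ]*? so the doc group gets a chance to match.
my @lazy = $line =~ /^([^ ]+ ){6}(cat=(\d+)[^ ]*?)?(&)?(doc=(\d+)[^ ]*)?( .*)$/;

# In list context, capture group 3 is index 2 and group 6 is index 5.
for my $case (['greedy', \@greedy], ['lazy', \@lazy]) {
    my ($name, $caps) = @$case;
    printf "%-6s cat=%s doc=%s\n", $name,
        defined $caps->[2] ? $caps->[2] : '(undef)',
        defined $caps->[5] ? $caps->[5] : '(undef)';
}
```

Even with the lazy fix, the pattern stays harder to read than two independent matches, which is the argument for the simpler approach above.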


BTW: In future, the more sample data you supply, the more accurate the reply is likely to be.
Let me know if this does not do what you want.

Trojan.
 
That seems spot on, but I'm having some difficulty in putting them into the variables:

Code:
while ($llinell = readline(FFEIL_DARLLEN)) {
    chomp($llinell);
    $strRhifau = $llinell;

    if ($strRhifau =~ /^([^\s]+ ){6}(cat=(\d+)[^\s]*)?(&?doc=(\d+)[^\s]*)?( .*)$/) {

        $intRhifCategori = $llinell;
        $intRhifCategori =~ s/cat=(\d+)/$1/;

        $intRhifDogfen = $llinell;
        $intRhifDogfen =~ s/doc=(\d+)/$1/;
    }
}

Sorry, I'm very new to Perl, so I know it probably doesn't look very pretty!
 
Then this should do it:
Code:
open FH, '<', 'file.txt' or die "Cannot open file.txt: $!";
$cat = 'cat=(\d{1,})(\s|&)';
$doc = 'doc=(\d{1,})(\s|&)';
foreach (<FH>){
	if ($_ =~ /$cat/ ){
		print "cat = $1\n";
	}
	if ($_ =~ /$doc/){
		print "doc = $1 \n";
	}
}
close FH;
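
To store the numbers in variables instead of printing them, one option is to assign the match in list context, which returns the captured groups directly. This is only a sketch: extract_ids is a hypothetical helper name, and the lines are shortened versions of the samples posted earlier.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the cat and doc numbers
# of one log line; each is undef when that parameter is missing.
sub extract_ids {
    my ($line) = @_;
    my ($cat) = $line =~ /\bcat=(\d+)/;   # list assignment yields capture 1
    my ($doc) = $line =~ /\bdoc=(\d+)/;   # empty list (undef) if no match
    return ($cat, $doc);
}

# Shortened versions of the three sample lines.
my @lines = (
    '64.62.168.55 - GET /GWY_atborth.asp cat=3103&doc=9107&iaith=1 200 0',
    '192.168.1.12 - GET /GWY_section.asp cat=2537&language=2 200 0',
    '68.142.250.203 - GET /doc.asp doc=4545 302 0',
);

for my $line (@lines) {
    my ($cat, $doc) = extract_ids($line);
    printf "cat=%s doc=%s\n",
        defined $cat ? $cat : '(none)',
        defined $doc ? $doc : '(none)';
}
```

In the loop posted above, the two return values could be assigned to $intRhifCategori and $intRhifDogfen. The s/cat=(\d+)/$1/ approach only strips the "cat=" prefix and leaves the rest of the line in the variable, which is why it does not yield the bare number.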


 
Using this:

$cat = 'cat=(\d{1,})(\s|&)';

How would I tell it to ignore case? I know that with "/blah/i" the i goes at the end, but where do I put it in the one above?
 
You can use the qr// operator to precompile the regexp, which allows you to include the /i flag. Taking the code that pengo posted, it then becomes:
Code:
open FH, '<', 'file.txt' or die "Cannot open file.txt: $!";
$cat = qr/cat=(\d{1,})(\s|&)/i;
$doc = qr/doc=(\d{1,})(\s|&)/i;
foreach (<FH>){
    if ($_ =~ $cat ){
        print "cat = $1\n";
    }
    if ($_ =~ $doc){
        print "doc = $1 \n";
    }
}
close FH;
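
For comparison, here is a sketch of three ways to get a case-insensitive match from a stored pattern: qr// with the /i flag, a plain string with the inline (?i) modifier, and a plain string with /i added where the match is performed. The variable names and the sample line are made up for illustration.

```perl
use strict;
use warnings;

my $cat_qr     = qr/cat=(\d+)(?:\s|&)/i;      # flag attached by qr//
my $cat_inline = '(?i)cat=(\d+)(?:\s|&)';     # inline modifier inside a string
my $cat_plain  = 'cat=(\d+)(?:\s|&)';         # flag supplied at the match site

# Hypothetical line with an upper-case parameter name.
my $line = 'GET /page.asp CAT=3103&doc=9107 200';

my ($from_qr)     = $line =~ $cat_qr;
my ($from_inline) = $line =~ /$cat_inline/;
my ($from_flag)   = $line =~ /$cat_plain/i;

print "qr//i:  $from_qr\n";
print "(?i):   $from_inline\n";
print "/.../i: $from_flag\n";
```

All three find the number after "CAT=" here. The qr// form has the advantage that the flag travels with the pattern, so it cannot be forgotten at the match site.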
 