Second look at my regex please

(OP)
Hi All,

I hope someone can help me by taking a look at my regex and telling me where I am going wrong.

I have a load of text files in the format

Title:
blah blah blah over one or many lines

Article Body:
blah blah blah over many lines

I am trying to capture the content of the title and the article body without the "Title:" or "Article Body:"
Additionally, there are sometimes other things between the title and the body text in the file, such as

Wordcount:
this is a number

Keywords:
blah blah blah

I want to ignore these other items in the file.
As a rule, each content block to capture has its field name followed by a colon, and I want to capture everything up to the next field name followed by a colon, but without any of the field names.


Here is what I am doing so far.
I have read each file into a single variable ($result), since the files are not very big but there are lots of them.

CODE

my $pattern = qr/^Title:\n+^(.+\n)/mx;
if($result=~/$pattern/){
    $title = $1;
}
my $pattern2 = qr/^Article Body:\n+^(.+\n)/mx;
if($result=~/$pattern2/){
    $body = $1;
}


If I could, I would rather do it in one regex, but I am not sure how to get that working.


I would really appreciate any suggestions as my regex knowledge is quite rusty (and probably wasn't that great to start with :)

Thanks,

Jez

RE: Second look at my regex please

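The 'x' modifier does not help your regex, and it actually does some harm: it makes Perl ignore whitespace in the pattern (so that you can lay it out and add comments), which means the space in "Article Body:" is dropped and your second pattern never matches.
Beyond that, your regex captures only one line: '.' does not match a newline, so (.+\n) stops at the first line end, whereas the end of your title or body field is given either by another field name or by the end of the file.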
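
Both effects are easy to check with a small script. This is only a sketch: the sample text in $result is made up to match the format described in the question, and the patterns are the ones from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample in the format described in the question.
my $result = "Title:\nFirst title line\nSecond title line\n\n"
           . "Wordcount:\n42\n\n"
           . "Article Body:\nBody line one\nBody line two\n";

# Without /s, "." never matches a newline, so (.+\n) stops after one line.
my ($title) = $result =~ /^Title:\n+^(.+\n)/m;
print "title: $title";    # only the first title line is captured

# Under /x the literal space in "Article Body" is ignored by the regex
# compiler, so the pattern looks for "ArticleBody:" and finds nothing.
my $with_x    = $result =~ /^Article Body:\n+^(.+\n)/mx ? 1 : 0;
my $without_x = $result =~ /^Article Body:\n+^(.+\n)/m  ? 1 : 0;
print "with /x: $with_x, without /x: $without_x\n";
```

If /x is wanted for readability, the space has to be written as "\ " or "[ ]" so that it survives.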
It can be done, of course, but a regex is for sure the most inefficient way to do this, especially if the file may be rather long.
I would do this by reading the file line by line, more or less like so (untested and using if's for simplicity, could use switch or other constructs):

CODE

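my($title,$intitle,$content,$incontent)=('',0,'',0);
while(<FILE>){
  # Test for field names first, otherwise a block, once started, never ends
  if(/^Title:$/){
    $intitle=1;
    $incontent=0;
  }elsif(/^Article Body:$/){
    $intitle=0;
    $incontent=1;
  }elsif(/^[\w ]+:$/){
    # any other field name (Wordcount:, Keywords:, ...) closes the current block
    $intitle=$incontent=0;
  }elsif($incontent){
    $content.=$_;
  }elsif($intitle){
    $title.=$_;
  }
}
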
A note: from your regex it seems that you expect multiple newlines after the field name; this should be clarified.
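
For completeness, here is a sketch of how the single-regex version asked for in the question could look, working on the whole file in $result. The sample text and the $field_name pattern (what counts as a field-name line) are assumptions, not something taken from the real files.

```perl
use strict;
use warnings;

# Hypothetical sample in the format described in the question.
my $result = "Title:\nFirst title line\nSecond title line\n\n"
           . "Wordcount:\n42\n\n"
           . "Article Body:\nBody line one\nBody line two\n";

# Assumption: a field name is a line of word characters and spaces
# that ends in a colon.
my $field_name = qr/^[\w ]+:$/m;

my ($title, $body);
# /s lets "." cross newlines; each lazy (.*?) stops where the lookahead
# sees the next field name (or, for the body, the end of the string).
if ($result =~ /^Title:\n+(.*?)\s*(?=$field_name).*?^Article Body:\n+(.*?)\s*(?=$field_name|\z)/ms) {
    ($title, $body) = ($1, $2);
}

print "Title: [$title]\n";
print "Body:  [$body]\n";
```

The pattern assumes that "Title:" comes before "Article Body:" in the file, and a content line that itself looks like a field name would end the capture early.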

Franco
http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

RE: Second look at my regex please

(OP)
Hi, thanks for the suggestion, it is a good approach.
I am having some difficulty getting it to work though.

The content is indeed multiple lines.

I think I can get it working though, so thanks again.

Jez  

RE: Second look at my regex please

Hi

Personally I prefer generic solutions:

CODE --> Perl ( fragment )

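my (%piece,$section);
while (my $line=<FILE>) {
  chomp $line;                      # lines are re-joined with "\n" below
  if ($line=~m/^([\w\s]+):$/) {     # a field name on a line of its own
    $section=$1;
    next;
  }
  next unless defined $section;     # skip anything before the first field name
  $piece{$section}.=(defined $piece{$section}?"\n":'').$line;
}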
Then you will have the desired data in $piece{'Title'} and $piece{'Article Body'}.

Note that the regular expression may need some adjustment if you have tricky content lines, for example a content line that ends in a colon. In the worst case you can still enumerate all allowed section names, as in /^(Title|Article Body|Wordcount|Keywords):$/ .
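
As a rough illustration of the enumerated variant, here is a self-contained sketch. The sample text is invented, and it is read through an in-memory filehandle ($fh) instead of FILE so that the script runs on its own; one content line ends in a colon on purpose.

```perl
use strict;
use warnings;

# Hypothetical sample; "Things to note:" is content, not a field name.
my $text = "Title:\nThings to note:\nread this first\n\n"
         . "Keywords:\nperl, regex\n\n"
         . "Article Body:\nBody line one\nBody line two\n";

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my (%piece, $section);
while (my $line = <$fh>) {
    chomp $line;
    # Only the listed names start a new section.
    if ($line =~ /^(Title|Article Body|Wordcount|Keywords):$/) {
        $section = $1;
        next;
    }
    next unless defined $section;    # nothing to collect yet
    $piece{$section} .= "$line\n";
}
close $fh;

# Trim the blank lines that separate the blocks.
s/\s+\z// for values %piece;

print "Title: [$piece{'Title'}]\n";
print "Body:  [$piece{'Article Body'}]\n";
```

With the generic pattern the line "Things to note:" would have been taken for a new section; with the enumerated names it stays part of the title.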

Feherke.
