
How can I track my position in a regex?


eggmatters (Programmer, US)
Jan 31, 2008
I am attempting to 'find and replace' content in a rather large XML file. Basically, I am exporting data from an Excel spreadsheet into a Visio document. Some of the fields don't match directly because of newlines, spaces, special characters and other subtle differences between tags. Roughly, the schema is:
<Text><meta data and custom tags and attributes>Data to be sought</meta data and custom tags and attributes></Text>.

The variable being searched holds the entire file as one string: $infile.

One problem I encounter is when I write the expression:

$infile =~ s/(<Text>)(.*)(<\/Text>)/$marker/s;

It replaces everything between the first <Text> tag and the last </Text> tag. I need it to find each <Text> block separately.
 
Ok, I got that figured out.

$infile =~ s/(<Text>)(.?)(<\/Text>)/$marker/s;

Now I just need to figure out how to call it iteratively for each instance. For some reason, Perl isn't letting me write a function that contains that regex to be called.
 
More correctly, use .*? instead:
.? means zero or one of any character
.*? means zero or more of any character, matching as little as possible (non-greedy matching)
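
To make the difference concrete, here is a minimal sketch on a made-up two-block string. The sample string and the [X] marker are assumptions for illustration, not data from the thread:

```perl
use strict;
use warnings;

my $sample = '<Text>one</Text><Text>two</Text>';

# .? allows at most one character between the tags, so "one" cannot match.
my $single_char_ok = $sample =~ /<Text>.?<\/Text>/ ? 1 : 0;

# Greedy: .* runs to the LAST </Text>, so both blocks are swallowed by one match.
(my $greedy = $sample) =~ s/<Text>.*<\/Text>/[X]/s;

# Non-greedy: .*? stops at the FIRST </Text>; /g repeats it for every block.
(my $lazy = $sample) =~ s/<Text>.*?<\/Text>/[X]/gs;

print "$single_char_ok\n";  # 0
print "$greedy\n";          # [X]
print "$lazy\n";            # [X][X]
```

The greedy version is what produced the "first to last tag" behaviour described in the first post.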

Code:
$infile =~ s/(<Text>)(.*?)(<\/Text>)/$marker/s;

Or maybe better with no parentheses, because you appear to be doing nothing useful with the captures. Try adding the "g" modifier if you need to replace more than one instance of the same pattern:

Code:
$infile =~ s/<Text>.*?<\/Text>/$marker/gs;


------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
To come back to the question in the second post, I assume you don't really need that replacement, and you want to extract the stuff between text tags. In that case do the following (untested):
Code:
while($infile=~/<Text>(.*?)<\/Text>/gs){
  #$1 contains here the text between the tags for each subsequent match
  #pos($infile) returns the offset just past the end of the current match
  #(where the next search will resume), if that may be useful
}
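
A small sketch of what those offsets look like in practice, assuming a made-up two-block $infile. pos() reports where the next /g search resumes (just past </Text>), while Perl's built-in @- and @+ arrays give the start and end offsets of each capture group:

```perl
use strict;
use warnings;

# Made-up sample input, not data from the thread.
my $infile = "<Text>alpha</Text>\n<Text>beta</Text>\n";
my @spans;

while ($infile =~ /<Text>(.*?)<\/Text>/gs) {
    # $-[1] and $+[1] are the start and end offsets of capture group 1;
    # pos($infile) is where the next /g match will start looking.
    push @spans, [ $1, $-[1], $+[1], pos($infile) ];
}

printf "%-5s starts at %2d, ends at %2d, search resumes at %2d\n", @$_
    for @spans;
```

Running a regex against a different string inside the loop does not disturb pos($infile), because the position is stored per string. It does overwrite $1, @- and @+, so copy those first if they are needed later.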

Franco
 
YES!!! That is exactly what I needed. I was concerned that within the block, when I invoked another regex, it would lose its position. The marker replacement was an ungraceful approach to tracking the position of the tag. Thank you for your insight. Expect more Perl questions from me!

Speaking of which, how do I now "go to" that position within the string?
 
If I understand, this should do what you want:

Code:
$infile =~ s{<Text>.*?</Text>}{<Text>$marker</Text>}gs;
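
If the replacement has to be decided per block, the /e modifier lets the right-hand side be Perl code. A rough sketch only: %replacement_for and normalize() are hypothetical names standing in for the spreadsheet lookup, and the sample $infile is made up:

```perl
use strict;
use warnings;

# Hypothetical lookup standing in for the spreadsheet data: old text => new text.
my %replacement_for = ('old text' => 'new text');

my $infile = "<Text>  old\n text </Text><Text>untouched</Text>";

# Collapse runs of whitespace and trim both ends so keys compare cleanly.
sub normalize {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    $s =~ s/^ | $//g;
    return $s;
}

# The captures are copied to lexicals first, because the regexes inside
# normalize() overwrite the capture variables.
$infile =~ s{(<Text>)(.*?)(</Text>)}{
    my ($open, $body, $close) = ($1, $2, $3);
    my $key = normalize($body);
    $open . (exists $replacement_for{$key} ? $replacement_for{$key} : $body) . $close;
}gse;

print "$infile\n";   # <Text>new text</Text><Text>untouched</Text>
```

Blocks without an entry in the lookup are written back unchanged.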

But maybe I am not understanding what you are trying to do.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Well, essentially, I can use pos() to find the position of whatever (.*?) matched, i.e. $1. I want to take $1 and see if it matches some other stuff, and if so, replace the value of $1 at whatever position was returned by pos(). Perhaps some variant of:

Code:
while($infile=~/<Text>(.*?)<\/Text>/gs){
  $itsPos = pos($infile);
  # . . . do some stuff
  #I've decided I want to replace $1 at pos() with some variable $var.
  #This isn't what I want but close:
  substr($infile, $itsPos, $someCharLength);
}
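
One way to sketch "replace $1 in place" is to record the offsets of each capture while looping and apply the edits afterwards with the four-argument form of substr. The sample $infile, the /^old/ test and $var are assumptions for illustration:

```perl
use strict;
use warnings;

my $infile = '<Text>old one</Text><Text>keep</Text><Text>old two</Text>';
my $var    = 'NEW';    # hypothetical replacement text

my @edits;
while ($infile =~ /<Text>(.*?)<\/Text>/gs) {
    # Copy the capture and its offsets before running any other regex,
    # because a successful inner match overwrites $1, @- and @+.
    my ($text, $start, $len) = ($1, $-[1], $+[1] - $-[1]);
    push @edits, [ $start, $len ] if $text =~ /^old/;   # hypothetical test
}

# Apply the edits from the end of the string backwards so that earlier
# offsets stay valid even when the replacement has a different length.
for my $edit (reverse @edits) {
    my ($start, $len) = @$edit;
    substr($infile, $start, $len, $var);
}

print "$infile\n";   # <Text>NEW</Text><Text>keep</Text><Text>NEW</Text>
```

The edits are deferred because modifying $infile inside the while loop resets pos($infile), which would restart the search from the beginning of the string.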
 
"I want to take $1 and see if it matches some other stuff,"

At this point I need some example data (before/after) to understand what you are trying to do.


------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Ok, essentially, I'm performing a find and replace on data in a large XML file (derived from a Visio document). I am finding data in the XML file from a list I've derived from an Excel spreadsheet. From that same spreadsheet I also have the data I want to replace the old data with.

The $infile variable is actually the entire Visio XML document (vsx). The spreadsheet pairs data present in the vsx with the data I want to replace it with. I successfully draw data from the spreadsheet and place it in an associative array, with the sought data as keys and the replacement data as values. I format it, escape special characters etc. and then perform a brute force find and replace on the vsx document.

So this works for most of it:

Code:
while (($soughtData, $replaceData) = each(%assocArrayFromXL))
{
   $infile =~ s/$soughtData/$replaceData/gi
}
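
For the escaping step, a hedged sketch of the same loop using \Q...\E, which makes Perl treat the sought text literally instead of as a pattern. The hash contents and the sample $infile are made up, and sorting the keys longest-first is an added assumption so a short key cannot clobber part of a longer one:

```perl
use strict;
use warnings;

# Made-up stand-in for the spreadsheet data: sought text => replacement text.
my %replacement_for = (
    'cost (USD)?' => 'price',
    'a.b'         => 'dot',
);

my $infile = 'Cost (USD)? is listed; axb is not a.b';

for my $sought (sort { length($b) <=> length($a) } keys %replacement_for) {
    # \Q...\E quotes regex metacharacters such as ( ) ? and .
    $infile =~ s/\Q$sought\E/$replacement_for{$sought}/gi;
}

print "$infile\n";   # price is listed; axb is not dot
```

Without \Q, the key 'a.b' would also match "axb", and 'cost (USD)?' would not match its own literal text.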

Some of the data won't match, though, because of formatting weirdness such as leading whitespace, random newlines etc. If I try to strip them in the Visio file, it will blow up. I really want to avoid writing a regex that verbose, although I wish I were that good. So instead, I want to split the unmatched data into words and see if each word exists within the <Text> blocks of the Visio file. For example:

Code:
(data sample from .xls)
This is a data sample imported from an excel spreadsheet!

This is data I want to replace it with!

(and here is what it looks like in the VSX:)
<Text>
<cp IX='0'/><pp IX='0'/>
This is
is a data sample     imported <data import button>
from an excel [1 second pause before continuing]



spreadsheet!


<cp IX='0'/><pp IX='0'/><tp IX='0'/>Page <fld IX='0'>40</fld>
</Text>

So the trick here is matching all of that data where it appears in the Visio file. Then I would like to place the string "This is data I want to replace it with!" where the data appears. I don't mind how the new data looks, I just need to dump it, as is, over all of the data between the <pp IX='0'/> and the <cp IX='0'/> tags.

The problem is, we can't predict how the data is formatted in the Visio XML. In that sample, there are runs of several spaces between words, newlines within the phrase, and multiple consecutive newlines. There is also metadata (the occurrences of "<data import button>" and "[1 second pause before continuing]") breaking up the data. I have no way of predicting how that data is going to be formatted, so it is difficult to write a regex to escape or ignore all of those weird word boundary conditions. Also, the cp and pp tags differ throughout the document, so I cannot rely on them to give me an accurate boundary around just the text.

My solution is to parse the import data word by word (via split()), find each occurrence of a word within that <Text> block and, if I get a decent enough match, replace the text I want to replace. This is what I have so far:

Code:
$soughtData = "This is a data sample imported from an excel spreadsheet!";

$replaceData = "This is data I want to replace it with! Ole! It does not map word for word to the soughtData you see?";

#$infile is a huge xml file containing the chunk I showed you above where all I know is that the $soughtData resides somewhere between a <Text> and </Text> tag.

#strip the sample data into an array of words:

      @parse = split(/ /, $soughtData);
      $length = @parse;

#Create counters to track how many words I can match

      $matchcount = 0;
      $nomatch = 0;

#get data between <Text> tags:
      while ($infile =~ /<Text>(.*?)<\/Text>/gs)
      {
         $chunk = $1;
         $here = pos($infile);
         print $chunk."\n";

#see if I can get a match for every word:
         
         foreach my $word (@parse)
         {
            #\Q...\E matches the word literally; without /g every test starts at the beginning of $chunk
            if ($chunk =~ m/\Q$word\E/)
            {
               $matchcount++;
            }
            else 
            {
               $nomatch++;
            } 
         } #End foreach

         if ($matchcount == $length)

#Ok, I have a match here. This is where I get tripped up. I need to go back to the beginning of the text block from the visio and just plop the replaceData in. In order to do that, I'm trying:

         {
            
             $first = $parse[0];
             $last = $parse[-1];   #last word; $parse[$length] is one past the end
             $start = "START";
             $stop = "STOP";
             $newChunk = $chunk;
             $newChunk =~ s/\Q$first\E/$start/g;
             $newChunk =~ s/\Q$last\E/$stop/g;
             $newChunk =~ s/$start.*$stop/$replaceData/s;   #/s lets . cross the newlines in the block
             $infile =~ s/\Q$chunk\E/$newChunk/;
             #changing $infile resets pos(), so resume just past the rewritten block
             pos($infile) = $here + length($newChunk) - length($chunk);
         }
         $matchcount = 0;
         $nomatch = 0;
      } #End while

So what the last block so inelegantly tries to do is create a new block of data, place markers around the text I want to replace, replace it, and then insert the new block of data into the Visio file where the old block was.

Is there a more elegant way of achieving this?
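
One possible direction, sketched under the assumption that the only things interrupting the words are whitespace, <...> tags and [...] notes: build a single tolerant pattern from the words of the sought text. The names $gap and $pattern are made up for this sketch, and the sample is the chunk posted above with the text starting "This is / a data sample":

```perl
use strict;
use warnings;

my $sought  = 'This is a data sample imported from an excel spreadsheet!';
my $replace = 'This is data I want to replace it with!';

# Between two words, tolerate any run of whitespace, <...> tags or [...] notes.
my $gap = qr/(?:\s+|<[^>]*>|\[[^\]]*\])+/;

# quotemeta each word so characters like "!" are matched literally,
# then join the words with the tolerant gap pattern.
my $pattern = join $gap, map { quotemeta($_) } split ' ', $sought;

my $infile = <<'END_VSX';
<Text>
<cp IX='0'/><pp IX='0'/>
This is
a data sample     imported <data import button>
from an excel [1 second pause before continuing]



spreadsheet!
<cp IX='0'/><pp IX='0'/><tp IX='0'/>Page <fld IX='0'>40</fld>
</Text>
END_VSX

# In scalar context s///g returns the number of substitutions made.
my $count = ($infile =~ s/$pattern/$replace/g);

print "replaced $count block(s)\n";
print $infile;
```

Anything the gap pattern consumes, such as <data import button>, is dropped along with the old text, and a <cp .../> or <pp .../> tag sitting between two words would be swallowed as well. That is only acceptable if the new data really can be dumped over everything in between.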
 
There are many VISIO modules listed on CPAN, have you taken a look at them? I have no clue if any of them do what you want or are any good.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
This might work if you can depend on some of the "tag" formatting and the text can be inserted anywhere between the <pp ...> and <cp ...> tags. I used two arrays (of arrays), one to hold the tags and one to hold the text. I assumed that "Page" and the page number (40) are not to be considered text, so they are separated out with the tags. If the stuff like <data import button> and [1 second pause before continuing] has to remain in its original position, that can maybe be worked out.

This is probably slow as hell for a big file.

Code:
use strict;
use warnings;
use Data::Dumper;

my @tags;
my @text;
my $find = q{This is a data sample imported from an excel spreadsheet!};
my $replace = q{This is data I want to replace it with!};
my $i = 0;
my $text;

while (<DATA>){
   # Page-footer line: keep the whole line as a tag
   if (m#<cp IX='0'/><pp IX='0'/><tp IX='0'/>Page <fld IX='0'>\d+</fld>#) {
      push @{$tags[$i]},$_;
      next;
   }
   # Opening <Text>: start collecting text for a new block
   if (m/(<Text>\n?)/) {
      push @{$tags[$i]},$1;
      $text = ''; 
      next;
   }
   # Closing </Text>: collapse whitespace in the collected text and store it
   elsif (m/(<\/Text>\n?)/) {
      for ($text) {
         s/\s+/ /g;
         s/^\s//;
         s/\s$//;
      }
      push @{$text[$i]},$text;
      push @{$tags[$i]},$1;
      $i++;
      next;
   }
   # Save any <...> or [...] markup on this line as tags, then strip it from the text
   while (/( [\Q<[\E] [^\Q>]\E]+ [\Q>]\E] \n? )/ogx){
      push @{$tags[$i]},$1;
   }
   chomp;
   s/( [\Q<[\E] [^\Q>]\E]+ [\Q>]\E] )/ /ogx;
   $text .= " $_";
}

foreach my $i (0..$#text) {
   $text[$i][0] =~ s/$find/$replace/;
}


# Print each block's tags, inserting the text after the first <pp IX='0'/> tag
foreach my $i (0..$#tags) {
   my $flag = 1;
   foreach my $n (0..$#{$tags[$i]} ) {
      print $tags[$i][$n];
      if ($tags[$i][$n] =~ m#<pp IX='0'/># && $flag) {
         print $text[$i][0],"\n";
         $flag = 0;         
      }
   }
}

#print Dumper \@tags,\@text;

__DATA__
<Text>
<cp IX='0'/><pp IX='0'/>
This is
a data sample     imported <data import button>
from an excel [1 second pause before continuing]



spreadsheet!


<cp IX='0'/><pp IX='0'/><tp IX='0'/>Page <fld IX='0'>40</fld>
</Text>
<Text>
<cp IX='0'/><pp IX='0'/>
This is
a data sample     imported <data import button>
from an excel [1 second pause before continuing]



spreadsheet!


<cp IX='0'/><pp IX='0'/><tp IX='0'/>Page <fld IX='0'>40</fld>
</Text>


output:

Code:
<Text>
<cp IX='0'/><pp IX='0'/>
This is data I want to replace it with!
<data import button>
[1 second pause before continuing]
<cp IX='0'/><pp IX='0'/><tp IX='0'/>Page <fld IX='0'>40</fld>
</Text>
<Text>
<cp IX='0'/><pp IX='0'/>
This is data I want to replace it with!
<data import button>
[1 second pause before continuing]
<cp IX='0'/><pp IX='0'/><tp IX='0'/>Page <fld IX='0'>40</fld>

</Text>

Of course something this crazy is bound to blow up in your face and have numerous problems. And this is all very preliminary and somewhat contrived to the data you posted. If the real data is too much different then.....

Sorry for not having comments in the code but this took long enough as it was. Ask questions if needed.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 