newline in pattern

Status
Not open for further replies.

cmeyers

Programmer
Jul 6, 2001
24
US
I'm trying to snag multiple lines of data by including the newline character in my pattern:

Code:
gawk '/<NAME>.*<\/NAME>.*<SEQ>.*<\/SEQ>/' file.xml

NAME and SEQ are each on one line.

There may or may not be another line in between:
[tt]<DESC>Description Here</DESC>[/tt]

My hope was that ".*" would get the description line and the newlines on either side. My understanding was that "." matched even a newline. Arnold Robbins...where are you?

I spent some time (not a lot) searching the forum and the FAQ but came up empty.

Ultimately I would like to perform a global substitution once I can get the proper pattern to match.

Regards,
CM >:):O>
 

Hello, CraigMan!

A few words about metacharacters in regular expressions:

. matches any single character. awk strips the newline from each input line before matching, so . never sees one there.

.* matches any number of any character.

That's the theory.

I tried it. This is the file:

awk 1977

c 1971


Command awk '/./' file gives this result:

awk 1977
c 1971

Command awk '/.*/' file gives this result:

awk 1977

c 1971

Conclusion: .* matches any number of any character, including none at all, which is why the empty line is printed too.
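The two commands above can be reproduced with a short script. This is only a sketch: the temporary file and the variable names are made up for illustration. It counts the matching lines instead of printing them, and adds a third check suggesting that, with the default record separator, no record handed to the pattern still contains a newline.

```shell
# Recreate the three-line sample file (the middle line is empty).
sample=$(mktemp)
printf 'awk 1977\n\nc 1971\n' > "$sample"

# /./ needs at least one character, so the empty line is skipped.
dot_lines=$(awk '/./ { n++ } END { print n + 0 }' "$sample")

# /.*/ also matches the empty string, so every line matches.
star_lines=$(awk '/.*/ { n++ } END { print n + 0 }' "$sample")

# With the default RS the newline is removed before matching:
# count the records that still contain a newline character.
with_newline=$(awk '/\n/ { n++ } END { print n + 0 }' "$sample")

echo "dot=$dot_lines star=$star_lines newline=$with_newline"
rm -f "$sample"
```

So the blank line shows up under /.*/ because the pattern is allowed to match nothing, not because it swallowed a newline.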

I hope this helps.

KP.
 
You are right. I misunderstood "."
But...".*" does not work as I would expect.

My file will either look like this:
[tt]
<NAME>Name Here</NAME>
<DESC>Description Here</DESC>
<SEQ>Sequence Here</SEQ>
[/tt]
or like this:
[tt]
<NAME>Name Here</NAME>
<SEQ>Sequence Here</SEQ>
[/tt]
This is the code again:
Code:
gawk '/<NAME>.*<\/NAME>.*<SEQ>.*<\/SEQ>/' file.xml

Whether I have a Description line or not, I expect to match either the two or three lines above. I'm not getting it.

However, I do appreciate you clearing up my major metacharacter misconception!

Regards,
CraigMan >:):O>
 

CraigMan,

To match either the two or the three lines above, try this example:

awk '/<NAME>/, /<\/SEQ>/' file.xml

or this

awk '/NAME/, /SEQ/' file.xml

This is a pattern range. The range /NAME/, /SEQ/ prints every line from the first line matching NAME through the next line matching SEQ, inclusive.
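A quick way to see the range pattern at work is to run it on a file that also holds lines outside the NAME...SEQ blocks. The file contents below (the HEAD and TAIL lines and the second block) are invented for this sketch and are not from CraigMan's data.

```shell
xml=$(mktemp)
cat > "$xml" <<'EOF'
<HEAD>ignored</HEAD>
<NAME>Name Here</NAME>
<DESC>Description Here</DESC>
<SEQ>Sequence Here</SEQ>
<TAIL>ignored</TAIL>
<NAME>Other Name</NAME>
<SEQ>Other Sequence</SEQ>
EOF

# The range opens on a line matching <NAME> and closes on the first
# following line matching </SEQ>; both end lines are printed.
range_out=$(awk '/<NAME>/, /<\/SEQ>/' "$xml")

printf '%s\n' "$range_out"
rm -f "$xml"
```

Both blocks come out, with or without the DESC line, and the two "ignored" lines are dropped.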

Bye!

KP.
 

BTW, I am currently trying to parse XML with awk. I do it character by character (not line by line!).

KP.
 
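If the tags are each alone on a line, then this should work:

/<NAME>/ { collect = 1 }                     # start collecting at <NAME>
collect { x = (x == "" ? $0 : x RS $0) }     # join lines, no leading newline
/<\/SEQ>/ { collect = 0; print x; x = "" }   # flush the block at </SEQ>
END { if (x != "") print x }                 # flush an unterminated block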


cya

--
pkiller
 
Thanks for the input.

I'm familiar with both methods (pattern ranges //,// and setting/unsetting flags).

I'll probably use a pattern range since it's more concise.

However, I'm still disappointed and mystified.

Why does ".*" fail to match what it ought to match?

It's a matter of principle.

CraigMan >:):O>
 
Hi CraigMan,

To answer your question, awk in all its forms is a
line-oriented interpreter, as you may know. Each line is
read into $0 as a separate record with its newline removed,
so when the second line is read, the previous line no longer
has focus and is lost to further pattern matching.

The newline is what terminates each record. A pattern is
only ever tested against one line at a time, so ".*" cannot
run on into the next line.

To string multiple lines together you can concatenate
them into one, as my example code does. Note that the
example joins the lines with no newline between them.
By default, that is how awk reads its input.

Example:

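nawk '{
# Join consecutive non-blank lines into one string.
while ($0 !~ /^$/) {
line = line $0
# Stop at end of input; getline leaves $0 unchanged there,
# so without this check the loop would never finish.
if ((getline) <= 0) break
}

# Print the joined lines at a blank line or at end of input.
if ($0 ~ /^$/ || line != "") print line

line = ""

next

}' inputfile > outputfile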
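For what it is worth, the original pattern can be made to span lines once the lines have been joined into one string with the newlines kept in. The sketch below is one way to try that, not the method from the post above: buf is a made-up variable name, and it assumes an awk in which "." matches a newline that is actually present in the string, as gawk documents.

```shell
xml=$(mktemp)
cat > "$xml" <<'EOF'
<NAME>Name Here</NAME>
<DESC>Description Here</DESC>
<SEQ>Sequence Here</SEQ>
EOF

# Append every line to buf, separated by "\n", then test the pattern
# against the whole buffer once all input has been read.
match_result=$(awk '
    { buf = (NR == 1 ? $0 : buf "\n" $0) }
    END {
        if (buf ~ /<NAME>.*<\/NAME>.*<SEQ>.*<\/SEQ>/)
            print "matched across " NR " lines"
        else
            print "no match"
    }' "$xml")

echo "$match_result"
rm -f "$xml"
```

Holding the whole file in one string is only reasonable for small inputs; for anything large the pattern range or the flag approach is the safer choice.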

Hope this helps you!


flogrr
flogr@yahoo.com

 
I get it now. Thanks.

I also tried following Krunek's advice on character-by-character parsing and set RS="", which makes awk read everything up to the next blank line as one record.

What I find interesting is this method seems to choke on large files. The result is a core dump.
 
I had to abandon RS="". I should have realized the core dump I was getting was due to the limit awk has for the number of characters in an input record.

gawk just dumped on me. nawk gave me a meaningful error message.
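For small inputs, the effect of RS="" can be seen with a sketch like the one below. The two-block sample data is invented for illustration. With RS="" awk splits records at blank lines, so each record can contain newlines and the multi-line pattern gets a chance to match; a file with no blank lines at all becomes a single record, which is presumably how the record-length limit was reached.

```shell
xml=$(mktemp)
printf '<NAME>A</NAME>\n<SEQ>1</SEQ>\n\n<NAME>B</NAME>\n<DESC>d</DESC>\n<SEQ>2</SEQ>\n' > "$xml"

# RS="" (paragraph mode): records are separated by blank lines, so a
# record keeps its embedded newlines and the pattern can span them.
para_out=$(awk 'BEGIN { RS = "" }
    /<NAME>.*<\/NAME>.*<SEQ>.*<\/SEQ>/ { n++ }
    END { print NR " records, " (n + 0) " matched" }' "$xml")

echo "$para_out"
rm -f "$xml"
```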

Krunek: What is your method for parsing XML character-by-character?

CraigMan >:):O>

 
Hi!

You are right, CraigMan. I forgot this. Sorry.
Some awk implementations have fixed limits, for example:
100 fields
3000 chars per input record
3000 chars per output record
1024 chars per field

I am trying to make a simple, generic XML parser with
awk for small XML files. A valid XML document can
look like this:

<root><tag>data</tag><tag>data</tag></root>

For an XML document like this I suggest
character-by-character parsing.
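As a rough idea of what character-by-character parsing might look like, here is a minimal sketch. It is not Krunek's parser: the variable names (text, tag, in_tag) and the TAG/TEXT output format are hypothetical, and it ignores attributes, comments, CDATA and entities.

```shell
doc='<root><tag>data</tag><tag>data</tag></root>'

# Walk the record one character at a time with substr(). "<" starts a
# tag, ">" ends it, and anything outside a tag is collected as text.
tokens=$(printf '%s\n' "$doc" | awk '{
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "<") {
            if (text != "") print "TEXT " text
            text = ""; in_tag = 1; tag = ""
        } else if (c == ">" && in_tag) {
            print "TAG " tag
            in_tag = 0
        } else if (in_tag) {
            tag = tag c
        } else {
            text = text c
        }
    }
}')

printf '%s\n' "$tokens"
```

Each tag and each run of text comes out on its own line, so a second pass can work on tokens instead of raw characters.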

You can also see this thread:


But it's pretty easy to create an XML document
with awk. My gift for you and other awkers: an
awk program that generates an XML file from a
text file with space as the field separator:


# to_xml.awk - converts text data to XML format
# to_xml.awk - stores text data in XML format (translated from the Croatian)
# Kruno Peter, kruno_peter@yahoo.com
# awk, Public Domain, March 2001, last update: July 2001
# Jesus loves you.


# XML declaration, printed once before any input is read.
BEGIN { print "<?xml version=\"1.0\"?>" }

# Open the root element, named after the input file.
NR == 1 { print "<file filename=\"" FILENAME "\">" }

# One <row> per non-empty line, one <dataN> element per field.
! /^$/ {

print "<row>"

for (i = 1; i <= NF; i ++)
print " <data" i ">" $i "</data" i ">"

print "</row>"
}

# Close the root element.
END { print "</file>" }


If the input file, named "data.txt", looks like this:


Sinisa python 29
Krunek awk 30


Output will be:


<?xml version="1.0"?>
<file filename="data.txt">
<row>
 <data1>Sinisa</data1>
 <data2>python</data2>
 <data3>29</data3>
</row>
<row>
 <data1>Krunek</data1>
 <data2>awk</data2>
 <data3>30</data3>
</row>
</file>


Bye!

KP.
 
Krunek,

Thanks for the gift!

I appreciate the parsing advice as well as the XML generator.

CraigMan
 