XML Parser

Status
Not open for further replies.

ricgamch (Programmer; joined Jan 25, 2007; 3 messages; location: CR)

Hi People:

Could anybody tell me how I can do this with awk?
I have a file with some XML syntax:

#cat file
<SENT>word1 <ENT> ent1</ENT> word2 word3 word4<ENT>ent2</ENT> </SENT>
<SENT>word5 word6 word7 <ENT>ent3</ENT> word8 word9 word10<ENT>ent4</ENT></SENT>

I need a script that gets all the entities (entX) from a text file, along with the 2 previous words (wordX) and the 2 words (wordX) after each entity.

So I need to get this:

# -----WORD WORD ENT WORD WORD------
word1 ent1 word2 word3
word3 word4 ent2
word6 word7 ent3 word8 word9
word9 word10 ent4

Thanks in advance and regards! =)

-ric
 
I am more familiar with Perl than awk, and can see a method for doing this, but I'll let you write the code. Here is what I would probably do:

Open the file and read it line by line in a loop
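On each line, substitute every tag, <[^>]+>, with ! (where [^>]+ means one or more characters other than '>'; a greedy <.+> would swallow everything from the first '<' to the last '>' on the line)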
"split" the line into an array using [! ] as the word separator (where [! ] means '!' or 'space' characters)
Go through the array looking for 'entX'
When found, output the contents of the 5 locations in the array (if they exist) around & including 'entX'
Repeat the loop
Close the files
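
For what it's worth, the steps above could be sketched in awk roughly as follows. This is only a minimal sketch wrapped in a shell function: the name ent_context_split is made up, the tag pattern is written as <[^>]+> so that each tag is matched on its own, and it assumes the entities really are named ent1, ent2, ... as in the sample file.

```sh
# Hypothetical helper: prints up to two words on each side of every entX.
ent_context_split() {
  awk '
  {
    line = $0
    gsub(/<[^>]+>/, "!", line)           # each tag becomes "!"
    n = split(line, raw, "[! ]+")        # split on runs of "!" or spaces

    # split() can leave empty strings at the edges of the line; drop them.
    m = 0
    for (i = 1; i <= n; i++)
      if (raw[i] != "") w[++m] = raw[i]

    for (i = 1; i <= m; i++) {
      # Assumption: entities are recognisable by name (ent1, ent2, ...).
      if (w[i] !~ /^ent[0-9]+$/) continue
      out = ""
      for (j = i - 2; j <= i + 2; j++)
        if (j >= 1 && j <= m) out = out (out == "" ? "" : " ") w[j]
      print out
    }
  }' "$@"
}
```

After defining the function, "ent_context_split file" should print one line per entity. awk opens and closes the input file itself, so the first and last steps need no code.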


I hope that helps to get you started.

Mike
 

Thanks for writing, Mike042.

Well, I can get all the ENTs from the file now:

#awk -f script.awk file
ent1
ent2
ent3
ent4

This is my code:

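#cat script.awk
BEGIN {
    FS = "<ENT>"                     # every field after the first starts with an entity
}
{
    for (i = 1; i <= NF; i++)
    {
        #print NR "->" $i;
        FIN = match($i, "</ENT>")    # position of the closing tag, 0 if absent

        if (FIN > 0)
        {
            # substr() positions start at 1; print avoids using data as a printf format
            print substr($i, 1, FIN - 1)
        }
    }
}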



And I need it to return this:

pal1 ent1 pal2 pal3
pal3 pal4 ent2
pal6 pal7 ent3 pal8 pal9
pal9 pal10 ent4 pal11


Well, thanks!

-ric
 
Hi ricgamch,

The difficulty in awk is that, from your example, several different strings need to act as the input field separator (FS) at the same time:
<SENT> <ENT> </SENT> and </ENT>

How about replacing these with a single space for ease of processing? Then when you find a match, print the previous 2 words (if they exist), the current word ($i) and the next 2 words (if they exist).
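
One possible way to write that in awk is sketched below. Once every tag has become a space there is nothing left to say which word was the entity, so this sketch first glues a marker onto the word that follows <ENT>. The function name ent_context_marked and the @@ marker are my own inventions, and it assumes that each entity is a single word and that @@ never occurs in the real text.

```sh
# Hypothetical helper: entities are found via a marker, not by their name.
ent_context_marked() {
  awk '
  {
    gsub(/<ENT>[ \t]*/, " @@")    # tag the word that follows <ENT>
    gsub(/<[^>]+>/, " ")          # every remaining tag becomes a space
    # Both gsub() calls edit $0, so awk has already re-split the fields.
    for (i = 1; i <= NF; i++) {
      if ($i !~ /^@@/) continue
      out = ""
      for (j = i - 2; j <= i + 2; j++) {
        if (j < 1 || j > NF) continue
        word = $j
        sub(/^@@/, "", word)      # neighbours may be entities too
        out = out (out == "" ? "" : " ") word
      }
      print out
    }
  }' "$@"
}
```

Run it as "ent_context_marked file". Because the default field splitting is kept, the extra spaces around the tags in the sample do no harm.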

I hope that helps.

Mike
 