General programming in Awk

How do I write my first Awk program? by futurelet
Posted: 6 Dec 04 (Edited 22 Jan 05)

An Awk program consists of pairs like this:

test { actions }

Awk automatically reads the files listed on the command line and executes all of the test-actions pairs once for each line read. If the test succeeds (produces a result other than zero or the empty string), then the actions are performed.  Example:

CODE

/Harold/ { print $0 }
The test portion is  /Harold/.  It means, "Is 'Harold' found in the line just read?"  If the name is found, then the action is performed.  The line just read is referred to as $0.  So this program simply prints every line containing "Harold".


This program can be shortened to

CODE

/Harold/ { print }
When we tell Awk to print but don't tell it what to print, it prints $0.  Let's shorten the program even more.

CODE

/Harold/
Here we have omitted the  { actions }  portion entirely.  When we do that, Awk assumes we want { print $0 }.  Knowing this, we can write a very short program that prints every line of the file we are reading:

1

Any number other than 0 will do.  A less cryptic way of printing every line would be

{ print }

Here we have omitted the test  portion, so Awk assumes we want the action to be performed for every line read.
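
As a quick check that the shortened forms really are equivalent, here is a small sketch for a Unix shell.  The sample lines and the shell variable names (input, a, b, c) are made up for this illustration.

```shell
# Sample input invented for this illustration.
input='Harold was here
Nobody home
Ask Harold later'

# Three equivalent programs: explicit $0, bare print, and no action at all.
a=$(printf '%s\n' "$input" | awk '/Harold/ { print $0 }')
b=$(printf '%s\n' "$input" | awk '/Harold/ { print }')
c=$(printf '%s\n' "$input" | awk '/Harold/')

# Show the result of the first form; the other two hold the same text.
printf '%s\n' "$a"
```

All three variables should end up holding the two lines that contain "Harold".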

Let's make our previous program skip any line that contains "bogus".

CODE

/bogus/ { print "Skipping invalid line."
          next }
/Harold/
The  next  command tells Awk to skip the rest of the test-actions pairs and to read the next line from the input file immediately.

A shorter way of printing all lines with "Harold" but without "bogus":

CODE

/Harold/ && !/bogus/
The  &&  means "and"; the  !  means "not".  So Awk will print the line if it contains "Harold" and does not contain "bogus".
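
To compare the two versions side by side, here is a minimal sketch for a Unix shell.  The input lines and the names sample, with_next and short are invented for the example.  Note that the version using  next  also prints a message for each skipped line, while the shorter one stays silent.

```shell
# Hypothetical sample data, one record per line.
sample() {
  printf '%s\n' 'Harold is genuine' 'Harold is bogus' 'bogus entry' 'Just Harold'
}

# Version with "next": reports each skipped line, then moves on.
with_next=$(sample | awk '
/bogus/ { print "Skipping invalid line."
          next }
/Harold/')

# Shorter version: a single combined test, no message.
short=$(sample | awk '/Harold/ && !/bogus/')

printf '%s\n' "$with_next"
printf '%s\n' "$short"
```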

Now you can see that  /Harold/ && /genuine/  will display lines that have "Harold" and  "genuine".  But what if we want the line only if "genuine" follows "Harold"?  In that case, we can use the power of "regular expressions".


Regular expressions

On the command-line of your computer, you have probably typed something like  data*.txt  to refer to all files whose names start with "data" and that have the extension ".txt".  Regular expressions extend that capability even further.

The most important regular expression "wild card" characters (metacharacters) are:

/  Begins and ends a regular expression.
.  Stands for any single character.
*  Matches zero or more of the preceding item.
+  Matches one or more of the preceding item.
?  Matches 0 or 1 of the preceding item (makes the item optional).
[  Begins a character set (also called "character class").
]  Ends a character set.
|  Means "or".
^  Matches the beginning of the string.
$  Matches the end of the string.
(  Begins a group.
)  Ends a group.

So the solution to our problem is

CODE

/Harold.*genuine/
The  .  matches any character;  *  matches any number of the preceding item; together they match any sequence of characters.  So all of these lines in the file will be displayed:

Harold genuine
Haroldgenuine
Harold is genuine
Is Harold genuine?
Harold certainly isn't genuine.
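
Order matters to this pattern.  The sketch below (Unix shell; the input lines and the variable name out are assumptions made for the example) feeds in a few lines, including two that should not match because "genuine" is missing or comes before "Harold".

```shell
# The last two lines are there to show what does NOT match.
out=$(printf '%s\n' \
  'Harold genuine' \
  'Haroldgenuine' \
  'Is Harold genuine?' \
  'genuine Harold' \
  'Harold only' |
  awk '/Harold.*genuine/')

printf '%s\n' "$out"
```

Only the first three lines should be printed.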

What if we actually want to find "Harold" followed by a period and an asterisk?  When we want to search for a literal metacharacter, we have to "escape" it, that is, put a backslash in front of it:

CODE

/Harold\.\*/
To match either "Harold" or "harold", use a character set:

CODE

/[Hh]arold/
The  [Hh]  will match either "H" or "h".  Ranges can be used in character sets.  To display only lines that contain a numeral:

CODE

/[0-9]/
A  ^  at the start of a character set "negates" it.  To show lines that contain at least one non-numeral:

CODE

/[^0-9]/
To show lines that consist entirely of numerals:

CODE

/^[0-9]+$/
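
Here is a small sketch that runs the three character-set patterns against the same input, so the difference between "contains a numeral", "contains a non-numeral" and "consists entirely of numerals" is easy to see.  It assumes a Unix shell; the sample lines and the names sample, has_digit, has_nondigit and all_digits are made up.

```shell
# Hypothetical sample data.
sample() {
  printf '%s\n' '12345' 'abc' 'room 101' 'harold'
}

has_digit=$(sample | awk '/[0-9]/')        # at least one numeral
has_nondigit=$(sample | awk '/[^0-9]/')    # at least one non-numeral
all_digits=$(sample | awk '/^[0-9]+$/')    # nothing but numerals

printf 'has a numeral:\n%s\n' "$has_digit"
printf 'has a non-numeral:\n%s\n' "$has_nondigit"
printf 'all numerals:\n%s\n' "$all_digits"
```

Notice that "room 101" passes both of the first two tests, but only "12345" passes the third.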

Running your program

There are two ways to run an Awk program.  You can save it in a file and then type something like this on the command line:

awk -f myprog.awk  infile.txt >outfile.txt

The >outfile.txt makes the output go to a file instead of the screen.

If the program is short, it can be typed on the command line. Here's one that prints the line number of every line that has "foo" or a numeral.
For Unix:

awk '/foo/ || /[0-9]/ {print "Line " NR}' infile.txt

For DOS:

awk "/foo/ || /[0-9]/ {print \"Line \" NR}" infile.txt

NR  is a built-in variable that keeps track of how many records (lines) have been read.  This tiny program illustrates how strings are concatenated (joined together) in Awk.  By simply putting NR after the string "Line " we make Awk convert the number to a string and splice the two strings together before executing the print command.
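
Concatenation is not limited to two items.  As a sketch (Unix shell; the input lines and the variable name out are invented, and the ": " $0 part is an extension of the one-liner above, not part of it), this variant joins a string, NR, another string, and the line itself:

```shell
out=$(printf '%s\n' 'foo bar' 'nothing here' 'route 66' |
  awk '/foo/ || /[0-9]/ { print "Line " NR ": " $0 }')

printf '%s\n' "$out"
```

The second input line has neither "foo" nor a numeral, so only lines 1 and 3 are reported.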


Fields

Earlier it was mentioned that the variable $0 holds the line just read.  The variables $1, $2, $3, etc., are the fields into which Awk automatically splits $0.  Unless you change the variable FS, Awk uses whitespace (spaces and tabs) as the separator between fields.  NF holds the number of fields, so the last field is available as $NF. If the program is

{ print $1 "-" $NF }

and the input file is

Willy isn't nilly
Stop the growing gap
Good bye

the output will be

Willy-nilly
Stop-gap
Good-bye
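
Changing FS changes how the line is split.  The sketch below assumes colon-separated input (the data and the variable names sample, by_option and by_begin are made up) and sets the separator in two ways: with the standard -F command-line option, and by assigning FS in a BEGIN block.

```shell
# Hypothetical colon-separated data; note the space inside the 2nd field.
sample() {
  printf '%s\n' 'Stop:the growing:gap' 'Good:bye'
}

# Set the field separator on the command line ...
by_option=$(sample | awk -F: '{ print NF ": " $1 "-" $NF }')

# ... or inside the program, before any line is read.
by_begin=$(sample | awk 'BEGIN { FS = ":" } { print NF ": " $1 "-" $NF }')

printf '%s\n' "$by_option"
```

With ":" as the separator, "the growing" counts as a single field, so the first line has 3 fields instead of 4.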


Here's a longer program:

BEGIN { print "Adding columns 1, 2, and 3." }
{ sum1 += $1; sum2 += $2
  sum3 += $3
}
END { print sum1, sum2, sum3 }

BEGIN is a special test that is used to perform actions before any lines are read.  END is used to designate actions that will be done after all lines have been read.  Statements can be separated by ; or by putting them on separate lines. sum1 += $1 is equivalent to sum1 = sum1 + $1.
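
To see the BEGIN and END blocks in action, here is a sketch that runs the program on three lines of made-up numbers (Unix shell; the input and the variable name out are assumptions for the example).

```shell
out=$(printf '%s\n' '1 2 3' '4 5 6' '7 8 9' |
  awk '
BEGIN { print "Adding columns 1, 2, and 3." }
{ sum1 += $1; sum2 += $2
  sum3 += $3
}
END { print sum1, sum2, sum3 }')

printf '%s\n' "$out"
```

The BEGIN message comes out first, then the three column totals on one line, separated by spaces.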


Looping with "for"

To print the even integers 2 through 10:

CODE

for (i=2; i<=10; i+=2)
  print i
To print the integers 1 through 5 and flag the odd ones:

CODE

for (i=1; i<6; i++)
{ print i
  if ( i % 2 )
    print "  odd"
}
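
The loops above are statements, so to run them they have to sit inside an action.  One way, sketched here for a Unix shell, is to put them in a BEGIN block; Awk then needs no input file at all.  The variable names evens and flagged are made up.

```shell
# A program with only a BEGIN block runs without reading any input.
evens=$(awk 'BEGIN {
  for (i=2; i<=10; i+=2)
    print i
}')

flagged=$(awk 'BEGIN {
  for (i=1; i<6; i++)
  { print i
    if ( i % 2 )
      print "  odd"
  }
}')

printf '%s\n' "$evens"
printf '%s\n' "$flagged"
```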


Arrays

Arrays in Awk are "associative"; an array's indexes are strings.  In the code below, the output is shown in comments that begin with "# Output".

CODE

info["Tom"] = "A workaholic."
info[1] = "Indexed by '1'."
print info["1"]
# Output: Indexed by '1'.
info["year"] = 2005
for (i in info)
  print i "-->" info[i]
# Output (the order may differ):
#   Tom-->A workaholic.
#   year-->2005
#   1-->Indexed by '1'.

The order in which the entries are produced by  for (i in info)  will not necessarily be the order in which the entries were added.
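
Because indexes are strings, a field read from the input can be used directly as an index.  Here is a sketch that adds up a number per name (Unix shell; the data, the array name total and the variable name out are invented for the example).  It prints specific entries instead of looping with  for (i in total), so the output order is predictable, and it uses the  in  operator to test for an entry without creating it.

```shell
out=$(printf '%s\n' 'Tom 3' 'Ann 5' 'Tom 4' |
  awk '
{ total[$1] += $2 }          # index by the name in field 1
END {
  print "Tom", total["Tom"]
  print "Ann", total["Ann"]
  if (!("Bob" in total))     # test for an index without creating it
    print "Bob has no entry"
}')

printf '%s\n' "$out"
```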


Additional techniques:

BEGIN {
  s = "abc;def;ghi"
  # Print string starting at 2nd character.
  print substr( s, 2 )
  # Print 3rd character.
  print substr( s, 3, 1 )
  # Make an array from the string, splitting at ";".
  split( s, array, /;/ )
  # Print the members of the array.
  for (i=1; i in array; i++)
    print array[i]
  # Print the location of "gh" in s.
  print index( s, "gh" )
  # Print the uppercased string.
  print toupper( s )
}

# Now we're reading the input file.
# If this is the first line (in Awk-speak,
# lines are called "records"), print it.
1==NR

# If there's more than one field in the
# line, print sum of first 2 fields,
# padding with blanks for a width of 9.
NF > 1 { printf "%9g\n", $1+$2 }
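
Here is a sketch of what the last two rules do with a small input (Unix shell; the three input lines and the variable name out are made up).  The first line is printed as is, and every line with more than one field produces a right-aligned sum.

```shell
out=$(printf '%s\n' '3 4' 'solo' '1.5 2' |
  awk '
1==NR
NF > 1 { printf "%9g\n", $1+$2 }')

printf '%s\n' "$out"
```

The line "solo" has only one field, so it produces no sum.  The sums 7 and 3.5 are each padded on the left with blanks to a width of 9 characters.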


A useful function

The built-in function split() produces an array of the non-matching parts of the string.  Here's a function that makes an array containing both the matching and the non-matching parts, in this order:
<nonmatching><matching><nonmatching>...<matching><nonmatching>

CODE

# Produces array of nonmatching and matching substrings.
# The size of the array will always be an odd number.
# The first and the last item will always be nonmatching.
function shatter( s, array, re )
{ gsub( re, "\1&\1", s  )
  return split( s, array, "\1" )
}
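
A usage sketch for shatter(), under two assumptions: the string being split does not already contain the control character "\1", and the pattern is passed as a string, because a /.../ literal handed to a user-defined function would typically be evaluated as a match against $0 and arrive as 0 or 1.  The sample string and the names parts, n and out are made up.

```shell
out=$(awk '
function shatter( s, array, re )
{ gsub( re, "\1&\1", s  )
  return split( s, array, "\1" )
}
BEGIN {
  # Pass the pattern as a string (a dynamic regular expression).
  n = shatter( "foo123bar45", parts, "[0-9]+" )
  print n
  # Odd positions hold nonmatching text, even positions hold matches.
  for (i = 1; i <= n; i++)
    print i ": [" parts[i] "]"
}')

printf '%s\n' "$out"
```

Since the string ends with a match, the final nonmatching item is empty, which keeps the size of the array odd.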

