In awk, as in C, the equality operator is == rather than =. The single = assigns a value, whereas == compares values. When the pattern is a comparison, the pattern matches if the comparison is true (non-null or non-zero). Here's an example: what if you wanted to print only the lines where the first field has a numeric value of less than twenty? No problem in awk:
$1 < 20 {print $0}
If the expression is arithmetic, it is matched when it evaluates to a nonzero number. For example, here's a small program that will print the first ten lines that have exactly seven words:
BEGIN {i=0}
NF==7 { print $0 ; i++ }
i==10 {exit}
There's another way that you could use these comparisons too, since awk understands collation orders (that is, whether words are greater or lesser than other words in a standard dictionary ordering). Consider the situation where you have a phone directory (a sorted list of names) in a file and want to print all the names that would appear in the corporate phone guide before a certain person, say D. Hughes. You could do this quite succinctly:
$1 >= "Hughes,D" { exit }
{ print }
When the pattern is a string, a match occurs if the expression is non-null. In the earlier example with the pattern /Ann/, the pattern was treated as a regular expression since it was enclosed in slashes. In a comparison expression, if both operands have a numeric value, the comparison is based on the numeric value. Otherwise, the comparison is made using string ordering, which is why this simple example works.
The pattern $2 <= $1 could involve either a numeric comparison or a string comparison. Whichever it is, it will vary from file to file or even from record to record within the same file.
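To see the two kinds of comparison side by side, here is a small sketch. The three input records are made up for illustration: the first has two numeric fields, the second has a second field that is not a number, and the third has two words.

```shell
# Hypothetical sample data: three records with two fields each.
cmp_out=$(printf '10 9\n10 9x\nb a\n' | awk '$2 <= $1 { print }')
echo "$cmp_out"
```

The first record is compared as numbers (9 is less than 10), so it prints. The second is compared as strings, and "9x" sorts after "10", so it does not print. The third prints because "a" sorts before "b".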
Operator    Meaning
!           not
||          or (you can also use | in regular expressions)
&&          and
The pattern may be simple or quite complicated: (NF<3) || (NF>4). This matches all input records that have fewer than three or more than four fields. As is usual in awk, there are a wide variety of ways to do the same thing (specify a pattern). Regular expressions are allowed in string matching, but their use is not forced. To form a pattern that matches strings beginning with a or b or c or d, there are several pattern options:
/^[a-d].*/
/^a.*/ || /^b.*/ || /^c.*/ || /^d.*/
For instance, consider the following simple input file:
$ cat mydata
1 0
3 1
4 1
5 1
7 0
4 2
5 2
1 0
4 3
The first range I try, '$1==3,$1==5', produces:
$ awk '$1==3,$1==5' mydata
3 1
4 1
5 1
Compare this to the following pattern and output.
$ awk '$1>=3 && $1<=5' mydata
3 1
4 1
5 1
4 2
5 2
4 3
Range patterns cannot be parts of a combined pattern.
The remainder of this chapter explores the action part of a pattern action statement. As the name suggests, the action part tells awk what to do when a pattern is found. Patterns are optional. An awk program built solely of actions looks like other iterative programming languages. But looks are deceptive: even without a pattern, awk matches every input record to the first pattern action statement before moving to the second.
Actions must be enclosed in curly braces ({}) whether accompanied by a pattern or alone. An action part may consist of multiple statements. When the statements have no pattern and are single statements (no compound loops or conditions), brackets for
each individual action are optional provided the actions begin with a left curly brace and end with a right curly brace. Consider the following two action pieces:
{name = $1
print name}
and
{name = $1}
{print name}
These two produce identical output.
Variables are an integral part of any programming language: they are the virtual boxes within which you can store values, count things, and more. In this section, I talk about variables in awk. Awk has three types of variables: user-defined variables, field variables, and predefined variables that are provided by the language automatically. The next section is devoted to a discussion of built-in variables. Awk doesn't have variable declarations. A variable comes to life the first time it is mentioned; in a twist on René Descartes's philosophical conundrum, you use it, therefore it is. The section concludes with an example of turning an awk program into a shell script.
The rule for naming user-defined variables is that they can be any combination of letters, digits, and underscores, as long as the name starts with a letter. It is helpful to give a variable a name indicative of its purpose in the program. Variables already defined by awk are written in all uppercase. Since awk is case-sensitive, ofs is not the same variable as OFS, and capitalization (or lack thereof) is a common error. You have already seen field variables: variables beginning with $, followed by a number, and indicating a specific input field.
A variable is a number or a string or both. There is no type declaration, and type conversion is automatic if needed. Recall the car sales file used earlier. For illustration suppose I enter the program awk -F: '{ print $1 * 10 }' emp.data, and awk obligingly provides the rest:
0
0
0
0
0
Of course, this makes no sense! The point is that awk did exactly what it was asked without complaint: it multiplied the name of the employee times ten, and when it tried to translate the name into a number for the mathematical operation it failed,
resulting in a zero. Ten times zero, needless to say, is zero...
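The conversion rules can be tried on a single made-up record. In this sketch the input line and the variable name conv_out are my own inventions; the point is only to show a word turning into zero, a string with a leading number keeping that number, and a number being glued onto a string.

```shell
# Hypothetical one-line input: a name, a number with trailing letters, a number.
conv_out=$(echo "Anderson 12abc 7" | awk '{ print $1 * 10, $2 + 1, $3 "" 5 }')
echo "$conv_out"
```

The name becomes 0 before it is multiplied, "12abc" is read as 12 before 1 is added, and the last expression treats 7 and 5 as strings and joins them.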
Before examining the next example, review what you know about shell programming (Chapters 10-14). Remember, every file containing shell commands needs to be changed to an executable file before you can run it as a shell script. To do this you should
enter chmod +x filename from the command line.
Sometimes awk's automatic type conversion benefits you. Imagine that I'm still trying to build an office system with awk scripts and this time I want to be able to maintain a running monthly sales total based on a data file that contains individual
monthly sales. It looks like this:
$ cat monthly.sales
John Anderson,12,23,7
Joe Turner,10,25,15
Susan Greco,15,13,18
Bob Burmeister,8,21,17
These need to be added together to calculate the running totals for each person's sales. Let a program do it!
$ cat total.awk
BEGIN {FS=","; OFS=","} # split input on commas; set OFS to keep the file format the same.
{print $1, " monthly sales summary: " $2+$3+$4 }
That's the awk script, so let's see how it works:
$ awk -f total.awk monthly.sales
John Anderson, monthly sales summary: 42
Joe Turner, monthly sales summary: 50
Susan Greco, monthly sales summary: 46
Bob Burmeister, monthly sales summary: 46
Your task has been reduced to entering the monthly sales figures in the sales file and editing the program file total.awk to include the correct number of fields (if you put in a for loop, for(i=2;i<=NF;i++), the number of fields is correctly calculated, but printing is a hassle and needs an if statement with 12 else if clauses).
In this case, not having to wonder if a digit is part of a string or a number is helpful. Just keep an eye on the input data, since awk performs whatever actions you specify, regardless of the actual data type with which you're working.
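As a rough sketch of the for loop idea, the following adds up however many monthly figures a line happens to contain, so the program does not need editing each month. The variable names total and sum_out are mine, and the input is two lines borrowed from the sales file.

```shell
sum_out=$(printf 'John Anderson,12,23,7\nJoe Turner,10,25,15\n' |
  awk 'BEGIN { FS = ","; OFS = "," }
       { total = 0
         for (i = 2; i <= NF; i++) total += $i   # add every monthly figure
         print $1, " monthly sales summary: " total }')
echo "$sum_out"
```

Resetting total at the start of each record keeps one person's sales from leaking into the next person's sum.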
This section discusses the built-in variables found in awk. Because there are many versions of awk, I included notes for those variables found in nawk, POSIX awk, and gawk since they all differ. As before, unless otherwise noted, the variables of
earlier releases may be found in the later implementations. Awk was released first and contains the core set of built-in variables used by all updates. Nawk expands the set. The POSIX awk specification encompasses all variables defined in nawk plus one
additional variable. Gawk applies the POSIX awk standards and then adds some built-in variables which are found in gawk alone; the built-in variables noted when discussing gawk are unique to gawk. This list is a guideline not a hard and fast rule. For
instance, the built-in variable ENVIRON is formally introduced in the POSIX awk specifications; it exists in gawk; it is also in the System V implementation of nawk, but SunOS nawk doesn't have the variable ENVIRON. (See the section "Oh man! I need help." in Chapter 5 for more information on how to use man pages.)
As I stated earlier, awk is case sensitive. In all implementations of awk, built-in variables are written entirely in upper case.
When awk first became a part of UNIX, the built-in variables were the bare essentials. As the name indicates, the variable FILENAME holds the name of the current input file. Recall the function finder code; type the new line below:
/function functionname/,/} \/\* end of functionname/ {print $0}
END {print ""; print "Found in the file " FILENAME}
This adds the finishing touch.
The value of the variable FS determines the input field separator. FS has a space as its default value. The built-in variable NF contains the number of fields in the current record (remember, fields are akin to words, and records are input lines). This
value may change for each input record.
What happens if within an awk script I have the following statement?
$3 = "Third field"
It reassigns $3 and rebuilds $0 from the fields using OFS; if the record had fewer than three fields, NF is also raised to 3. The total number of records read may be found in the variable NR. The variable OFS holds the value for the output field separator. The default value of OFS is a space. The value for the output format for numbers resides in the variable OFMT which has a default value of %.6g. This is the format specifier for the print statement, though its syntax comes from the C printf format string. ORS is the output record separator. Unless changed, the value of ORS is newline (\n).
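Here is a small sketch of how field assignment interacts with NF, $0, and OFS. The input records and the shell variable names are hypothetical.

```shell
# Assigning past the last field extends the record and raises NF.
nf_out=$(echo "a b" | awk '{ $4 = "d"; print NF; print $0 }')
# Assigning to any field rebuilds the whole record using the current OFS.
ofs_out=$(echo "a b c" | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')
echo "$nf_out"
echo "$ofs_out"
```

In the first command the record grows to four fields, with an empty third field, so two separators appear between b and d. In the second, the otherwise pointless assignment $1 = $1 is a common way to force awk to rejoin the fields with a new OFS.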
The built-in variable ARGC holds the value for the number of command line arguments. The variable ARGV is an array containing the command line arguments. Subscripts for ARGV begin with 0 and continue through ARGC-1. ARGV[0] is always awk. The available
UNIX options do not occupy ARGV. The variable FNR represents the number of the current record within that input file. Like NR, this value changes with each new record. FNR is always <= NR. The built-in variable RLENGTH holds the value of the length of
string matched by the match function. The variable RS holds the value of the input record separator. The default value of RS is a newline. The start of the string matched by the match function resides in RSTART. Between RSTART and RLENGTH, it is possible
to determine what was matched. The variable SUBSEP contains the value of the subscript separator. It has a default value of "\034".
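As an illustration of RSTART and RLENGTH working together, this sketch pulls the matched text back out of the record with substr. The sample text and the pattern are arbitrary choices of mine.

```shell
match_out=$(echo "hello world" |
  awk '{ if (match($0, /o w/)) print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')
echo "$match_out"
```

The match begins at the fifth character and is three characters long, so substr($0, RSTART, RLENGTH) returns exactly the text that matched.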
The POSIX awk specification introduces one new built-in variable beyond those in nawk. The built-in variable ENVIRON is an array that holds the values of the current environment variables. (Environment variables are discussed more thoroughly later in
this chapter.) The subscript values for ENVIRON are the names of the environment variables themselves, and each ENVIRON element is the value of that variable. For instance, ENVIRON["HOME"] on my PC under Linux is "/home". Notice that
using ENVIRON can save much system dependence within awk source code in some cases but not others. ENVIRON["HOME"] at work is "/usr/anne" while my SunOS account doesn't have an ENVIRON variable because it's not POSIX compliant.
Here's an example of how you could work with the environment variables:
ENVIRON["EDITOR"] == "vi" {print NR,$0}
This program prints my program listings with line numbers if I am using vi as my default editor. More on this example later in the chapter.
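A quick way to experiment with ENVIRON is to set the environment variable for a single command. This sketch assumes an awk that provides ENVIRON; the messages it prints are invented for the example.

```shell
env_out=$(EDITOR=vi awk 'BEGIN {
  if (ENVIRON["EDITOR"] == "vi") print "vi user"
  else print "other editor"
}')
echo "$env_out"
```

Note that the subscript is the quoted string "EDITOR". Without the quotes, awk would treat EDITOR as an empty variable of its own and look up the wrong element.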
The GNU group further enhanced awk by adding four new variables to gawk, its public re-implementation of awk. Gawk does not differ between UNIX versions as much as awk and nawk do, fortunately. These built-in variables are in addition to those mentioned
in the POSIX specification as described above. The variable CONVFMT contains the conversion format for numbers. The default value of CONVFMT is "%.6g" and is for internal use only. The variable FIELDWIDTHS allows a programmer the option of having
fixed field widths rather than a single character field separator. The values of FIELDWIDTHS are numbers separated by a space or Tab (\t), so fields need not all be the same width. When the FIELDWIDTHS variable is set, each field is expected to have a
fixed width. Gawk separates the input record using the FIELDWIDTHS values for field widths. If FIELDWIDTHS is set, the value of FS is disregarded. Assigning a new value to FS overrides the use of FIELDWIDTHS; it restores the default behavior.
To see where this could be useful, let's imagine that you've just received a datafile from accounting that indicates the different employees in your group and their ages. It might look like:
$ cat gawk.datasample
1Swensen, Tim 24
1Trinkle, Dan 22
0Mitchel, Carl 27
The very first character, you find out, indicates if they're hourly or salaried: a value of 1 means that they're salaried, and a value of 0 is hourly. How to split that character out from the rest of the data field? With the FIELDWIDTHS statement.
Here's a simple gawk script that could attractively list the data:
BEGIN {FIELDWIDTHS = "1 8 1 4 1 2"}
{ if ($1 == 1) print "Salaried employee "$2,$4" is "$6" years old.";
else print "Hourly employee "$2,$4" is "$6" years old."
}
The output would look like:
Salaried employee Swensen, Tim is 24 years old.
Salaried employee Trinkle, Dan is 22 years old.
Hourly employee Mitchel, Carl is 27 years old.
The variable IGNORECASE controls the case sensitivity of gawk regular expressions. If IGNORECASE has a nonzero value, pattern matching ignores case for regular expression operations. The default value of IGNORECASE is zero; all regular expression
operations are normally case sensitive.
Awk program statements are, by their very nature, conditional; if a pattern matches, then a specified action or actions occurs. Actions, too, have a conditional form. This section discusses conditional flow. It focuses on the syntax of the if statement,
but, as usual in awk, there are multiple ways to do something.
A conditional statement does a test before it performs the action. One test, the pattern match, has already happened; this test is an action. The last two sections introduced variables; now you can begin putting them to practical uses.
An if statement takes the form of a typical iterative programming language control structure where E1 is an expression, as mentioned in the "Patterns" section earlier in this chapter:
if (E1) S2; else S3
While E1 is always a single expression, S2 and S3 may be either single- or multiple-action statements (that means conditions in conditions are legal syntax, but I am getting ahead of myself). Returns and indentation are, as usual in awk, entirely up to you. However, if S2 and the else statement are on the same line, and S2 is a single statement, a semicolon must separate S2 from the else statement. When awk encounters an if statement, evaluation occurs as follows: first E1 is evaluated, and if E1 is nonzero or nonnull (true), S2 is executed; if E1 is zero or null (false) and there's an else clause, S3 is executed. For instance, if you want to print a blank line when the third field has the value 25 and the entire line in all other cases, you could use a program snippet like this:
{ if ($3 == 25)
print ""
else
print $0 }
The portion of the if statement involving S3 is completely optional since sometimes your choice is limited to whether or not to have awk execute S2:
{ if ($3 == 25)
print "" }
Although the if statement is an action, E1 can test for a pattern match using the pattern-match operator ~. As you have already seen, you can use it to look for my name in the password file another way. The first way is shorter, but they do the same
thing.
$ awk '/Ann/' /etc/passwd
$ awk '{if ($0 ~ /Ann/) print $0}' /etc/passwd
One use of the if statement combined with a pattern match is to further filter the screen input. For example, here I'm going to print only the lines in the password file that contain both Ann and a capital M character:
$ awk '/Ann/ { if ($0 ~ /M/) print}' /etc/passwd
amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh
Either S2 or S3 or both may consist of multiple-action statements. If any of them do, the group of statements is enclosed in curly braces. Curly braces may be put wherever you wish as long as they enclose the action. The rule of thumb: if it's one statement, the braces are optional. More than one and they're required.
You can also use multiple else clauses. The car sales example gets one field longer each month. The first two fields are always the salesperson's name and the last field is the accumulated annual total, so it is possible to calculate the month by the
value of NF:
if(NF==4) month="Jan."
else if(NF==5) month="Feb"
else if(NF==6) month="March"
else if(NF==7) month="April"
else if(NF==8) month="May" # and so on
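To try a chain like this on a couple of records, here is a cut-down sketch. The two input lines are invented, and the final else branch with the value "later" is only a placeholder for the months I have left out.

```shell
month_out=$(printf 'John Anderson 7 7\nJoe Turner 10 15 25\n' |
  awk '{ if (NF == 4) month = "Jan."
         else if (NF == 5) month = "Feb"
         else month = "later"          # placeholder for the remaining months
         print $1, $2, month }')
echo "$month_out"
```

Each test uses ==, the comparison operator. Writing NF = 4 instead would assign 4 to NF, and the first branch would be taken for every record.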
Nawk also has a conditional statement, really just shorthand for an if statement. It takes the format shown and uses the same conditional operator found in C:
E1 ? S2 : S3
Here, E1 is an expression, and S2 and S3 are single-action statements. When it encounters a conditional statement, awk evaluates it in the same order as an if statement: first E1 is evaluated; if E1 is nonzero or nonnull (true), S2 is executed; if E1 is
zero or null (false), S3 is executed. Only one statement, S2 or S3, is chosen, never both.
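A minimal sketch of the conditional form, using a made-up two-line input in which the first field is 1 or 0. The variable kind is my own name for the chosen label.

```shell
cond_out=$(printf '1 Tim\n0 Carl\n' |
  awk '{ kind = ($1 == 1) ? "Salaried" : "Hourly"; print kind, $2 }')
echo "$cond_out"
```

Storing the result in a variable first keeps the print statement simple, which can help when the two alternatives are long.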
The conditional statement is a good place for the programmer to provide error messages. Return to the hourly and salaried employee example. When we wanted to differentiate between hourly and salaried employees, we had a big if-else statement:
{ if ($1 == 1) print "Salaried employee "$2,$4" is "$6" years old.";
else print "Hourly employee "$2,$4" is "$6" years old."
}
In fact, there's an easier way to do this with conditional statements:
{ print ($1==1? "Salaried":"Hourly") " employee "$2,$4" is "$6" years old." }
At first glance, and for short statements, the if statement appears identical to the conditional statement. On closer inspection, the statement you should use in a specific case differs. Either is fine for use when choosing between two single statements, but the if statement is required for more complicated situations, such as when S2 and S3 are multiple statements. Use if for multiple else statements (the first example), or for a condition inside a condition like the second example below:
{ if (NR == 100)
{ print ""
print "This is the 100th record"
print $0
print
}
}
{ if($1==0)
if(name~/Fred/)
print "Fred is broke" }
As if that does not provide ample choice, notice that the program relying on pattern-matching (had I chosen that method) produces the same output. Look at the program and its output.
$ cat lowsales.awk
BEGIN {OFS="\t"}
$(NF-1) <= 7 {print $1, $(NF-1), "Check Sales"}
$(NF-1) > 7 {print $1, $(NF-1) } # Next to last field
$ awk -f lowsales.awk emp.data
John Anderson 7 Check Sales
Joe Turner 15
Susan Greco 18
Bob Burmeister 17
Since the two patterns above are nonoverlapping and one immediately follows the other, the two programs accomplish the same thing. Which to use is a matter of programming style. I find the conditional statement or the if statement more readable than two
patterns in a row. When you are choosing whether to use the nawk conditional statement or the if statement because you're concerned about printing two long messages, using the if statement is cleaner. Above all, if you choose to use the conditional statement, keep in mind you can't use awk; you must use nawk or gawk.
People often write programs to perform a repetitive task or several repeated tasks. These repetitions are called loops. Loops are the subject of this section. The loop structures of awk very much resemble those found in C. First, let's look at a shortcut notation for counting. Then I'll show you the ways to program loops in awk. The looping constructs of awk are the do (nawk), for, and while statements. As with multiple-action groups in an if statement, curly braces ({}) surround a group of action statements associated in a loop. Without curly braces, only the statement immediately following the keyword is considered part of the loop.
The section concludes with a discussion of how (and some examples of why) to interrupt a loop.
As stated earlier, assignment statements take the form x = y, where the value y is being assigned to x. Awk has some shorthand methods of writing this. For example, to add a monthly sales total to the car sales file, you'll need to add a variable to
keep a running total of the sales figures. Call it total. You need to start total at zero and add each $(NF-1) as read. In standard programming practice, that would be written total = total + $(NF-1). This is okay in awk, too. However, a shortened format
of total += $(NF-1) is also acceptable.
There are two ways to indicate line += 1 and line -= 1 (awk shorthand for line = line + 1 and line = line - 1). They are called increment and decrement, respectively, and can be further shortened to the simpler line++ and line--. At any reference to a variable, you can not only use this notation but even vary whether the action is performed immediately before or after the value is used in that statement. This is called prefix and postfix notation, and is represented by ++line and line++.
For clarity's sake, focus on increment for a moment. Decrement functions the same way using subtraction. Using the ++line notation tells awk to do the addition before doing the operation indicated in the line. Using the postfix form says to do the
operation in the line, then do the addition. Sometimes the choice does not matter; keeping a counter of the number of sales people (to later calculate a sales average at the end of the month) requires a counter of names. The statements totalpeople++ and
++totalpeople do the same thing and are interchangeable when they occupy a line by themselves. But suppose I decide to print the person's number along with his or her name and sales. Adding either of the second two lines below to the previous example
produces different results based on starting both at totalpeople=1.
$ cat awkscript.v1
BEGIN { totalpeople = 1 }
{print ++totalpeople, $1, $(NF-1) }
$ cat awkscript.v2
BEGIN { totalpeople = 1 }
{print totalpeople++, $1, $(NF-1) }
The first example will actually have the first employee listed as #2, since the totalpeople variable is incremented before it's used in the print statement. By contrast, the second version will do what we want because it'll use the variable value, then
afterwards increment it to the next value.
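The difference between the two forms is easy to check without any input file. In this sketch, a and b are throwaway variables that both start at 1.

```shell
incr_out=$(awk 'BEGIN {
  a = 1; b = 1
  print ++a, b++   # prefix adds before the value is used, postfix after
  print a, b       # both variables have been incremented by now
}')
echo "$incr_out"
```

The first line shows the values as the print statement saw them, and the second shows that both variables end up at 2 either way.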
Awk provides the while statement for general looping. It has the following form:
while(E1)
S1
Here, E1 is an expression (a condition), and S1 is either one action statement or a group of action statements enclosed in curly braces. When awk meets a while statement, E1 is evaluated. If E1 is true, S1 executes from start to finish, then E1 is again
evaluated. If E1 is true, S1 again executes. The process continues until E1 is evaluated to false. When it does, execution continues with the next action statement after the loop. Consider the program below:
{ while ($0~/M/)
print
}
This loop never ends once a line containing M is read, because nothing in the body changes $0. Typically the condition (E1) tests a variable, and the variable is changed in the while loop.
{ i=1
while (i<20)
{ print i
i++
}
}
This second code snippet will print the numbers from 1 to 19, then once the while loop tests with i=20, the condition of i<20 will become false and the loop will be done.
Nawk provides the do statement for looping in addition to the while statement. The do statement takes the following form:
do
S
while (E)
Here, S is either a single statement or a group of action statements enclosed in curly braces, and E is the test condition. When awk comes to a do statement, S is executed once, and then condition E is tested. If E evaluates to nonzero or nonnull, S
executes again, and so on until the condition E becomes false. The difference between the do and the while statement rests in their order of evaluation. The while statement checks the condition first and executes the body of the loop if the condition is
true. Use the while statement to check conditions that may be initially false. For instance, while (not end-of-file(input)) is a common example. The do statement executes the loop first and then checks the condition. Use the do statement when testing a
condition which depends on the first execution to meet the condition.
The do statement can be imitated using the while statement. Put the code that is in the loop before the condition as well as in the body of the loop.
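The order of evaluation is easiest to see when the condition is false from the start. This sketch, which assumes an awk with the do statement (nawk or later), runs both loops with the same starting value; the labels are only there to tell the two apart.

```shell
loop_out=$(awk 'BEGIN {
  i = 5
  while (i < 3) { print "while:", i; i++ }   # never runs: the test fails first
  i = 5
  do { print "do:", i; i++ } while (i < 3)   # runs once before the test
}')
echo "$loop_out"
```

Only the do loop produces a line of output, since its body executes once before the condition is ever checked.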
The for statement is a compacted while loop designed for counting. Use it when you know ahead of time that S is a repetitive task and the number of times it executes can be expressed as a single variable. The for loop has the following form:
for(pre-loop-statements;TEST;post-loop-statements)
S
Here, pre-loop-statements usually initialize the counting variable; TEST is the test condition; and post-loop-statements indicate any loop variable increments.
For example,
{ for(i=1; i<=30; i++) print i }
This is a succinct way of saying initialize i to 1, then continue looping while i<=30, and incrementing i by one each time through. The statement executed each time simply prints the value of i. The result of this statement is a list of the numbers 1
through 30.
The for loop can also be used involving loops of unknown size:
for (i=1; i<=NF; i++)
print $i
This prints each field on its own line. True, you don't know what the number of fields will be, but you do know NF will contain that number.
The for loop does not have to be incremented; it could be decremented instead:
$ awk -F: '{ for (i = NF; i > 0; i--) print $i }' sales.data
This prints the fields in reverse order, one per line.
The only restriction of the loop control value is that it must be an integer. Because of the desire to create easily readable code, most programmers try to avoid branching out of loops midway. Awk offers two ways to do this, however, if you need it: break and continue. Sometimes unexpected or invalid input leaves little choice but to exit the loop or have the program crash, something a programmer strives to avoid. Input errors are one accepted time to use the break statement. For instance, when reading the car sales data into the array name, I wrote the program expecting five fields on every line. If something happens and a line has the wrong number of fields, the program is in trouble. A way to protect your program from this is to have code like:
{ for(i=1; i<=NF; i++)
if (NF != 5) {
print "Error on line " NR ": invalid input...leaving loop."
break }
else
continue with program code...
The break statement terminates only the loop. It is not equivalent to the exit statement which transfers control to the END statement of the program. I handle the problem as shown on the CD-ROM in file LIST15_1.
As another use for the break statement consider do S while (1). It is an infinite loop depending on another way out. Suppose your program begins by displaying a menu on screen. (See the LIST 15_2 file on the CD-ROM.)
The above example shows an infinite loop controlled with the break statement giving the end user a way out.
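For a self-contained illustration of an endless loop with break as the only exit, here is a tiny sketch. It is not the menu program from the CD-ROM; the arithmetic test simply stands in for whatever condition signals that it is time to stop.

```shell
break_out=$(awk 'BEGIN {
  n = 0
  while (1) {              # deliberately endless
    n++
    if (n * n > 50) break  # the only way out of the loop
  }
  print n                  # execution resumes here after the break
}')
echo "$break_out"
```

The loop stops at the first n whose square exceeds 50, and the print statement after the loop still runs, which is the difference between break and exit.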
The continue statement causes execution to skip the remainder of the current iteration in both the do and the while statements. Control transfers to the evaluation of the test condition. In the for loop, control goes to the post-loop-statements. When is this of use? Consider computing a true sales ratio by calculating the amount sold and dividing that number by hours worked.
Since this is all kept in separate files, the simplest way to handle the task is to read the first list into an array, calculate the figure for the report, and do whatever else is needed.
FILENAME=="total" read each $(NF-1) into monthlytotal[i]
FILENAME=="per" with each i
monthlytotal[i]/$2
whatever else
But what if $2 is 0? The program will crash because dividing by 0 is an illegal operation. While it is unlikely that an employee will miss an entire month of work, it is possible. So, it is a good idea to allow for the possibility. This is one use for the continue statement. The above program segment expands to Listing 15.1.
BEGIN { star = 0
other stuff...
}
FILENAME=="total" { for(i=1;i<=NF;i++)
monthlyttl[i]=$(NF-1)
}
FILENAME=="per" { for(i=1;i<=NF;i++)
if($2 == 0) {
print "*"
star++
continue }
else
print monthlyttl[i]/$2
whatever else
}
END { if(star>=1)
print "* indicates employee did not work all month."
else
whatever
}
The above program makes some assumptions about the data in addition to assuming valid input data. What are these assumptions and more importantly, how do you fix them? The data in both files is assumed to be the same length, and the names are assumed to
be in the same order.
Recall that in awk, array subscripts are stored as strings. Since each list contains a name and its associated figure, you can match names. Before running this program, run the UNIX sort utility to ensure the files have the names in alphabetical order (see "Sorting Text Files" in Chapter 6). After making changes, use file LIST15_4 on the CD-ROM.
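One way to match names through array subscripts is sketched below. This is not the CD-ROM listing: the file contents, the temporary directory, and the array name total are all assumptions made for the example, and it relies on FNR, so it needs nawk or later.

```shell
dir=$(mktemp -d)
printf 'Anderson 42\nTurner 50\nGreco 46\n' > "$dir/total"
printf 'Greco 23\nAnderson 21\nTurner 0\n'  > "$dir/per"
ratio_out=$(awk '
  NR == FNR      { total[$1] = $2; next }     # first file: remember each total by name
  !($1 in total) { next }                     # name missing from the first file
  $2 == 0        { print $1, "*"; next }      # avoid dividing by zero
                 { print $1, total[$1] / $2 }
' "$dir/total" "$dir/per")
echo "$ratio_out"
rm -r "$dir"
```

Because each total is looked up by name, the two files in this sketch do not need to be in the same order or the same length. The output follows the order of the second file.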
There are two primary types of data that awk can work with: numeric values, or sequences of characters and digits that comprise words, phrases or sentences. The latter are called strings within awk and most other programming languages. For instance, "now is the time for all good men" is a string. A string is always enclosed in double quotes (""). It can be almost any length (the exact number varies from UNIX version to version).
One of the important string operations is called concatenation. The word means putting together. When you concatenate two strings you are creating a third string that is the combination of string1, followed immediately by string2. To perform
concatenation in awk simply leave a space between two strings.
print "My name is" "Ann."
This prints the line:
My name isAnn.
(To ensure that a space is included you can either use a comma in the print statement or simply add a space to one of the strings: print "My name is " "Ann").
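The three ways of joining the two strings can be compared directly. In this sketch a and b are my own variable names holding the two strings from the example.

```shell
cat_out=$(awk 'BEGIN {
  a = "My name is"; b = "Ann."
  print a b        # plain concatenation: no space is added
  print a " " b    # a space supplied as a string of its own
  print a, b       # the comma inserts OFS, a space by default
}')
echo "$cat_out"
```

Only the first print statement runs the two strings together; the other two produce the same spaced line by different routes.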
As a rule, awk returns the leftmost, longest string in all its functions. This means that it first finds the match that starts earliest (farthest to the left) and then, from that starting point, collects the longest string possible. For instance, if the pattern you are looking for is "y+" in the string "any of the guyys knew it" then the match returns the single "y" in "any", because it occurs first; within "guyys" alone, the same pattern returns "yy" rather than "y".
Let's consider the different string functions available, organized by awk version.
The original awk contained few built-in functions for handling strings. The length function returns the length of the string. It has an optional argument. If you use the argument, it must follow the keyword and be enclosed in parentheses:
length(string). If there is no argument, the length of $0 is the value. For example, it is difficult to determine from some screen editors if a line of text stops at 80 characters or wraps around. The following invocation of awk aids by listing just those
lines that are longer than 80 characters in the specified file.
$ awk '{ if (length > 80) print NR ": " $0 }' file-with-long-lines
The other string function available in the original awk is substring, which takes the form substr(string,position,len) and returns the len length substring of the string starting at position.
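Both functions can be tried on a string constant. The sample string here is an arbitrary choice.

```shell
str_out=$(awk 'BEGIN {
  s = "Unix Unleashed"
  print length(s)         # number of characters, counting the space
  print substr(s, 6, 4)   # four characters starting at position 6
}')
echo "$str_out"
```

Positions are counted from 1, so position 6 is the U of the second word.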
When awk was expanded to nawk, many built-in functions were added for string manipulation while keeping the two from awk. The function gsub(r, s, t) substitutes string s into target string t every time the regular expression r occurs and returns the
number of substitutions. If t is not given, gsub() uses $0. The target must be a variable, field, or array element, because it is changed in place. For instance, if name holds "Randall", gsub(/l/, "y", name) turns name into Randayy and returns 2. The g in gsub means global because all occurrences in the target string change.
The function sub(r, s, t) works like gsub(), except the substitution occurs only once, at the leftmost match. Thus sub(/l/, "y", name) turns Randall into Randayl and returns 1. The place the substring t occurs in string s is returned by the function index(s, t):
index("Chris", "i") returns 4. As you'd expect, the return value is zero if substring t is not found. The function match(s, r) returns the position in s where the regular expression r occurs. It returns the index where the matching substring
begins or 0 if there is no match. It also sets RSTART to that index and RLENGTH to the length of the match.
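Here is a minimal sketch that exercises the four functions in one BEGIN action. The variable names name, n, and where are my own choices for the example; the strings are the ones used in the text.

```shell
result=$(awk 'BEGIN {
    name = "Randall"               # gsub and sub change a variable in place
    n = gsub(/l/, "y", name)       # n is the number of replacements made
    print n, name
    name = "Randall"
    n = sub(/l/, "y", name)        # only the leftmost match is replaced
    print n, name
    print index("Chris", "i")      # position of "i" within "Chris"
    where = match("any of the guyys knew it", /y+/)
    print where, RSTART, RLENGTH   # the leftmost match is the y in "any"
}')
printf '%s\n' "$result"
```

Note that sub and gsub return a count, not the changed string; the changed text is left in the target variable.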
The split function separates a string into parts. For example, if your program reads in a date as 5-10-94, and later you want it written May 10, 1994, the first step is to divide the date appropriately. The built-in function split does this:
split("5-10-94", store, "-") divides the date, and sets store[1] = "5", store[2] = "10", and store[3] = "94". Notice that here the subscripts start with 1, not 0.
The POSIX awk specification added two built-in functions for use with strings. They are tolower(str) and toupper(str). Both functions return a copy of the string str with the alphabetic characters converted to the appropriate case. Non-alphabetic
characters are left alone.
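The date conversion described for split, along with the two case functions, can be sketched as follows. The month-name list and the assumption that every year belongs to the 1900s are mine, added only to finish the example.

```shell
result=$(awk 'BEGIN {
    # The month list is an assumption of this sketch, used to look names up by number.
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")
    n = split("5-10-94", store, "-")          # n is 3, the number of pieces
    print month[store[1]] " " store[2] ", 19" store[3]
    print toupper("Hughes,D"), tolower("Hughes,D")
}')
printf '%s\n' "$result"
```

Because split returns the number of pieces, a program can check n before trusting store[3].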
Gawk provides two functions returning time-related information. The systime() function returns the current time of day in seconds since midnight UTC (Coordinated Universal Time, the successor to Greenwich Mean Time), January 1, 1970, on POSIX systems. The
function strftime(f, t), where f is a format and t is a timestamp of the same form as returned by systime(), returns a formatted timestamp similar to the ANSI C function strftime().
String constants are the way awk identifies a non-keyboard, but essential, character. When you use one in a string, you must enclose it in double quotes (""); in a regular expression, it goes between the slashes. These constants may appear in printing or in patterns involving regular
expressions. For instance, the following command prints all lines less than 80 characters long that don't begin with a tab. See Table 15.3.
awk 'length < 80 && !/^\t/' another-file-with-long-lines
Expression | Meaning
\\ | The way of indicating to print a backslash.
\a | The "alert" character; usually the ASCII BEL.
\b | A backspace character.
\f | A formfeed character.
\n | A newline character.
\r | Carriage return character.
\t | Horizontal tab character.
\v | Vertical tab character.
\x | Indicates the following value is a hexadecimal number.
\0 | Indicates the following value is an octal number.
An array is a method of storing pieces of similar data in the computer for later use. Suppose your boss asks for a program that reads in the name, social security number, and a bunch of personnel data to print check stubs and the detachable check. For
three or four employees keeping name1, name2, etc. might be feasible, but at 20, it is tedious and at 200, impossible. This is a use for arrays! See file LIST15_5 on the CD-ROM.
Much easier, cleaner, and quicker! It also works for any number of employees without code changes. Awk only supports single-dimension arrays. (See the section "Advanced Concepts" for how to simulate multiple-dimensional arrays.) That and a few
other things set awk arrays apart from the arrays of other programming languages. This section focuses on arrays; I will explain their use, then discuss their special property. I conclude by listing three features of awk (a built-in function, a built-in
variable, and an operator) designed to help you work with arrays.
Arrays in awk, like variables, don't need to be declared. Further, no indication of size must be given ahead of time; in programming terms, you'd say arrays in awk are dynamic. To create an array, give it a name and put its subscript after the name in
square brackets ([]), name[2] from above, for instance. Array subscripts are also called the indices of the array; in name[2], 2 is the index to the array name, and it accesses the one name stored at location 2.
Awk arrays are different from those of other programming languages because in awk, array subscripts are stored as strings, not numbers. Technically, the term is associative arrays and it's unusual in programming languages. Be aware that the use of
strings as subscripts can confuse you if you think purely in numeric terms. Since "3" > "15" as strings, an array element with a subscript of "15" sorts before one with a subscript of "3", even though numerically 3 < 15.
Since subscripts are strings, a subscript can be a field value. grade[$1]=$2 is a valid statement, as is salary["John"].
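To see a field value used as a subscript, here is a minimal sketch built around the grade[$1]=$2 statement. The names and scores are invented sample data.

```shell
result=$(printf 'Ann 91\nJohn 78\nAnn 95\n' | awk '
    { grade[$1] = $2 }               # a later line for the same name overwrites
    END { print "Ann", grade["Ann"]; print "John", grade["John"] }
')
printf '%s\n' "$result"
```

Ann appears twice in the input, so only her last score survives; that is the usual behavior of an associative array.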
Nawk++ has additions specifically intended for use with arrays. The first is a test for membership. Suppose Mark Turner enrolled late in a class I teach, and I don't remember if I added his name to the list I keep on my computer. The following program
checks the list for me.
{ name[$0] = 1 }    # assumes one full name per line; the name itself is the subscript
END { if ("Mark Turner" in name)    # in tests subscripts, not stored values
print "He's enrolled in the course!"
}
The delete statement removes array elements from computer memory. To remove an element, for example, you could use the command delete name[1].
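The membership test and delete fit together naturally. In this sketch the array name enrolled and the two sample names are my own; the point is that in looks at subscripts, so the names must be stored as subscripts.

```shell
result=$(printf 'Ann Jones\nMark Turner\n' | awk '
    { enrolled[$0] = 1 }                      # the whole line is the subscript
    END {
        if ("Mark Turner" in enrolled) print "before delete: enrolled"
        delete enrolled["Mark Turner"]
        if (!("Mark Turner" in enrolled)) print "after delete: not enrolled"
    }
')
printf '%s\n' "$result"
```

Unlike a reference such as enrolled["Mark Turner"], the in test does not create the element as a side effect.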
Although technology is advancing and memory is not the precious commodity it once was considered to be, it is still a good idea to clean up after yourself when you write a program. Think of the check printing program above. Two hundred names won't fill
the memory. But if your program controls personnel activity, it writes checks and checkstubs; adds and deletes employees; and charts sales. It's better to update each file to disk and remove the arrays not in use. For one thing, there is less chance of
reading obsolete data. It also consumes less memory and minimizes the chance of using an array of old data for a new task. The clean-up can be most easily done:
END { i = totalemps
while (i > 0) {
delete name[i]
delete data[i]
i--    # step down to the next employee; without this the loop never ends
}
}
Nawk++ creates another built-in variable for use when simulating multidimensional arrays. More on its use appears later, in the section "Advanced Concepts." It is called SUBSEP and has a default value of "\034". To add this variable
to awk, just create it in your program:
BEGIN { SUBSEP = "\034" }
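As a brief preview, and only a sketch, this is roughly how SUBSEP lets a one-dimensional array stand in for a two-dimensional one. The array name sales and its contents are invented.

```shell
result=$(awk 'BEGIN {
    sales["Ann", "May"] = 120          # stored under the one subscript "Ann" SUBSEP "May"
    if (("Ann", "May") in sales) print "found", sales["Ann", "May"]
    for (key in sales) {
        split(key, part, SUBSEP)       # recover the two halves of the subscript
        print part[1], part[2]
    }
}')
printf '%s\n' "$result"
```

The comma inside the brackets is what tells awk to join the two values with SUBSEP.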
Recall that in awk, array subscripts are stored as strings. Since each list contains a name and its associated figure, you can match names and hence match files. Here are the answers to the question about using two files and assuring they have the same
order (from the car sales example earlier). Before running this program, run the UNIX sort utility to ensure the files have the names in alphabetical order. (See "Sorting Text Files" in Chapter 6.) After making changes, use the program in file
LIST15_6 on the CD-ROM.
Although awk is primarily a language for pattern matching, and hence, text and strings pop into mind more readily than math and numbers, awk also has a good set of math tools. In this section, first I show the basics, then we look at the math functions
built into awk.
Awk supports the usual math operations. The expression x^y is x superscript y, that is, x to the y power. The % operator calculates remainders in awk: x%y is the remainder of x divided by y, and the result is machine-dependent. All math uses floating
point, and numbers are equivalent no matter which format they are expressed in, so 100 == 1.00e+02.
The math operators in awk consist of the four basic functions: + (addition), - (subtraction), / (division), and * (multiplication), plus ^ and % for exponentiation and remainder.
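A few one-line checks make the operators concrete. This is just a sketch; the numbers are arbitrary.

```shell
result=$(awk 'BEGIN {
    print 2 ^ 10              # exponentiation
    print 17 % 5              # remainder
    print 7 / 2               # division is floating point, not integer
    print (100 == 1.00e+02)   # same number written two ways; 1 means true
}')
printf '%s\n' "$result"
```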
As you saw earlier in the most recent sales example, fields can be used in arithmetic too. If, in the middle of the month, my boss asks for a list of the names and latest monthly sales totals, I don't need to panic over the discarded figures; I can just
print a new list. My first shot seems simple enough (Listing 15.2).
BEGIN {OFS="\t"}
{ print $1, $2, $6 } # field #6 = May
Then a thought hits. What if my boss asks for the same thing next month? Sure, changing a field number each month is not a big deal, but is it really necessary?
I look at the data. No matter what month it is, the current month's totals are always the next to last field. I start over with the program in Listing 15.3.
BEGIN {OFS="\t"}
{ print $1,$2, $(NF-1) }
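To check that $(NF-1) really follows the data as it grows, here is a sketch run against two invented records of different lengths. The names and figures are not from the sales file.

```shell
# Two made-up records: one with six fields, one with five.
result=$(printf 'Ann Jones 10 20 30 99\nBob Lee 5 15 25\n' | awk '
    BEGIN { OFS = "\t" }
    { print $1, $2, $(NF-1) }      # NF-1 is computed first, then used as a field number
')
printf '%s\n' "$result"
```

The parentheses matter: $NF-1 would mean the last field minus one.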
Another use for arithmetic concerns assignment. Field variables may be changed by assignment. Given the following file, the statement $3 = 7 is a valid statement and produces the results below:
$ cat inputfile
1 2
3 4
5 6
7 8
9 10
$ awk '{$3 = 7; print}' inputfile
1 2 7
3 4 7
5 6 7
7 8 7
9 10 7
If I run the following program, the first four records appear on the monitor, each followed by its NF, showing the new values.
{ if(NR==1)
print $0, NF }
{ if (NR >= 2 && NR <= 4) { $3=7; print $0, NF } }
END {print $0, NF }
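Now when we run the data file through awk here's what we see. (An awk that keeps the last record available in the END action, as current versions do, prints one more line as well: 9 10 2.)
$ awk -f newsample.awk inputfile
1 2 2
3 4 7 3
5 6 7 3
7 8 7 3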
Awk has a well-rounded selection of built-in numeric functions. As before in the sections on "Built-in Variables" and "Strings," the functions build on each other beginning with those found in awk.
To start, awk has built-in functions exp(exp), log(exp), sqrt(exp), and int(exp) where int() truncates its argument to an integer.
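A short sketch of these four functions follows. The last line builds a base-10 logarithm out of log(), which is a natural logarithm; the rounding step is my own addition to absorb floating point error.

```shell
result=$(awk 'BEGIN {
    print int(3.9), int(-3.9)              # truncation toward zero, not rounding
    print sqrt(144)
    print exp(0), log(1)
    print int(log(1000) / log(10) + 0.5)   # base-10 log of 1000, rounded
}')
printf '%s\n' "$result"
```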
Nawk added further arithmetic functions to awk. It added atan2(y,x) which returns the arctangent of y/x. It also added two random number generator functions: rand() and srand(x). There is also some disagreement over which functions originated in awk and
which in nawk. Most versions have all the trigonometric functions in nawk, regardless of where they first appeared.
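Two common uses of the newer functions, as a sketch: atan2 gives you pi without typing its digits, and rand with int gives you a die roll. The seed 42 and the variable names are arbitrary choices of mine.

```shell
result=$(awk 'BEGIN {
    pi = atan2(0, -1)                  # the arctangent of 0/-1 is pi
    printf("%.5f\n", pi)
    srand(42)                          # fixed seed so a run can be repeated
    roll = int(rand() * 6) + 1         # rand() is at least 0 and less than 1
    print (roll >= 1 && roll <= 6)     # 1 means the roll is in range
}')
printf '%s\n' "$result"
```

The actual roll depends on the awk version, so the sketch only checks that it lands between 1 and 6.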
This section takes a closer look at the way input and output function in awk. I examine input first and look briefly at the getline function of nawk++. Next, I show how awk output works, and the two different print statements in awk: print and printf.
Awk handles the majority of input automatically; there is no explicit read statement, unlike most programming languages. Each pattern action statement of the program is applied to each input record in the order the records appear in the input file. If the input file
has 20 records then the first pattern action statement in the program looks for a match 20 times. The next statement causes awk to read the next input record immediately and start over with the first pattern action statement, without trying the current record against the remaining statements. The exit
statement acts as if all input has been processed. When awk encounters an exit statement, control goes to the END pattern action statement, if there is one.
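The difference between next and exit is easiest to see side by side. In this sketch the input lines and the count variable are invented.

```shell
result=$(printf 'skip me\nkeep 1\nkeep 2\nstop\nkeep 3\n' | awk '
    /^skip/ { next }              # go to the next record; later rules are not tried
    /^stop/ { exit }              # act as if input were exhausted; END still runs
    { count++; print }
    END { print count " kept" }
')
printf '%s\n' "$result"
```

The record "keep 3" is never seen, because exit stops the reading of input.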
One addition, when awk was expanded to nawk, was the built-in function getline. It is also supported by the POSIX awk specification. The function may take several forms. At its simplest, it's written getline. When written alone, getline retrieves the
next input record and splits it into fields as usual, setting FNR, NF and NR. The function returns 1 if the operation is successful, 0 if it is at the end of the file (EOF), and -1 if the function encounters an error. Thus,
while (getline == 1)
simulates awk's automatic input.
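Here is that loop as a complete, if tiny, program. It is only a sketch: everything happens in the BEGIN action, so awk's own automatic reading never starts, and the three input lines are made up.

```shell
result=$(printf 'one\ntwo\nthree\n' | awk 'BEGIN {
    while (getline == 1)          # 1 means a record was read; 0 is EOF, -1 an error
        print NR ": " $0
}')
printf '%s\n' "$result"
```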
Writing getline variable reads the next record into variable (getline char from the earlier menu example, for instance). Field splitting does not take place, and $0 and NF are left unchanged; but FNR and NR are incremented. Either of the above two may be
written using input from a file besides the one containing the input records by appending < "filename" on the end of the command. Furthermore, getline char < "-" takes the input from standard input, normally the keyboard. As you'd expect, neither FNR
nor NR is affected when the input is read from another file. You can also write either of the two above forms taking the input from a command, as in "command" | getline.
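The file and command forms can be sketched together. The scratch file, the echo command, and the variable names other, line, word, and before are all assumptions made for the demonstration.

```shell
tmp=$(mktemp)                         # scratch file created just for this sketch
printf 'alpha\nbeta\n' > "$tmp"
result=$(printf 'main record\n' | awk -v other="$tmp" '
    {
        before = NR
        getline line < other          # reads "alpha"; $0, NF, NR, and FNR are untouched
        "echo hello" | getline word   # reads one line from a command
        close("echo hello")
        print $0, line, word, (NR == before)
    }
')
rm -f "$tmp"
printf '%s\n' "$result"
```

The final 1 in the output confirms that NR did not move while the other file and the command were read.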
There are two forms of printing in awk: the print statement and the printf statement. Until now, I have used the print statement. It is the default choice. There are two forms of the print statement. One has parentheses; one doesn't. So, print $0 is the same
as print($0). In awk shorthand, the statement print by itself is equivalent to print $0. As shown in an earlier example, a blank line is printed with the statement print "". Use the format you prefer.
For a simple example consider file1:
$ cat file1
1	10
3	8
5	6
7	4
9	2
10	0
The command line
$ nawk 'BEGIN {FS="\t"}; {print($1>$2)}' file1
shows
0
0
0
1
1
1
on the monitor.
Knowing that 0 indicates false and 1 indicates true, the above is what you'd expect, but most programming languages won't print the result of a relation directly. Nawk will.
Nawk prints the results of relations with both print and printf. Both print and printf require the use of parentheses when a relation is involved, however, to distinguish between > meaning greater than and > meaning the redirection operator.
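Here is a sketch of a parenthesized relation on a mix of numbers and words. The tab-separated sample records are invented; the last one is compared as strings because its fields are not numeric.

```shell
result=$(printf '7\t4\n1\t10\napple\tbanana\n' | awk '
    BEGIN { FS = "\t" }
    { print ($1 > $2) }           # the parentheses make > a comparison, not a redirection
')
printf '%s\n' "$result"
```

Without the parentheses, print $1 > $2 would write the first field to a file named by the second.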
printf is used when the use of formatted output is required. It closely resembles C's printf. Like the print statement, it comes in two forms: with and without parentheses. Either may be used, except the parentheses are required when using a relational
operator. (See below.)
printf format-specifier, variable1,variable2, variable3,..variablen
printf(format-specifier, variable1,variable2, variable3,..variablen)
The format specifier is always required with printf. It contains both any literal text and the specific format for displaying any variables you want to print. The format for each variable always begins with a %. Any combination of three modifiers may occur: a
- indicates the variable should be left justified within its field; a number indicates the total width of the field should be that number, and if the number begins with a 0, the field is padded with zeros: %05d means to make the variable 5 wide and pad with 0s as needed; the last modifier is
.number, whose meaning depends on the type of variable: the number indicates either the maximum string width or the number of digits to follow to the right of the decimal point. After zero or more modifiers, the display format ends with a
single character indicating the type of variable to display.
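The three modifiers are easier to tell apart when the field edges are visible. In this sketch I wrap each format in square brackets, which are plain literal text, not part of the format.

```shell
result=$(awk 'BEGIN {
    printf("[%5d]\n", 42)          # width 5, right justified
    printf("[%-5d]\n", 42)         # width 5, left justified
    printf("[%05d]\n", 42)         # width 5, padded with zeros
    printf("[%.2f]\n", 3.14159)    # two digits after the decimal point
    printf("[%8.2f]\n", 3.14159)   # width 8 and two decimals together
}')
printf '%s\n' "$result"
```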
Remember the format specifier has a string value and since it does, it must always be enclosed in double quotes("), whether it is a literal string such as
printf("This is an example of a string in the display format.")
or a combination,
printf("This is the %d example", occurrence)
or just a variable
printf("%d", occurrence).
Before I go into detail about display format modifiers, I will show the characters used for display types. The following list shows the format specifier types without any modifiers.
Format | Meaning
%c | An ASCII character
%d | A decimal number (an integer, no decimal point involved)
%i | Just like %d (Remember i for integer)
%e | A floating point number in scientific notation (1.000000e+01)
%f | A floating point number (10001010.434)
%g | awk chooses between %e or %f display format, the one producing a shorter string is selected. Nonsignificant zeros are not printed.
%o | An unsigned octal (base 8) number
%s | A string
%x | An unsigned hexadecimal (base 16) number
%X | Same as %x but letters are uppercase rather than lowercase.
Look at some examples without display modifiers. When the file file1 looks like this:
$ cat file1
34
99
-17
2.5
-.3
the command line
awk '{printf("%c %d %e %f\n", $1, $1, $1, $1)}' file1
produces the following output:
" 34 3.400000e+01 34.000000
c 99 9.900000e+01 99.000000
_ -17 -1.700000e+01 -17.000000
_ 2 2.500000e+00 2.500000
0 -3.000000e-01 -0.300000
By contrast, a slightly different format string produces dramatically different results with the same input:
$ awk '{printf("%g %o %x\n", $1, $1, $1)}' file1
34 42 22
99 143 63
-17 37777777757 ffffffef
2.5 2 2
-0.3 0 0
Now let's change file1 to contain just a single word:
$ cat file1
Example
The string above has seven characters. For clarity, I have used * instead of a blank space so the total field width is visible on paper.
printf("%s\n", $1)
Example
printf("%9s\n", $1)
**Example
printf("%-9s\n", $1)
Example**
printf("%.4s\n", $1)
Exam
printf("%9.4s\n", $1)
*****Exam
printf("%-9.4s\n", $1)
Exam*****
One topic pertaining to printf remains. The function printf was written so that it writes exactly what you tell it to write, and how you want it written, no more and no less. That is acceptable until you realize that you can't enter every character
you may want to use from the keyboard. Awk uses the same escape sequences found in C for nonprinting characters. The two most important to remember are \n for a newline and \t for a tab character.
Unlike most programming languages, there is no way to open a file in awk; opening files is implicit. However, you must close a file if you intend to read from it after writing to it. Suppose your awk program opens a pipe with the command "cat file1 < file2".
Before you can read file2 you must close the pipe. To do this, use the statement close("cat file1 < file2"), giving close exactly the same string that opened the pipe. You may also do the same for a file: close("file2").
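The write, close, then read sequence can be sketched with a scratch file. The variable names out and saved and the temporary file are assumptions made for the example.

```shell
tmp=$(mktemp)                     # scratch file created just for this sketch
result=$(awk -v out="$tmp" 'BEGIN {
    print "first line" > out      # the file is opened implicitly for writing
    close(out)                    # flush and close before reading it back
    getline saved < out
    print saved
}')
rm -f "$tmp"
printf '%s\n' "$result"
```

Without the close, the getline may find the file still empty, because the output can sit in a buffer.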
As you have probably noticed, awk presents a programmer with a variety of ways to accomplish the same thing. This section focuses on the command line. You will see how to pass command line arguments to your program from the command line and how to set
the value of built-in variables on the command line. A summary of command line options concludes the section.
Command line arguments are available in awk through a built-in array called, as in C, ARGV. Again echoing C semantics, the value of the built-in ARGC is the number of elements in ARGV, counting the command name stored in ARGV[0]; awk's own options and the program file are not included. Given the command line awk -f programfile
infile1, ARGC has a value of 2. ARGV[0] = awk and ARGV[1] = infile1.
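A quick way to inspect the two variables is to print them from a BEGIN action. In this sketch infile1 and infile2 are only names; since the program has nothing but a BEGIN action, awk never tries to open them.

```shell
result=$(awk 'BEGIN {
    print ARGC                                    # counts ARGV[0] plus the two names
    for (i = 1; i < ARGC; i++) print i, ARGV[i]   # ARGV[0] is skipped; it varies by system
}' infile1 infile2)
printf '%s\n' "$result"
```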
It is possible to pass variable values from the command line to your awk program just by stating the variable and its value, as in the command line awk -f programfile infile x=1 FS=,. Normally, command line arguments are filenames,
but the equal sign indicates an assignment. Each assignment takes effect when awk reaches it in the argument list, which lets variables change value before and after a file is read. The order matters for files as well: when the input is from multiple files, the order they are listed on the command line becomes very important since the first named
input file is the first input read. Consider the command line awk -f program file2 file1 and this program segment.
NR == 1 { if (FILENAME != "foo") {    # FILENAME is not yet set in BEGIN, so test at the first record
print "Unexpected input...Abandon ship!"
exit
}
}
The programmer has written this program to accept one file as first input and anything else causes the program to do nothing except print the error message.
awk -f program x=1 file1 x=2 file2
The change in variable values above can also be used to check the order of files. Since you (the programmer) know their correct order, you can check for the appropriate value of x.
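To watch x change between files, here is a sketch that builds two one-line files in a scratch directory and then runs the command line shown above against them. The directory and the file contents are invented.

```shell
dir=$(mktemp -d)                  # scratch directory created just for this sketch
printf 'a\n' > "$dir/file1"
printf 'b\n' > "$dir/file2"
# Each assignment takes effect just before the file that follows it is read.
result=$(awk '{ print x, $0 }' x=1 "$dir/file1" x=2 "$dir/file2")
rm -rf "$dir"
printf '%s\n' "$result"
```

Note that assignments made this way are not yet in effect inside a BEGIN action.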
This section discusses user-defined functions, also known in some programming languages as subroutines. For a discussion of functions built into awk see either "Strings" or "Arithmetic" as appropriate.
The ability to add, define, and use functions was not originally part of awk. It was added in 1985 when awk was expanded. Technically, this means you must use either nawk or gawk, if you intend to write awk functions; but again, since some systems use
the nawk implementation and call it awk, check your man pages before writing any code.
An awk function definition statement appears like the following:
function functionname(list of parameters) {
the function body
}
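As a concrete instance of this form, here is a sketch of a two-parameter function. The name max is my own choice, not a built-in, and the input numbers are made up.

```shell
result=$(printf '3 9\n12 4\n' | awk '
    # max is a hypothetical helper written for this sketch.
    function max(a, b) {
        return (a > b) ? a : b
    }
    { print max($1 + 0, $2 + 0) }     # adding 0 forces a numeric comparison
')
printf '%s\n' "$result"
```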
A function can exist anywhere a pattern action statement can be. As most of awk is, functions are free format but must be separated with either a semicolon or a n