These are probably familiar to you from arithmetic and other programming languages you may have seen. From this, you can surmise that the command
gawk '$4 > 100' testfile will display every line in testfile in which the value in the fourth column is greater than 100.
All of the normal arithmetic operators are available, including addition, subtraction, multiplication, and division. There are also more advanced operators for exponentiation and remainders (also called modulus). Table 26.2 shows the basic arithmetic operations that gawk supports.
You can combine column numbers and math, too. For example, the action
{print $3/2}
divides the number in the third column by 2.
There is also a set of arithmetic functions for trigonometry and generating random numbers. See Table 26.3.
The order of operations is important to gawk, as it is to regular arithmetic. The rules gawk follows are the same as with arithmetic: all multiplications, divisions, and remainders are performed before additions and subtractions. For example,
the command
{print $1+$2*$3}
multiplies column two by column three and then adds the result to column one. If you wanted to force the addition first, you would have to use parentheses:
{print ($1+$2)*$3}
Because these are the same rules you have heard about since grade school, they shouldn't cause you any confusion. Remember, if in doubt, put parentheses in the proper places to force the operations.
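The precedence rules above can be checked with a quick sketch. The sample record (2 3 4) and the variable names a, b, and out are made up for illustration, and the example calls awk, which on many Linux systems is gawk or a compatible implementation; substitute gawk if you prefer.

```shell
# One made-up record with three numeric columns.
out=$(printf '2 3 4\n' | awk '{
    a = $1 + $2 * $3      # multiplication first: 2 + 12
    b = ($1 + $2) * $3    # parentheses force the addition first: 5 * 4
    print a, b
}')
echo "$out"
```

With this input the two results differ (14 versus 20), which is why parentheses are worth adding whenever the intended order is in doubt.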
| Placeholder | Description |
| c | If a string, the first character of the string; if an integer, the character that matches the first value |
| d | An integer |
| e | A floating-point number in scientific notation |
| f | A floating-point number in conventional notation |
| g | A floating-point number in either scientific or conventional notation, whichever is shorter |
| o | An unsigned integer in octal format |
| s | A string |
| x | An unsigned integer in hexadecimal format |
Whenever you use one of the format characters, you can place a number before the character to set the minimum number of digits or characters to be used. For example, the format %6d prints an integer in a field at least six characters wide. Several formats can appear in one format string, but each must have a matching value after the string, as in this example:
{printf "%5s works for %5s and earns %2d an hour", $1, $2, $3}
Here, the first string is the first column, the second string is the second column, and the third set of digits is from the third column in a file. The output would be something like:
Joe works for Mike and earns 12 an hour
A few little tricks are useful. As you saw in an earlier example, strings are right-justified, so the command
{printf "%5s likes this language\n", $2}
results in the output
  Tim likes this language
Geoff likes this language
 Mike likes this language
  Joe likes this language
To left-justify the names, place a minus sign in the format statement:
{printf "%-5s likes this language\n", $2}
This will result in the output
Tim   likes this language
Geoff likes this language
Mike  likes this language
Joe   likes this language
Notice that the name is justified on the left instead of on the right.
When dealing with numbers, you can specify the precision to be used, so that the command
{printf "%5s earns $%.2f an hour", $3, $6}
will take the string in column three and print it in the first placeholder, padded on the left to at least five characters, and then take the value in the sixth column and place it in the second placeholder with two digits after the decimal point. The output of the command would be like this:
Joe earns $12.17 an hour
The dollar sign was inside the quotation marks in the printf command, and was not generated by the system. It has no special meaning inside the quotation marks. You can also set the overall width of the number. The command
{printf "%5s earns $%6.2f an hour", $3, $6}
will print the number in a field at least six characters wide, counting the period and the two digits after it.
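A small sketch makes the width and precision rules visible. The input record and the square brackets around each placeholder are assumptions added here so the padding can be seen; awk may be replaced with gawk.

```shell
# Made-up record: a name and an hourly rate.
out=$(printf 'Joe 12.166\n' | awk '{
    # %5s   right-justifies the string in at least 5 columns
    # %-5s  left-justifies it in at least 5 columns
    # %6.2f uses at least 6 columns in total, with 2 digits after the point
    printf "[%5s] [%-5s] [%6.2f]\n", $1, $1, $2
}')
echo "$out"
```

The width is a minimum, not a limit: a longer string or number is printed in full rather than truncated.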
Finally, we can impose some formatting on the output lines themselves. In an earlier example, you saw the use of \n to add a newline character. Sequences like this are called escape codes, because the backslash tells gawk to interpret the character that follows it differently. Table 26.5 shows the important escape codes that gawk supports.
| Code | Description |
| \a | Bell |
| \b | Backspace |
| \f | Formfeed |
| \n | Newline |
| \r | Carriage return |
| \t | Tab |
| \v | Vertical tab |
| \ooo | Octal character ooo |
| \xdd | Hexadecimal character dd |
| \c | Any character c |
You can, for example, escape a quotation mark by using the sequence \", which will place a quotation mark in the string without interpreting it to mean something special. For example:
{printf "I said \"Hello\" and he said \"Hello\"."}
Awkward-looking, perhaps, but necessary to avoid problems. You'll see lots more escape characters used in examples later in this chapter.
As I mentioned earlier, the default field separator is whitespace (spaces or tabs). This is not always convenient, as we found with the /etc/passwd file. You can change the field separator on the gawk command line by using the -F option followed by the separator you want to use:
gawk -F":" '/tparker/{print}' /etc/passwd
This command changes the field separator to a colon and searches the /etc/passwd file for the lines containing the string tparker. The new field separator is put in quotation marks to avoid any confusion. Also, the -F option (it must be a capital F) comes before the first quote character enclosing the pattern-action pair. If it came after, it wouldn't be applied.
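To try the -F option without touching the real /etc/passwd file, here is a minimal sketch that feeds gawk two invented records in the same colon-separated layout. The account details and the variable names data and out are assumptions for illustration; awk may be replaced with gawk.

```shell
# Two made-up records in /etc/passwd layout (seven colon-separated fields).
data='root:x:0:0:root:/root:/bin/bash
tparker:x:1001:1001:Tim Parker:/home/tparker:/bin/bash'

# -F":" comes before the quoted pattern-action pair.
out=$(printf '%s\n' "$data" | awk -F":" '/tparker/ {print $1, $6}')
echo "$out"
```

Only the tparker record matches, and because the separator is a colon, $1 is the login name and $6 is the home directory.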
Earlier I mentioned that gawk is particular about its pattern-matching habits. The pattern cat will match any line that contains those three letters in sequence. Sometimes you want to be more exact in the matching. If you only want to match the word "cat" but not "concatenate," you can put spaces on either side of the pattern (note that this misses the word at the very start or end of a line):
/ cat / {print}
What about matching different cases? That's where the or instruction, represented by a vertical bar, comes in. For example
/ cat | CAT / {print}
will match "cat" or "CAT" on a line. However, what about "Cat"? That's where we also need to specify options within a pattern. With gawk, we use square brackets for this. To match any combination of upper- and lowercase letters in "cat", we must write the pattern like this:
/ [Cc][Aa][Tt] / {print}
This can get pretty awkward, but it's seldom necessary. To match just "Cat" and "cat," for example, we would use the pattern
/ [Cc]at / {print}
A useful matching operator is the tilde (~). This is used when you want to look for a match in a particular field in a record. For example, the pattern
$5 ~ /tparker/
will match any records where the fifth field contains tparker. It is similar to the == operator, except that it tests for a pattern match rather than exact equality. The matching operator can be negated, so
$5 !~ /tparker/
will find any record where the fifth field does not contain tparker.
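The difference between the matching operator and == shows up as soon as a field contains extra characters. The following sketch uses three invented records; the variable names contains, exact, and negated are hypothetical, and awk may be replaced with gawk.

```shell
# Made-up records; only the fifth field differs.
data='1 2 3 4 tparker
1 2 3 4 tparker2
1 2 3 4 root'

contains=$(printf '%s\n' "$data" | awk '$5 ~ /tparker/ {print $5}')
exact=$(printf '%s\n' "$data" | awk '$5 == "tparker" {print $5}')
negated=$(printf '%s\n' "$data" | awk '$5 !~ /tparker/ {print $5}')

echo "matches pattern: $contains"
echo "exactly equal:   $exact"
echo "no match:        $negated"
```

The ~ test selects both tparker and tparker2, while == selects only the exact string.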
A few characters (called metacharacters) have special meaning to gawk. Many of these metacharacters will be familiar to shell users, because they are carried over from UNIX shells. The metacharacters shown in Table 26.6 can be used in gawk patterns.
| Metacharacter | Meaning | Example | Meaning of Example |
| ^ | The beginning of the field | $3 ~ /^b/ | Matches if the third field starts with b |
| $ | The end of the field | $3 ~ /b$/ | Matches if the third field ends with b |
| . | Matches any single character | $3 ~ /i.m/ | Matches any record whose third field contains i, any one character, and then m |
| | | Or | /cat|CAT/ | Matches cat or CAT |
| * | Zero or more repetitions of a character | /UNI*X/ | Matches UNX, UNIX, UNIIX, UNIIIX, and so on |
| + | One or more repetitions of a character | /UNI+X/ | Matches UNIX, UNIIX, and so on, but not UNX |
| {a,b} | Between a and b repetitions of a character (both integers); older gawk versions need the --re-interval option | /UNI{1,3}X/ | Matches only UNIX, UNIIX, and UNIIIX |
| ? | Zero or one repetition of a character | /UNI?X/ | Matches UNX and UNIX only |
| [] | Range of characters | /I[BDG]M/ | Matches IBM, IDM, and IGM |
| [^] | Not in the set | /I[^DE]M/ | Matches all three-character strings starting with I and ending in M, except IDM and IEM |
Some of these metacharacters are used frequently. You will see some examples later in this chapter.
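As a quick check of the repetition metacharacters, the sketch below tests three sample strings against *, +, and ?. The labels star, plus, and query and the variable out are made up for this illustration, the patterns are anchored with ^ and $ so that the whole line must match, and awk may be replaced with gawk.

```shell
out=$(printf 'UNX\nUNIX\nUNIIX\n' | awk '
/^UNI*X$/ {print "star:", $0}
/^UNI+X$/ {print "plus:", $0}
/^UNI?X$/ {print "query:", $0}')
echo "$out"
```

UNX is reported by star and query but not plus, and UNIIX by star and plus but not query, in line with the table.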
Running pattern-action pairs one or two at a time from the command line would be pretty difficult (and time consuming), so gawk allows you to store pattern-action pairs in a file. A gawk program (called a script) is a set of pattern-action pairs stored
in an ASCII file. For example, this could be the contents of a valid gawk script:
/tparker/{print $6}
$2 != "foo" {print}
The first line would look for tparker and print the sixth column. The second line is then tested against the same record: if the second column doesn't match the string "foo", the entire line is displayed. gawk checks every record against each pattern-action pair in turn; it does not reread the file for each pair. When you are writing a script, you don't need to worry about the quotation marks around the pattern-action pairs as you did on the command line, because the new command to execute this script makes it obvious where the pattern-action pairs start and end.
After you have saved all of the pattern-action pairs in a program, they are called by gawk with the -f option on the command line:
gawk -f script filename
This command causes gawk to read all of the pattern-action pairs from the file script and process them against the file called filename. This is how most gawk programs are written. Don't confuse the -f and -F options!
If you want to specify a different field separator on the command line (one can also be set in the script, using a special format you'll see later), add the -F option alongside the -f option, before the filename:
gawk -f script -F":" filename
If you want to process more than one file using the script, just append the names of the files:
gawk -f script filename1 filename2 filename3 ...
By default, all output from the gawk command is displayed on the screen. You could redirect it to a file with the usual Linux redirection commands:
gawk -f script filename > save_file
There is another way of specifying the output file from within the script, but we'll come back to that in a moment.
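Here is a minimal sketch of the whole cycle: save the two-line script shown earlier to a file, then run it with -f and -F. The temporary directory, the file names script and input, and the sample records are all assumptions made for illustration; awk may be replaced with gawk.

```shell
workdir=$(mktemp -d)

# The two pattern-action pairs from the example script above.
cat > "$workdir/script" <<'EOF'
/tparker/ {print $6}
$2 != "foo" {print}
EOF

# Made-up colon-separated input.
cat > "$workdir/input" <<'EOF'
tparker:foo:3:4:5:/home/tparker
root:bar:3:4:5:/root
EOF

out=$(awk -f "$workdir/script" -F":" "$workdir/input")
echo "$out"
rm -r "$workdir"
```

The first record matches only the first pair, so its sixth field is printed; the second record matches only the second pair, so the whole line is printed.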
Two special patterns supported by gawk are useful when writing scripts. The BEGIN pattern is used to indicate any actions that should take place before gawk starts processing a file. This is usually used to initialize values, set parameters such as
field separators, and so on. The END pattern is used to execute any instructions after the file has been completely processed. Typically, this can be for summaries or completion notices.
Any instructions following the BEGIN and END patterns are enclosed in curly braces to identify which instructions belong to each pattern. Both BEGIN and END must appear in capitals. Here's a simple example of a gawk script that uses BEGIN and END, albeit only for sending a message to the terminal:
BEGIN { print "Starting to process the file" }
$1 == "UNIX" {print}
$2 > 10 {printf "This line has a value of %d\n", $2}
END { print "Finished processing the file. Bye!"}
In this script, a message is initially printed out. Then, for each line of the file, gawk checks both patterns: a line that has the word UNIX in the first column is echoed to the screen, and a line with the second column greater than 10 generates the message with its current value. Finally, the END pattern prints out a message that the program is finished.
If you have used any programming language before, you know that a variable is a storage location for a value. Each variable has a name and an associated value, which may change.
With gawk, you assign a variable a value using the assignment operator, =:
var1 = 10
This assigns the value 10 (numeric, not string) to the variable var1. With gawk, you don't have to declare variable types before you use them as you must with most other languages. This makes it easy to work with variables in gawk.
Don't confuse the assignment operator, =, which assigns a value, with the comparison operator, ==, which compares two values. This is a common error that takes a little practice to overcome.
The gawk language lets you use variables within actions, so the pattern-action pair
$1 == "Plastic" { count = count + 1 }
checks to see if the first column is equal to the string "Plastic", and if it is, increments the value of count by one. Somewhere above this line we should set a preliminary value for the variable count (usually in the BEGIN section), or we
will be adding one to something that isn't a recognizable number.
Actually, gawk assigns all variables a value of zero when they are first used, so you don't really have to define the value before you use it. It is, however, good programming practice to initialize the variable anyway.
Here's a more complete example:
BEGIN { count = 0 }
$5 == "UNIX" { count = count + 1 }
END { printf "%d occurrences of UNIX were found\n", count }
In the BEGIN section, the variable count is set to zero. Then, the gawk pattern-action pair is processed, with every occurrence of "UNIX" adding one to the value of count. After the entire file has been processed, the END statement displays
the total number.
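The counting script can be tried directly from the shell. The three input records below are invented so that two of them have UNIX in the fifth column; the variable name out is hypothetical, and awk may be replaced with gawk.

```shell
out=$(printf 'a b c d UNIX\na b c d Linux\na b c d UNIX\n' | awk '
BEGIN { count = 0 }
$5 == "UNIX" { count = count + 1 }
END { printf "%d occurrences of UNIX were found\n", count }')
echo "$out"
```

BEGIN runs once before the first record, the middle pair runs once per matching record, and END runs once after the last record.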
Variables can be used in combination with columns and values, so all of the following statements are legal:
count = count + $6
count = $5 - 8
count = $5 + var1
Variables can also be part of a pattern. The following are all valid as pattern-action pairs:
$2 > max_value {print "Max value exceeded by ", $2 - max_value}
$4 - var1 < min_value {print "Illegal value of ", $4}
Two special operators are used with variables to increment and decrement by one, because these are common operations. Both of these special operators are borrowed from the C language:
| count++ | increments count by one |
| count-- | decrements count by one |
The gawk language has a few built-in variables that are used to represent things such as the total number of records processed. These are useful when you want to get totals. Table 26.7 shows the important built-in variables.
| Variable | Description |
| NR | The number of records read so far |
| FNR | The number of records read from the current file |
| FILENAME | The name of the input file |
| FS | Field separator (default is whitespace) |
| RS | Record separator (default is newline) |
| OFMT | Output format for numbers (default is %.6g) |
| OFS | Output field separator |
| ORS | Output record separator |
| NF | The number of fields in the current record |
The NR and FNR values are the same if you are processing only one file, but if you are processing more than one file, NR is a running total across all files, while FNR is the count for the current file only.
The FS variable is useful, because it controls the input file's field separator. To use the colon for the /etc/passwd file, for example, you would use the command
FS=":"
in the script, usually as part of the BEGIN pattern.
You can use these built-in variables as you would any other. For example, the command
NF <= 5 {print "Not enough fields in the record"}
gives you a way to check the number of fields in each record you are processing and generate an error message if there are too few.
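To see how NR, FNR, and NF behave across more than one file, consider this small sketch. The two temporary files and their contents are made up for illustration, as are the names workdir and out; awk may be replaced with gawk.

```shell
workdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$workdir/first"     # two records
printf 'f g h i\n'    > "$workdir/second"    # one record

# Print the running record count, the per-file count, and the field count.
out=$(awk '{print NR, FNR, NF}' "$workdir/first" "$workdir/second")
echo "$out"
rm -r "$workdir"
```

On the third line of output NR has reached 3 while FNR has restarted at 1, because that record is the first one in the second file.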
Enough of the details have been covered to allow us to start doing some real gawk programming. Although we have not covered all of gawk's pattern and action considerations, we have seen all the important material. Now we can look at writing control
structures.
If you have any programming experience at all, or have tried some shell script writing, many of these control structures will appear familiar. If you haven't done any programming, common sense should help, as gawk is cleanly laid out without weird
syntax. Follow the examples and try a few test programs of your own.
Incidentally, gawk enables you to place comments anywhere in your scripts, as long as the comment starts with a # sign. You should use comments to indicate what is going on in your scripts if it is not immediately obvious.
The if statement is used to allow gawk to test some condition and, if it is true, execute a set of commands. The general syntax for the if statement is
if (expression) {commands} else {commands}
The expression is always evaluated to see if it is true or false. No other value is calculated for the if expression. Here's a simple if script:
# a simple if statement
{ if ($1 == 0) {
      print "This cell has a value of zero"
  }
  else {
      printf "The value is %d\n", $1
  }
}
You will notice that I used the curly braces to lay out the program in a readable manner. Of course, this could all have been typed on one line and gawk would have understood it, but writing in a nicely formatted manner makes it easier to understand
what is going on, and debugging the program becomes much easier if the need arises.
In this simple script, we test the first column to see if the value is zero. If it is, a message to that effect is printed. If not, the printf statement prints the value of the column.
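The if test can be tried on two made-up input lines, one of which is zero. This is a sketch only: the input values and the variable name out are assumptions, the action is wrapped in curly braces as gawk requires, and awk may be replaced with gawk.

```shell
out=$(printf '0\n42\n' | awk '{
    if ($1 == 0) {
        print "This cell has a value of zero"
    }
    else {
        printf "The value is %d\n", $1
    }
}')
echo "$out"
```

The first record takes the if branch and the second takes the else branch.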
The flow of the if statement is quite simple to follow. There can be several commands in each part, as long as the curly braces mark the start and end. There is no need to have an else section. It can be left out entirely, if desired. For example, this
is a complete and valid gawk script:
{ if ($1 == 0) {
      print "This cell has a value of zero"
  }
}
The gawk language, to be compatible with other programming languages, allows a special format of the if statement when a simple comparison is being conducted. This quick-and-dirty if structure is harder to read for novices, and I don't recommend it if
you are new to the language. For example, here's the if statement written the proper way:
# a nicely formatted if statement
{ if ($1 > $2) {
      print "The first column is larger"
  }
  else {
      print "The second column is larger"
  }
}
Here's the quick-and-dirty method:
# if syntax from hell
$1 > $2 {
    print "The first column is larger"
}
$1 <= $2 {
    print "The second column is larger"
}
You will notice that the keywords if and else are left off. Each test becomes a pattern placed in front of its own action. A block with no pattern at all runs for every record, so the second block needs the opposite test to behave like an else. This is much less readable if you do not know that it is standing in for an if statement, so you should be using the more verbose method of writing if statements for readability's sake.
The while statement allows a set of commands to be repeated as long as some condition is true. The condition is evaluated each time the program loops. The general format of the gawk while loop is
while (expression){
commands
}
For example, the while loop can be used in a program that calculates the value of an investment over several years (the formula for the calculation is value = amount * (1 + interest_rate)^years):
# interest calculation computes compound interest
# inputs from a file are the amount, interest_rate, and years
{ var = 1
  while (var <= $3) {
      printf("%f\n", $1*(1+$2)^var)
      var++
  }
}
You can see in this script that we initialize the variable var to 1 before entering the while loop. If we hadn't done this, var would start at zero for the first record and would keep its leftover value for every later record. The values for the amount, interest rate, and number of years are read from the input file. The increment operator (++) is used to add one to var each time through the loop.
The for loop is commonly used when you want to initialize a counter, test it, and update it all in one place. The syntax of the gawk for loop is
for (initialization; expression; increment) {
command
}
The initialization is executed only once, before the loop starts. The expression is evaluated before each pass through the loop, and the increment is executed after each pass. Usually the increment updates a counter of some type, but it can be any valid expression. Here's an example of a for loop, which is the same basic program as shown for the while loop:
# interest calculation computes compound interest
# inputs from a file are the amount, interest_rate, and years
{for (var=1; var <= $3; var++) {
printf("%f\n", $1*(1+$2)^var)
}
}
In this case, var is initialized when the for loop starts. The expression is evaluated, and if true, the loop runs. Then the value of var is incremented and the expression is tested again.
The format of the for loop might look strange if you haven't encountered programming languages before, but it is the same as the for loop used in C, for example.
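The while and for versions of the interest program can be run side by side to confirm that they produce the same figures. The input record (1000 at 10 percent for 3 years) is made up, the format is shortened to %.2f to keep the output readable, and the names input, with_while, and with_for are hypothetical; awk may be replaced with gawk.

```shell
# Made-up record: amount, interest rate, years.
input='1000 0.10 3'

with_while=$(printf '%s\n' "$input" | awk '{
    var = 1
    while (var <= $3) {
        printf("%.2f\n", $1 * (1 + $2) ^ var)
        var++
    }
}')

with_for=$(printf '%s\n' "$input" | awk '{
    for (var = 1; var <= $3; var++) {
        printf("%.2f\n", $1 * (1 + $2) ^ var)
    }
}')

echo "$with_while"
echo "$with_for"
```

Both loops print one value per year. The for version simply gathers the initialization, test, and increment on a single line.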
The next instruction tells gawk to process the next record in the file, regardless of what it was doing. For example, in this script:
{ command1
command2
command3
next
command4
}
as soon as the next statement is reached, gawk stops processing the current record, reads the next record from the file, and starts again at the first pattern-action pair in the script. In this example, command4 will never be executed, because next always moves on to a new record first.
The next statement is usually used inside an if statement, where you may want execution to return to the start of the script if some condition is met.
The exit statement makes gawk behave as though it has reached the end of the file, and it then executes any END patterns (if any exist). This is a useful method of aborting processing if there was an error in the file.
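A short sketch shows next and exit working together. The input lines, the STOP marker, and the variable names sum and out are all invented for this illustration; awk may be replaced with gawk.

```shell
out=$(printf '# header\n5\n7\nSTOP\n9\n' | awk '
/^#/         { next }              # skip comment lines; later pairs never see them
$1 == "STOP" { exit }              # stop reading input; control passes to END
             { sum = sum + $1 }    # everything else is added to the total
END          { print "sum:", sum }')
echo "$out"
```

The 9 after STOP is never read, so the total covers only the 5 and the 7, and the END action still runs after exit.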
The gawk language supports arrays and enables you to access any element in the array easily. No special initialization is necessary with an array, because gawk treats it like any other variable. The general format for assigning a value to an array element is
var[num]=value
As an example, consider the following script that reads an input file and generates an output file with the lines reversed in order:
# reverse lines in a file
{line[NR] = $0 } # remember each line
END { var = NR # output lines in reverse order
  while (var > 0) {
      print line[var]
      var--
  }
}
In this simple program (try and do the same task in any other programming language to see how efficient gawk is!), we used the NR (number of records) built-in variable. After reading each line into the array line[], we simply start at the last record
and print them again, stepping down through the array each time. We don't have to declare the array or do anything special with it, which is one of the powerful features of gawk.
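The line-reversal script can be tried on a three-line sample. The input words and the variable name out are made up for illustration, and awk may be replaced with gawk.

```shell
out=$(printf 'one\ntwo\nthree\n' | awk '
{ line[NR] = $0 }            # remember each line under its record number
END {
    var = NR                 # start from the last record read
    while (var > 0) {
        print line[var]
        var--
    }
}')
echo "$out"
```

Because the whole file is held in the array until END runs, this approach suits files that fit comfortably in memory.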
We've only scratched the surface of gawk's abilities, but you might have noticed that it is a relatively easy language to work with and places no special demands on the programmer. That's one of the reasons gawk is so often used for quick programs. It
is ideal, for example, for writing a quick script to count the total size of all the files in a directory. In the C language, this would take many lines, but it can be done in fewer than a dozen lines in gawk.
If you are a system administrator or simply a power user, you will find that gawk is a great complement to all the other tools you have available, especially because it can accept input from a pipe or redirection. For more information on gawk, check the
man pages or one of the few awk guides that are available.