Overview
This tutorial is written from the perspective of processing large amounts of text data quickly. Typical uses for this tool include parsing log files or command output. I have used AWK to quickly parse and summarize logs containing anywhere from a handful of records to thousands of records. The great thing about awk is that it is included out of the box in most Unix-based operating systems such as GNU/Linux and Mac OS X.
Required Technologies
- Unix-based operating system such as GNU/Linux or Mac OS X.
- Shell such as BASH, ZSH, etc.
- VIM, or your favorite programmer's text editor.
Initial setup
Ensure that AWK is installed on your operating system.
awk --version
If awk is installed, it should pull back something like the following (the exact text varies between awk implementations):
awk version 20200816
If you do not have awk installed, or you get an error, install it using your favorite package manager as seen below:
Mac OSX Homebrew:
brew install awk
Debian-based GNU/Linux using the Advanced Package Tool (APT):
sudo apt-get update
sudo apt-get install gawk
RPM-based GNU/Linux using the Yellowdog Updater, Modified (yum):
sudo yum install gawk
Check where Awk is installed and that it is accessible to your terminal: Input:
which awk
Output:
/usr/bin/awk
Remember where this was installed because we are going to use this later when we create our awk file.
Anatomy of a chainsaw: Awk and its workflow.
The general workflow of awk is “Until End Of File: Read, Execute, Repeat”. AWK reads a line from the input stream (file, pipe, or stdin), stores it in memory, and then executes commands on that line. This cycle of reading and executing continues until the end of the input stream, a.k.a. EOF. Any results are sent to the output stream (file, pipe, or stdout).
Here is an example of how to run Awk on myfile.txt
awk 'Begin-Body-End-blocks' myfile.txt
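Here Begin-Body-End-blocks stands in for the Awk program: the Body block is executed on each line of myfile.txt, while the Begin and End blocks each run once.
That being said, there are three kinds of code blocks that make up an Awk program, each of which is optional (though a program needs at least one of them), as seen in the table below.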
Block Name | Optional/Required | Syntax | Execution |
---|---|---|---|
Begin | Optional | BEGIN {Awk do stuff} | Executed at the beginning of the program once |
Body | Optional | /pattern/ {Awk do stuff} | Executed on each line until End of File (EOF) |
End | Optional | END {Awk do stuff} | Executed at the end of the program once |
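To make the table a bit more concrete, here is a minimal sketch that uses all three blocks in one program. The log lines and the ERROR pattern are made up purely for illustration; they are not from a real log format.

```shell
# Hypothetical log lines, invented for this sketch.
log='INFO service started
ERROR disk full
INFO request served
ERROR timeout'

# BEGIN runs once before any input is read, the /ERROR/ body runs on each
# matching line, and END runs once after the last line has been read.
summary=$(printf '%s\n' "$log" | awk '
BEGIN   { print "Scanning log" }
/ERROR/ { count++ }
END     { print "Errors found: " count }')

echo "$summary"
```

The Body block here never prints anything itself; it only counts, and the End block reports the total.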
Having learned all of that, let's kick the tires a bit in the next section.
Syntax
AWK can be executed directly from the terminal or from a text file containing AWK commands.
Terminal Execution
Awk can be executed from the terminal directly using the form below.
awk [options] 'Begin-Body-End-blocks' target_input_stream ...
- Create a file titled ‘myfile.txt’ using the touch command as seen below. Input:
touch myfile.txt
Output: A file named myfile.txt is created in the current directory.
- Now let's open myfile.txt in our favorite text editor such as VIM and copy the following text into the file, saving it.
Line1c1 Line1c2
Line2c1 Line2c2
Line3c1 Line3c2
Line4c1 Line4c2
- Let's execute our first awk command on our file. Note that printf does not add a newline on its own, so each message ends with \n. Input:
awk 'BEGIN{printf "---Begin Block---\n"} {print} END{printf "---End Block---\n"}' myfile.txt
Output:
---Begin Block---
Line1c1 Line1c2
Line2c1 Line2c2
Line3c1 Line3c2
Line4c1 Line4c2
---End Block---
As you can see from our example, the Begin block (BEGIN{printf "---Begin Block---\n"}) executes first, the Body block ({print}) executes on each line in the file, printing it, and finally the End block (END{printf "---End Block---\n"}) executes.
Text File Execution
In order to use a text file we replace 'Begin-Body-End-blocks' with -f script_file, as shown below.
awk [options] -f script_file target_input_stream ...
To test this let’s expand on what we have created for the terminal example a bit.
- Create a file titled ‘scriptfile.awk’ using the touch command as seen below. Input:
touch scriptfile.awk
Output: A file named scriptfile.awk is created in the current directory.
- Now let's open scriptfile.awk in our favorite text editor such as VIM and copy the following text into the file, saving it.
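BEGIN {
    # Runs once, before any input is read; \n ends the line
    printf "---Begin Block---\n"
}
{
    # Runs on every line of the input
    print
}
END {
    # Runs once, after the last line has been read
    printf "---End Block---\n"
}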
- Let's execute our awk script on our file. Input:
awk -f scriptfile.awk myfile.txt
Output:
---Begin Block---
Line1c1 Line1c2
Line2c1 Line2c2
Line3c1 Line3c2
Line4c1 Line4c2
---End Block---
Options
Awk comes with options that can be very helpful in our text mining journey. Here are just a few. To get a full list of options, reference the man page for Awk or check out the references at the bottom of this blog post.
-v
The -v option assigns a value to a variable before program execution begins.
Input:
awk -v fname=Harry 'BEGIN{printf "Good morning %s!\n", fname}'
Output:
Good morning Harry!
Notice that this works even without an input file: the Begin block is always executed at the beginning of the program, regardless of the existence of the other blocks.
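A variable set with -v is also visible in the Body block, which makes it a handy way to pass a shell value into Awk. The sketch below assumes a made-up shell variable named min_fields and some invented input lines.

```shell
# Hypothetical threshold held in a shell variable.
min_fields=3

# -v copies the shell value into the awk variable "min" before the program runs.
result=$(printf 'a b\na b c\na b c d\n' |
    awk -v min="$min_fields" 'NF >= min { print NF ": " $0 }')

echo "$result"
```

Only the lines with at least three fields are printed, each prefixed with its field count.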
Built-In Variables
Awk provides several built-in variables that make common tasks more convenient.
ARGC
ARGC stores a count of the number of arguments provided at the terminal. Input:
awk 'BEGIN {print "Argument Count =", ARGC}' arg1 arg2 arg3 arg4
Output:
Argument Count = 5
As you can see, the argument count shows 5 because the command name awk itself counts as the first argument.
ARGV
ARGV is a one-dimensional array that stores the arguments passed into Awk, with valid indexes ranging from 0 to ARGC-1.
Input:
awk 'BEGIN {
for (i = 0; i < ARGC; i++) {
printf "ARGV[%d] = %s\n", i, ARGV[i]
}
}' arg1 arg2 arg3 arg4
Output:
ARGV[0] = awk
ARGV[1] = arg1
ARGV[2] = arg2
ARGV[3] = arg3
ARGV[4] = arg4
CONVFMT
CONVFMT stores the format used when numbers are converted to strings. By default its value is %.6g.
Input:
awk 'BEGIN { print "Number Conversion Format =", CONVFMT }'
Output:
Number Conversion Format = %.6g
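To see CONVFMT actually doing something, here is a small sketch that changes it and then forces a number into a string. The variable names pi and text are just placeholders for this example.

```shell
result=$(awk 'BEGIN {
    CONVFMT = "%.2f"
    pi = 3.14159
    text = pi ""    # concatenating with "" forces a number-to-string conversion
    print text
}')

echo "$result"
```

With the format set to %.2f, the converted string keeps only two decimal places.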
ENVIRON
ENVIRON is an associative array of environment variables. Input:
awk 'BEGIN { print ENVIRON["USER"] }'
Output:
someuser
To find other environment variables, use the env command. To learn more about environment variables see Linuxize Environmental variables.
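Since the value of USER differs from machine to machine, here is a sketch that sets its own variable for a single command so the result is predictable. GREETING is a made-up variable name used only for this example.

```shell
# GREETING is a hypothetical variable, exported only for this one awk command.
result=$(GREETING="hello" awk 'BEGIN { print ENVIRON["GREETING"] }')

echo "$result"
```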
FILENAME
FILENAME stores the current file name.
Input:
awk 'END {print FILENAME}' myfile.txt
Output:
myfile.txt
NOTE: FILENAME is an undefined variable in the BEGIN block.
FS
FS represents the (input) field separator. The default value is a single space " ", which Awk treats as any run of spaces or tabs. You may also change this by using the -F command-line option.
Input:
awk 'BEGIN {print "FS = " FS}' | cat -vte
Output:
FS =  $
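As a quick sketch of the -F option, the example below splits colon-separated records. The records are invented for illustration and only loosely resemble a passwd file.

```shell
# Hypothetical colon-separated records, invented for this sketch.
result=$(printf 'alice:x:1001\nbob:x:1002\n' | awk -F: '{ print $1, $3 }')

echo "$result"
```

The fields are split on the colon, while the output still uses the default output separator, a space.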
NF
NF stores the number of fields in the current record.
For example, the below prints only lines containing more than three fields.
Input:
echo -e "field1 field2\nfield1 field2 field3\nfield1 field2 field3 field4" | awk 'NF > 3'
Output:
field1 field2 field3 field4
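Because NF is just a number, it can also be used after a dollar sign to grab the last field of a line, however many fields that line has. A minimal sketch with made-up input:

```shell
# Print the field count followed by the last field of each line.
result=$(printf 'a b c\nd e\n' | awk '{ print NF, $NF }')

echo "$result"
```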
NR
NR stores the number of the current record.
For instance, the below prints the record if the current record number is less than 4.
Input:
echo -e "One\nTwo\nThree\nFour" | awk 'NR < 4'
Output:
One
Two
Three
FNR
FNR is similar to NR, but is relative to the current input file.
This is particularly useful when Awk is working with multiple input files, because the value of FNR resets at the start of each new file.
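Here is a sketch that puts FNR and NR side by side. It assumes two small throwaway files, first.txt and second.txt, which the example creates in a temporary directory and removes afterwards.

```shell
# Create two hypothetical input files in a scratch directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\n' > "$tmpdir/second.txt"

# FNR restarts for each file, while NR keeps counting across both.
result=$(cd "$tmpdir" && awk '{ print FILENAME, "FNR=" FNR, "NR=" NR }' first.txt second.txt)

echo "$result"
rm -rf "$tmpdir"
```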
OFMT
OFMT stores the output format for numbers. The default value is %.6g.
Input:
awk 'BEGIN {print "OFMT = " OFMT}'
Output:
OFMT = %.6g
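To see the effect of OFMT, this small sketch changes it and prints two numbers. Whole numbers are printed as integers regardless of OFMT, so only the first value is affected.

```shell
result=$(awk 'BEGIN { OFMT = "%.2f"; print 3.14159, 42 }')

echo "$result"
```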
OFS
OFS stores the output field separator. Its default value is a single space " ".
Input:
awk 'BEGIN {print "OFS = " OFS}' | cat -vte
Output:
OFS =  $
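Changing OFS is an easy way to turn space-separated input into something like CSV. A minimal sketch, using invented input:

```shell
# The commas in the print statement are replaced by OFS in the output.
result=$(echo "alpha beta gamma" | awk 'BEGIN { OFS = "," } { print $1, $2, $3 }')

echo "$result"
```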
ORS
ORS stores the output record separator. Its default value is a newline "\n".
Input:
awk 'BEGIN {print "ORS = " ORS}' | cat -vte
Output:
ORS = $
$
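As a sketch of ORS in action, the example below joins several input lines into a single line by swapping the newline for a semicolon. The input words are arbitrary.

```shell
# Every print now ends with ";" instead of a newline.
result=$(printf 'one\ntwo\nthree\n' | awk 'BEGIN { ORS = ";" } { print }')

echo "$result"
```

Note that the separator is also written after the last record, so the output ends with a trailing semicolon.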
RLENGTH
RLENGTH stores the length of the string matched by the match function.
AWK's match function searches the input string for a given regular expression.
Input:
awk 'BEGIN{ if (match("One Two Three", "re")) { print RLENGTH } }'
Output:
2
RS
RS stores the (input) record separator. Its default value is a newline "\n".
Input:
awk 'BEGIN {print "RS = " RS}' | cat -vte
Output:
RS = $
$
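Setting RS lets Awk treat something other than a line as a record. This sketch assumes semicolon-separated input, made up for the example, with no trailing newline.

```shell
# Each semicolon-separated chunk becomes its own record.
result=$(printf 'red;green;blue' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')

echo "$result"
```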
RSTART
RSTART stores the starting position (counting from 1) of the text matched by the match function.
Input:
awk 'BEGIN { if (match("The quick brown fox", "bro")) { print RSTART } }'
Output:
11
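RSTART and RLENGTH are most useful together, since substr can use them to pull the matched text out of the string. A minimal sketch, where the pattern /b[a-z]+/ is just an arbitrary example:

```shell
result=$(awk 'BEGIN {
    text = "The quick brown fox"
    if (match(text, /b[a-z]+/)) {
        # substr(string, start, length) extracts exactly what match found
        print RSTART, RLENGTH, substr(text, RSTART, RLENGTH)
    }
}')

echo "$result"
```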
SUBSEP
SUBSEP stores the separator character for array subscripts. The default value is \034.
Input:
awk 'BEGIN { print "SUBSEP = " SUBSEP }' | cat -vte
Output:
SUBSEP = ^\$
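SUBSEP comes into play when an array is indexed with more than one subscript: Awk joins the subscripts into a single key using SUBSEP. The sketch below uses a made-up array named grid and splits its key back apart.

```shell
result=$(awk 'BEGIN {
    grid["row1", "col2"] = "x"
    for (key in grid) {
        # The key is really "row1" SUBSEP "col2", so split on SUBSEP
        n = split(key, parts, SUBSEP)
        print n, parts[1], parts[2]
    }
}')

echo "$result"
```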
$0
$0 stores the entire current record (by default, the current line of the input stream).
Input:
awk '{print $0}' myfile.txt
Output:
Line1c1 Line1c2
Line2c1 Line2c2
Line3c1 Line3c2
Line4c1 Line4c2
$n
$n stores the nth field of the current record, with fields separated by FS.
For example, to print the second field of each line in myfile.txt:
Input:
awk '{print $2}' myfile.txt
Output:
Line1c2
Line2c2
Line3c2
Line4c2
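Fields can be used in calculations as well as printed. As a sketch of the kind of quick summary mentioned in the overview, this example totals the second column of some invented "name count" records.

```shell
# Hypothetical "name count" records, invented for this sketch.
result=$(printf 'apples 3\npears 5\nplums 2\n' |
    awk '{ total += $2 } END { print "total", total }')

echo "$result"
```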
Regular Expressions a.k.a. REGEX
AWK handles regular expressions powerfully and effectively, and numerous problems can be solved with simple regular expressions. As this article does not cover regular expressions themselves, please see the reference section of this blog post. That being said, below are some of the ways regex can be used in Awk.
Method 1: Execute a body block on a line only if something in the line matches the pattern.
awk '/some-regex/ {do-awk-stuff}' target_input_stream
Input:
awk '/e2/ {print}' myfile.txt
Output:
Line2c1 Line2c2
Method 2: Your regular expression can also be used in a conditional expression as seen below.
In this case it takes the form of either ~ (is like) or !~ (is not like).
awk '{if ($n ~ /some-regex/) {do-awk-stuff}}' target_input_stream
or
awk '{if ($n !~ /some-regex/) {do-awk-stuff}}' target_input_stream
Here is another way to write our first regex example. Input:
awk '{if ($0 ~ /e2/) {print}}' myfile.txt
Output:
Line2c1 Line2c2
To search in the second column and return data only from the second column: Input:
awk '{if ($2 ~ /e2/) {print $2}}' myfile.txt
Output:
Line2c2
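The examples so far only use ~, so here is a sketch of the !~ form, written as a pattern in front of the body block. The log lines are made up, and the idea is simply to drop every line whose first field is DEBUG.

```shell
# Hypothetical log lines, invented for this sketch.
log='INFO started
DEBUG cache warm
ERROR disk full
DEBUG retrying'

# Keep only the lines whose first field is NOT exactly "DEBUG".
result=$(printf '%s\n' "$log" | awk '$1 !~ /^DEBUG$/ { print }')

echo "$result"
```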
For more on regular expressions please see the following sites that I have found helpful.
Rexegg - Regex Quick Start Guide
Regex 101 - Build, Test, and Debug Regular Expressions
References
Linuxize Environmental variables
Geeks for Geeks - Environmental Variables
Regex 101 - Build, Test, and Debug Regular Expressions
Rexegg - Regex Quick Start Guide