awk, named for its authors Aho, Weinberger, and Kernighan, is a very cool little tool that you know exists and is installed on your system, but you have never bothered to learn how to use. I’m here to tell you that you really ought to!
If I stop for a moment to ponder the question, “what is the coolest tool in Unix?”, the immediate answer is awk. If I insist on pondering it for longer, giving each tool a moment for fair evaluation, the answer is still awk.¹ There are few tools as perfectly suited to their problem as awk is.
I’m not going to tell you what awk is, because there are already plenty of other resources for that. If you are totally unfamiliar with awk, then here’s a very brief summary:
awk reads a plaintext file as a list of newline-separated records, each made up of whitespace-separated columns. It matches each record against a set of rules (usually regular expressions) and performs the actions listed by each matching rule (such as summing or averaging a column, reformatting the output, or any number of other things).
In short, awk is a domain-specific language which reads the kind of files you probably have a lot of already, then mutates them or computes something interesting from them.
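To make that summary a bit more concrete, here is a minimal sketch. The sample data and the sum_sizes name are made up for illustration: one rule picks out records whose first column ends in .log and accumulates the second column, and an END rule prints the total and average once the input runs out.

```shell
# Hypothetical sample data: a file name and a size on each record.
sample='access.log 120
notes.txt 30
error.log 80'

# sum_sizes is an illustrative name, not a standard tool.
sum_sizes() {
    awk '
    # Rule: runs only for records whose first field ends in ".log"
    $1 ~ /\.log$/ { total += $2; n++ }
    # END runs once, after the last record has been read
    END { if (n > 0) printf "%d records, total %d, average %d\n", n, total, total / n }
    '
}

printf '%s\n' "$sample" | sum_sizes
# prints: 2 records, total 200, average 100
```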
→ "awk" in the POSIX specification
Instead of teaching you how to use it, I’m just going to tell you about some things I’ve used awk for, in an effort to convince you that it’s cool.
Please hold while I hastily run history | grep awk. Wait, no, it would be more ironic to run history | awk '/awk/ { print $0 }' instead. One sec…
Rewriting references in comments
I had some code like this:
// Duplicates a [dirent] object. Call [dirent_free] to get rid of it later.
export fn dirent_dup(e: *dirent) dirent = { // ...
It needed to become this:
// Duplicates a [[dirent]] object. Call [[dirent_free]] to get rid of it later.
export fn dirent_dup(e: *dirent) dirent = { // ...
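So the following awk command looks for lines which are comments, then replaces every match of a regex on those lines with another string (& stands for the matched text). Note that the character class has to include _ for names like dirent_free to match:
awk '/^\/\// { gsub(/\[[a-zA-Z_:]+\]/, "[&]") } { print $0 }' < input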
In this manner I used it as a kind of modified sed which only operates on certain lines.
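A quick way to convince yourself of the "only certain lines" part is to feed it one comment and one line of code that also happens to contain brackets. This is just a sketch: rewrite_refs is a hypothetical wrapper name, the two input lines are invented, and the character class here includes _ so that names like dirent_free match.

```shell
# rewrite_refs is a hypothetical wrapper name used only for this demo.
rewrite_refs() {
    awk '
    # Only lines starting with // are touched; & stands for the matched text
    /^\/\// { gsub(/\[[a-zA-Z_:]+\]/, "[&]") }
    # Every line is printed, modified or not
    { print }
    '
}

printf '%s\n' '// See [dirent_free].' 'let x = buf[i];' | rewrite_refs
# prints:
# // See [[dirent_free]].
# let x = buf[i];
```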
Extracting data from one script to use in another script
I have a script which includes something like this:
modules="ascii
bufio
bytes
compress_flate
compress_zlib
crypto_blake2b
crypto_math
crypto_random
crypto_md5
crypto_sha1
crypto_sha256
crypto_sha512
# ...
uuid"
I wanted to extract this list of names from the script, and replace _ with :: for all of them. awk to the rescue!
# Yes, I am entirely aware that this is a hack
modules=$(awk '
	# next stops the rules below from matching the rewritten first line a second time
	/^modules="/ { sub(/.*=\"/, ""); gsub(/_/, "::"); print $1; mods = 1; next }
	/^[a-z][a-z0-9_]+$/ { if (mods == 1) { gsub(/_/, "::"); print $1 } }
	/^[a-z][a-z0-9_]+"$/ { if (mods == 1) { mods = 0; sub(/"/, ""); gsub(/_/, "::"); print $1 } }
' < scripts/gen-stdlib)
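For comparison, the same state-machine idea might be condensed into a single flag that is set on the opening line and cleared at the closing quote. This is not the script above, only a sketch of one alternative; extract_modules and the collecting variable are names I made up, and it assumes the list looks like the excerpt shown (one lowercase name per line, closing quote at the end of the last entry).

```shell
# extract_modules is a hypothetical name; it reads the script on stdin.
extract_modules() {
    awk '
    # Opening line: strip the assignment prefix and start collecting
    /^modules="/ { sub(/^modules="/, ""); collecting = 1 }
    collecting {
        # A trailing quote marks the last entry of the list
        if (sub(/"$/, "")) collecting = 0
        # Skip comments and anything else that is not a plain name
        if ($0 ~ /^[a-z][a-z0-9_]*$/) { gsub(/_/, "::"); print }
    }
    '
}

printf '%s\n' 'x=1' 'modules="ascii' 'crypto_sha1' '# ...' 'uuid"' 'other_thing' | extract_modules
# prints:
# ascii
# crypto::sha1
# uuid
```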
Adding syntax highlighting to patches
My email client, aerc, lets you pipe emails into an arbitrary command to format them nicely for displaying in your terminal. One kind of email I get often is a patch, with a diff dropped directly into the email. I wrote this awk script to add ANSI colors to such an email:
BEGIN {
    bright = "\x1B[1m"
    red = "\x1B[31m"
    green = "\x1B[32m"
    cyan = "\x1B[36m"
    reset = "\x1B[0m"
    hit_diff = 0
}
{
    if (hit_diff == 0) {
        # Strip carriage returns from line
        gsub(/\r/, "", $0)
        if ($0 ~ /^diff /) {
            hit_diff = 1;
            print bright $0 reset
        } else if ($0 ~ /^.*\|.*(\+|-)/) {
            # Diffstat line: color only the part after the "|".
            # awk strings are indexed from 1, so the left part starts at 1.
            left = substr($0, 1, index($0, "|")-1)
            right = substr($0, index($0, "|"))
            gsub(/-+/, red "&" reset, right)
            gsub(/\++/, green "&" reset, right)
            print left right
        } else {
            print $0
        }
    } else {
        # Strip carriage returns from line
        gsub(/\r/, "", $0)
        if ($0 ~ /^-/) {
            print red $0 reset
        } else if ($0 ~ /^\+/) {
            print green $0 reset
        } else if ($0 ~ /^ /) {
            print $0
        } else if ($0 ~ /^@@ (-[0-9]+,[0-9]+ \+[0-9]+,[0-9]+) @@.*/) {
            sub(/^@@ (-[0-9]+,[0-9]+ \+[0-9]+,[0-9]+) @@/, cyan "&" reset)
            print $0
        } else {
            print bright $0 reset
        }
    }
}
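The diffstat branch of that script splits each line at the first | so that only the right-hand side gets colored. Here is that one step on its own, as a sketch with an invented sample line and a made-up split_stat name. The thing to keep in mind is that awk strings are indexed from 1, so the left part is everything from position 1 up to the character before the bar.

```shell
# split_stat is a hypothetical name used only for this demo.
split_stat() {
    awk '{
        i = index($0, "|")           # position of the first "|", counting from 1
        left = substr($0, 1, i - 1)  # everything before the bar
        right = substr($0, i)        # the bar and everything after it
        print "[" left "]"
        print "[" right "]"
    }'
}

printf '%s\n' ' main.c | 4 ++--' | split_stat
# prints:
# [ main.c ]
# [| 4 ++--]
```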
Pulling a specific column out of another command
This is the most basic use of awk. git ls-tree returns something like this:
100644 blob aa61a6c84fa215178b560e2bddcdcb18bf62ccc7 .build.yml
100644 blob 73ab8769f93bbbd5c4b69d33c2fa86329d05bc85 .gitignore
100644 blob 65d4d3ae9206f664e72c49ffed1489414852e637 LICENSE
100644 blob 6fd05a7d17471026df258d9931309a19ac286c5f README.md
040000 tree dfdc471efbd87a131fd7fe41706debdb48411ebe assets
100644 blob ede1a81e5b44031d95e315985ed7e7831067d609 config.toml
040000 tree a8ab3f0c2db376725d480c673e289d654d289acc content
040000 tree 7b1a4966c4143bcea991e9f77620cb4fda887d66 layouts
040000 tree e4dfc4f7500e111652aa7880002252b47239a2d0 static
100644 blob f31711885de4cd43571ee633b553016b766d3ec1 webring-in.template
Recently I was looking for comments on the first line of any file in my git repository, so:
git ls-tree -r HEAD | awk '{ print $4 }' | xargs -n1 sed 1q | grep '//' | less
I include this to demonstrate some restraint. The xargs, sed, and grep commands in this pipeline could all have been incorporated into awk, but it’s simpler not to.
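For the curious, here is roughly what folding those three commands into awk might look like. Treat it as a sketch rather than a recommendation: first_line_comments is a name I made up, and it assumes paths contain no whitespace, since it relies on $4 just like the pipeline above.

```shell
# first_line_comments is a hypothetical name. It expects `git ls-tree -r`
# style records on stdin and assumes paths contain no whitespace.
first_line_comments() {
    awk '{
        path = $4
        # Read only the first line of the file (the job of sed 1q) and
        # keep it only if it contains // (the job of grep)
        if ((getline line < path) > 0 && line ~ /\/\//)
            print line
        # Close each file so a large tree does not exhaust open file handles
        close(path)
    }'
}
```

It would be used as git ls-tree -r HEAD | first_line_comments | less. It works, but the getline and close bookkeeping is exactly the sort of thing that xargs, sed, and grep handle for free.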
Numbering lines from stdin
Sometimes I have a file and I want it to have line numbers. So, I wrote a little shell one-liner that does the job:
$ cat ~/bin/lineno
exec awk '{ print NR "\t" $0 }'
$ lineno < /etc/os-release
1 NAME="Alpine Linux"
2 ID=alpine
3 VERSION_ID=3.14.0_alpha20210212
4 PRETTY_NAME="Alpine Linux edge"
5 HOME_URL="https://alpinelinux.org/"
6 BUG_REPORT_URL="https://bugs.alpinelinux.org/"
In conclusion
You’re doing yourself a disservice if you don’t know how to use awk. awk is only applicable to a certain kind of problem, but it’s a problem you’ll encounter more often than you think. Plus, once you get thinking in awk terms, you’ll find yourself subtly formatting your data in awk-friendly ways :) Learn it!
¹ Though special mention goes to ar (I dunno), cut (it’s useful), dd (for being a silly wart), ed (for not being installed on anyone’s system by default anymore, which pisses me off), and fort77 (for being specified by POSIX for some reason).