LFCS Part 40: Introduction to awk for Text Processing and Data Extraction

Master the awk command for powerful text processing and data extraction in Linux. Learn awk syntax, field manipulation, built-in variables, patterns and actions, calculations, and real-world system administration examples for efficient data analysis.

Welcome back to the LFCS Certification - Phase 1 series! In our previous posts, we explored regular expressions (Posts 36-37), text transformation with tr (Post 38), and pattern matching with grep (Post 39). Now we're going to learn one of the most powerful text processing tools in Linux: awk.

While grep searches for patterns and tr transforms characters, awk is a full-fledged programming language designed for text processing. It excels at extracting, manipulating, and reporting on structured data—making it invaluable for system administrators who need to parse logs, process configuration files, and analyze command output.

What is awk?

awk is a pattern-scanning and text-processing language created in 1977 by Alfred Aho, Peter Weinberger, and Brian Kernighan at Bell Labs. The name comes from their initials: Aho, Weinberger, Kernighan.

awk reads input line by line, splits each line into fields (columns), and allows you to perform actions based on patterns. Think of it as a combination of:

  • grep — for pattern matching
  • cut — for extracting fields
  • A programming language — for calculations and logic
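
To see all three of these roles in one command, here is a small illustrative sketch (the log path is just an example):

awk '/error/ {count++; print $1, $2} END {print count, "error lines"}' /var/log/syslog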

Why awk Matters

As a system administrator, you'll use awk for:

Extract specific columns from output:

ps aux | awk '{print $1, $11}'  # Show user and command

Calculate totals and averages:

df -k | awk 'NR>1 {sum+=$3} END {print sum}'  # Sum used disk space in KB

Parse log files:

awk '/error/ {print $1, $2, $NF}' /var/log/syslog

Process CSV data:

awk -F',' '{print $2, $4}' data.csv  # Extract columns 2 and 4

awk is installed by default on virtually every Linux system. Let's dive into how to use it.


Part 1: Basic awk Syntax

The basic syntax of awk is:

awk 'pattern {action}' file

Or with input from a pipe:

command | awk 'pattern {action}'

  • pattern: Condition to match (optional)
  • action: What to do when pattern matches
  • file: Input file (optional if using pipes)

If you omit the pattern, the action applies to all lines. If you omit the action, awk prints matching lines (like grep).
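
Both shorthand forms are useful on their own:

awk '{print $1}' file.txt   # No pattern: action runs on every line
awk '/error/' file.txt      # No action: matching lines are printed, like grep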

Simple Example

Let's create a test file:

cat > grades.txt << 'EOF'
Alice 85 92 88
Bob 78 85 90
Charlie 92 88 95
David 65 70 68
Eve 88 90 92
EOF

Print the entire file:

awk '{print}' grades.txt

Or simply:

awk '1' grades.txt

Output:

Alice 85 92 88
Bob 78 85 90
Charlie 92 88 95
David 65 70 68
Eve 88 90 92

The 1 is always true, so all lines are printed.
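
Any expression that evaluates to true works as a pattern. A related idiom uses NF itself, which is 0 on blank lines, so awk 'NF' prints only non-empty lines:

awk 'NF' grades.txt   # Same output here; blank lines would be skipped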


Part 2: Working with Fields

awk automatically splits each line into fields (columns) based on whitespace. You access fields using $1, $2, $3, etc.

Field Variables

  • $0 — The entire line
  • $1 — First field
  • $2 — Second field
  • $3 — Third field
  • $NF — Last field (NF holds the Number of Fields)
  • $(NF-1) — Second-to-last field

Example: Extract Specific Fields

Print just the names (first field):

awk '{print $1}' grades.txt

Output:

Alice
Bob
Charlie
David
Eve

Print name and first score:

awk '{print $1, $2}' grades.txt

Output:

Alice 85
Bob 78
Charlie 92
David 65
Eve 88

Notice that the output fields are separated by a space: a comma between print arguments inserts the output field separator (OFS), which defaults to a space.

Print name and last score:

awk '{print $1, $NF}' grades.txt

Output:

Alice 88
Bob 90
Charlie 95
David 68
Eve 92

$NF always refers to the last field, regardless of how many fields there are.
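
You can see this with input whose field count varies from line to line:

printf 'a b\na b c d\n' | awk '{print $NF}'

Output:

b
d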

If you don't use commas in print, fields are concatenated:

awk '{print $1 $2}' grades.txt

Output:

Alice85
Bob78
Charlie92
David65
Eve88

No space between name and number!


Part 3: Built-in Variables

awk has several useful built-in variables:

3.1: NR (Number of Records)

NR is the current line number:

awk '{print NR, $0}' grades.txt

Output:

1 Alice 85 92 88
2 Bob 78 85 90
3 Charlie 92 88 95
4 David 65 70 68
5 Eve 88 90 92

Each line is prefixed with its line number.

3.2: NF (Number of Fields)

NF is the number of fields in the current line:

awk '{print NF, $0}' grades.txt

Output:

4 Alice 85 92 88
4 Bob 78 85 90
4 Charlie 92 88 95
4 David 65 70 68
4 Eve 88 90 92

All lines have 4 fields (name + 3 scores).

3.3: Combining NR and NF

awk '{print "Line", NR, "has", NF, "fields"}' grades.txt

Output:

Line 1 has 4 fields
Line 2 has 4 fields
Line 3 has 4 fields
Line 4 has 4 fields
Line 5 has 4 fields

3.4: FS (Field Separator)

By default, awk uses whitespace as the field separator. You can change this with -F or by setting FS:

Using -F flag:

# Create a colon-separated file
cat > data.csv << 'EOF'
Alice:85:92:88
Bob:78:85:90
EOF

# Use colon as separator
awk -F':' '{print $1, $2}' data.csv

Output:

Alice 85
Bob 78

Setting FS in BEGIN block:

awk 'BEGIN {FS=":"} {print $1, $2}' data.csv

Same output.
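
When FS is longer than one character, it is treated as a regular expression. For example, to split on either a colon or a comma (useful for mixed delimiters):

awk 'BEGIN {FS="[:,]"} {print $1, $2}' data.csv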

3.5: OFS (Output Field Separator)

By default, awk separates output fields with a space. Change this with OFS:

awk 'BEGIN {OFS=","} {print $1, $2, $3}' grades.txt

Output:

Alice,85,92
Bob,78,85
Charlie,92,88
David,65,70
Eve,88,90

Now fields are comma-separated.
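
One subtlety: OFS is applied only when awk rebuilds the output — between comma-separated print arguments, or when $0 is reconstructed after a field changes. To re-separate an entire unmodified line, force a rebuild with the $1=$1 idiom:

awk 'BEGIN {OFS=","} {$1=$1; print}' grades.txt

This prints each full line with commas in place of the original spaces.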


Part 4: Patterns and Conditions

You can filter which lines awk processes using patterns.

4.1: Match Lines Containing Text

Print lines containing "Alice":

awk '/Alice/ {print}' grades.txt

Output:

Alice 85 92 88

This is similar to grep Alice grades.txt.

4.2: Comparison Operators

Print students with first score > 80:

awk '$2 > 80 {print $1, $2}' grades.txt

Output:

Alice 85
Charlie 92
Eve 88

Available operators:

  • == — Equal to
  • != — Not equal to
  • > — Greater than
  • < — Less than
  • >= — Greater than or equal
  • <= — Less than or equal
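
For example, to find the student whose first score is exactly 92:

awk '$2 == 92 {print $1}' grades.txt

Output:

Charlie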

4.3: Logical Operators

AND (&&):

# Students with first score > 80 AND last score > 90
awk '$2 > 80 && $NF > 90 {print $1}' grades.txt

Output:

Charlie
Eve

OR (||):

# Students with first score >= 90 OR last score >= 90
awk '$2 >= 90 || $NF >= 90 {print $1}' grades.txt

Output:

Bob
Charlie
Eve

4.4: Match Specific Fields

Print lines where name starts with "A":

awk '$1 ~ /^A/ {print}' grades.txt

Output:

Alice 85 92 88

The ~ operator means "matches regex".

Doesn't match (!~):

# Names NOT starting with A
awk '$1 !~ /^A/ {print $1}' grades.txt

Output:

Bob
Charlie
David
Eve

Part 5: BEGIN and END Blocks

awk has special blocks that execute before and after processing:

5.1: BEGIN Block

Executes once before reading any input:

awk 'BEGIN {print "Name Score"} {print $1, $2}' grades.txt

Output:

Name Score
Alice 85
Bob 78
Charlie 92
David 65
Eve 88

The header is printed first.

5.2: END Block

Executes once after all input is processed:

awk '{print $1} END {print "Total:", NR, "students"}' grades.txt

Output:

Alice
Bob
Charlie
David
Eve
Total: 5 students

5.3: Combining BEGIN and END

awk 'BEGIN {print "=== Student Report ==="}
     {print $1, $2}
     END {print "=== End of Report ==="}' grades.txt

Output:

=== Student Report ===
Alice 85
Bob 78
Charlie 92
David 65
Eve 88
=== End of Report ===

Part 6: Calculations and Arithmetic

awk can perform calculations on your data.

6.1: Calculate Average

Average the three scores for each student:

awk '{avg = ($2 + $3 + $4) / 3; print $1, avg}' grades.txt

Output:

Alice 88.3333
Bob 84.3333
Charlie 91.6667
David 67.6667
Eve 90

6.2: Sum a Column

Sum all first scores:

awk '{sum += $2} END {print "Total:", sum}' grades.txt

Output:

Total: 408

Breaking it down:

  • sum += $2 — Add each student's first score to sum
  • END {print "Total:", sum} — After all lines, print the total

6.3: Calculate Average of Column

awk '{sum += $2} END {print "Average:", sum/NR}' grades.txt

Output:

Average: 81.6

NR in the END block equals the total number of lines.

6.4: Count Matches

Count students with scores > 80:

awk '$2 > 80 {count++} END {print count, "students scored >80"}' grades.txt

Output:

3 students scored >80

6.5: Find Maximum

awk 'BEGIN {max=0} $2 > max {max=$2; name=$1} END {print name, max}' grades.txt

Output:

Charlie 92

What happens:

  • BEGIN {max=0} — Initialize max
  • $2 > max — If current score is greater than max
  • {max=$2; name=$1} — Update max and remember the name
  • END {print name, max} — Print the winner

Part 7: Formatting Output

7.1: printf for Formatted Output

Use printf for precise formatting (like in C):

awk '{printf "%-10s %3d\n", $1, $2}' grades.txt

Output:

Alice       85
Bob         78
Charlie     92
David       65
Eve         88

Format specifiers:

  • %-10s — Left-aligned string, 10 characters wide
  • %3d — Integer, 3 digits wide
  • \n — Newline (printf doesn't auto-newline)

7.2: Format Numbers

awk '{avg = ($2+$3+$4)/3; printf "%s: %.2f\n", $1, avg}' grades.txt

Output:

Alice: 88.33
Bob: 84.33
Charlie: 91.67
David: 67.67
Eve: 90.00

%.2f means floating-point with 2 decimal places.

7.3: Create Tables

awk 'BEGIN {printf "%-10s %5s %5s %5s %7s\n", "Name", "S1", "S2", "S3", "Avg"}
     {avg=($2+$3+$4)/3; printf "%-10s %5d %5d %5d %7.2f\n", $1, $2, $3, $4, avg}' grades.txt

Output:

Name           S1    S2    S3     Avg
Alice          85    92    88   88.33
Bob            78    85    90   84.33
Charlie        92    88    95   91.67
David          65    70    68   67.67
Eve            88    90    92   90.00

Part 8: Working with /etc/passwd

Let's apply awk to a real system file: /etc/passwd.

The format of /etc/passwd is:

username:password:UID:GID:comment:home:shell

Fields are separated by colons (:).

8.1: Extract Usernames

awk -F':' '{print $1}' /etc/passwd | head -5

Example output:

root
daemon
bin
sys
sync

8.2: Find Users with Bash Shell

awk -F':' '$7 == "/bin/bash" {print $1}' /etc/passwd

Shows users whose shell is /bin/bash.

8.3: Extract UIDs Greater Than 1000

Regular users typically have UID ≥ 1000:

awk -F':' '$3 >= 1000 {print $1, $3}' /etc/passwd

Example output:

alice 1001
bob 1002
charlie 1003

8.4: Count Shell Types

awk -F':' '{shells[$7]++} END {for (s in shells) print s, shells[s]}' /etc/passwd

Example output:

/bin/bash 5
/usr/sbin/nologin 25
/bin/sync 1
/bin/false 3

This uses an associative array to count occurrences.

8.5: Pretty Print User Info

awk -F':' '$3 >= 1000 {printf "User: %-15s UID: %5d Home: %s\n", $1, $3, $6}' /etc/passwd

Example output:

User: alice            UID:  1001 Home: /home/alice
User: bob              UID:  1002 Home: /home/bob
User: charlie          UID:  1003 Home: /home/charlie

Part 9: Real-World System Administration Examples

9.1: Analyze Disk Usage

Show directories using most space:

du -sh /var/* | awk '$1 ~ /G/ {print $2, $1}'

This shows only directories with gigabyte usage.

Better version with sort:

du -sk /var/* | sort -rn | head -10 | awk '{printf "%5dMB %s\n", $1/1024, $2}'

Shows top 10 directories by size in MB.

9.2: Process Analysis

Find processes using most memory:

ps aux | awk 'NR>1 {print $4, $11}' | sort -rn | head -10

Breaking it down:

  • NR>1 — Skip header line
  • $4 — Memory percentage
  • $11 — Command name
  • sort -rn — Sort numerically, reverse order

9.3: Parse Apache Access Logs

Extract IP addresses and count requests per IP:

awk '{ips[$1]++} END {for (ip in ips) print ip, ips[ip]}' /var/log/apache2/access.log | sort -k2 -rn | head

Shows IPs with most requests.

9.4: Calculate Average Load from uptime

uptime | awk -F'load average:' '{print $2}' | awk -F',' '{avg=($1+$2+$3)/3; printf "Avg load: %.2f\n", avg}'

Calculates average of 1, 5, and 15-minute load averages.

9.5: Monitor Network Connections

ss -tan | awk 'NR>1 {states[$1]++} END {for (s in states) print s, states[s]}'

Counts connection states (ESTABLISHED, TIME_WAIT, etc.).

9.6: Parse CSV Files

Create a sample CSV:

cat > sales.csv << 'EOF'
Product,Quantity,Price
Laptop,5,1200
Mouse,20,25
Keyboard,15,75
Monitor,8,300
EOF

Calculate total revenue:

awk -F',' 'NR>1 {total += $2 * $3} END {printf "Total Revenue: $%d\n", total}' sales.csv

Output:

Total Revenue: $10025

9.7: Find Large Files

find /var/log -type f -exec ls -lh {} \; | awk '$5 ~ /M|G/ {print $5, $9}'

Shows files with size in megabytes or gigabytes.

9.8: Summarize Log Errors by Hour

awk '/error/ {hour=substr($3,1,2); hours[hour]++} END {for (h in hours) print h":00 -", hours[h], "errors"}' /var/log/syslog

Groups errors by hour of day.


Part 10: Arrays in awk

awk supports associative arrays (like Python dictionaries or Bash associative arrays).

10.1: Basic Array Usage

awk 'BEGIN {
    fruits["apple"] = 5
    fruits["banana"] = 3
    fruits["orange"] = 7

    print "Apples:", fruits["apple"]
    print "Bananas:", fruits["banana"]
}'

Output:

Apples: 5
Bananas: 3

10.2: Loop Through Array

awk 'BEGIN {
    fruits["apple"] = 5
    fruits["banana"] = 3
    fruits["orange"] = 7

    for (fruit in fruits) {
        print fruit, fruits[fruit]
    }
}'

Output:

apple 5
banana 3
orange 7
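
Note that for (key in array) visits keys in an unspecified order — the order above just happens to match insertion order. You can also test whether a key exists with the in operator:

awk 'BEGIN {fruits["apple"] = 5; if ("apple" in fruits) print "apple:", fruits["apple"]}'

Output:

apple: 5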

10.3: Count Occurrences

Count how many times each name appears:

cat > names.txt << 'EOF'
Alice
Bob
Alice
Charlie
Bob
Alice
EOF

awk '{count[$1]++} END {for (name in count) print name, count[name]}' names.txt

Output:

Alice 3
Bob 2
Charlie 1

10.4: Group Data

Create a sales file:

cat > daily_sales.txt << 'EOF'
Monday Laptop 1200
Monday Mouse 25
Tuesday Laptop 1200
Tuesday Keyboard 75
Monday Monitor 300
Tuesday Mouse 25
EOF

Sum sales by day:

awk '{sales[$1] += $3} END {for (day in sales) printf "%s: $%d\n", day, sales[day]}' daily_sales.txt

Output:

Monday: $1525
Tuesday: $1300

Part 11: Multi-Line awk Programs

For complex tasks, you can write awk as a script file.

11.1: Using -f Flag

Create an awk script:

cat > stats.awk << 'EOF'
BEGIN {
    print "=== Grade Statistics ==="
}

{
    sum += $2
    if ($2 > max) {
        max = $2
        top_student = $1
    }
    if (min == 0 || $2 < min) {
        min = $2
    }
}

END {
    print "Total students:", NR
    print "Average score:", sum/NR
    print "Highest score:", max, "(" top_student ")"
    print "Lowest score:", min
}
EOF

# Run the script
awk -f stats.awk grades.txt

Output:

=== Grade Statistics ===
Total students: 5
Average score: 81.6
Highest score: 92 (Charlie)
Lowest score: 65

11.2: Inline Multi-Line

awk '
BEGIN { print "Processing..." }
{
    if ($2 > 85) {
        print $1, "is excellent"
    } else if ($2 > 75) {
        print $1, "is good"
    } else {
        print $1, "needs improvement"
    }
}
END { print "Done!" }
' grades.txt

Output:

Processing...
Alice is good
Bob is good
Charlie is excellent
David needs improvement
Eve is excellent
Done!

Part 12: Practical awk One-Liners

Essential One-Liners

Print specific columns:

awk '{print $1, $3}'

Sum a column:

awk '{sum+=$1} END {print sum}'

Average a column:

awk '{sum+=$1} END {print sum/NR}'

Count lines:

awk 'END {print NR}'

Print lines longer than 80 characters:

awk 'length > 80'

Remove duplicate lines (while maintaining order):

awk '!seen[$0]++'

Print every 5th line:

awk 'NR % 5 == 0'

Print lines between patterns:

awk '/START/,/END/'

Replace field:

awk '{$2="REDACTED"; print}'

Add line numbers:

awk '{print NR, $0}'

Part 13: Practice Labs

Let's practice with comprehensive labs!

Warm-up Labs (1-5): Basic Operations

Lab 1: Create Test Data and Extract Fields

Task: Create a file with employee data (name, department, salary). Extract and print only names and salaries.

Solution
# Create employee file
cat > employees.txt << 'EOF'
Alice Engineering 95000
Bob Marketing 75000
Charlie Engineering 98000
David HR 72000
Eve Sales 85000
Frank Engineering 102000
EOF

# Extract names and salaries
awk '{print $1, $3}' employees.txt

Output:

Alice 95000
Bob 75000
Charlie 98000
David 72000
Eve 85000
Frank 102000

Lab 2: Print Line Numbers

Task: Using the employees.txt file, print each line with its line number.

Solution
awk '{print NR, $0}' employees.txt

Output:

1 Alice Engineering 95000
2 Bob Marketing 75000
3 Charlie Engineering 98000
4 David HR 72000
5 Eve Sales 85000
6 Frank Engineering 102000

Lab 3: Filter Based on Condition

Task: Print employees with salary greater than 80000.

Solution
awk '$3 > 80000 {print $1, $3}' employees.txt

Output:

Alice 95000
Charlie 98000
Eve 85000
Frank 102000

Lab 4: Count Number of Fields

Task: Print the number of fields in each line of employees.txt.

Solution
awk '{print "Line", NR, "has", NF, "fields"}' employees.txt

Output:

Line 1 has 3 fields
Line 2 has 3 fields
Line 3 has 3 fields
Line 4 has 3 fields
Line 5 has 3 fields
Line 6 has 3 fields

Lab 5: Print Last Field

Task: Print the name (first field) and salary (last field) using $NF.

Solution
awk '{print $1, $NF}' employees.txt

Output:

Alice 95000
Bob 75000
Charlie 98000
David 72000
Eve 85000
Frank 102000

Core Practice Labs (6-13): Intermediate Skills

Lab 6: Calculate Total Salaries

Task: Calculate the total of all salaries in employees.txt.

Solution
awk '{sum += $3} END {print "Total salaries: $" sum}' employees.txt

Output:

Total salaries: $527000

Lab 7: Calculate Average Salary

Task: Calculate the average salary.

Solution
awk '{sum += $3} END {printf "Average salary: $%.2f\n", sum/NR}' employees.txt

Output:

Average salary: $87833.33

Lab 8: Find Maximum and Minimum

Task: Find the employee with the highest salary and the one with the lowest.

Solution
awk 'BEGIN {min=999999; max=0}
     $3 > max {max=$3; max_name=$1}
     $3 < min {min=$3; min_name=$1}
     END {
         print "Highest:", max_name, "$" max
         print "Lowest:", min_name, "$" min
     }' employees.txt

Output:

Highest: Frank $102000
Lowest: David $72000

Lab 9: Group by Department

Task: Count how many employees are in each department.

Solution
awk '{dept[$2]++} END {for (d in dept) print d, dept[d], "employees"}' employees.txt

Output:

Engineering 3 employees
Marketing 1 employees
HR 1 employees
Sales 1 employees

Lab 10: Sum Salaries by Department

Task: Calculate total salaries for each department.

Solution
awk '{dept_salary[$2] += $3}
     END {
         for (d in dept_salary) {
             printf "%s: $%d\n", d, dept_salary[d]
         }
     }' employees.txt

Output:

Engineering: $295000
Marketing: $75000
HR: $72000
Sales: $85000

Lab 11: Format Output as Table

Task: Create a nicely formatted table with headers.

Solution
awk 'BEGIN {printf "%-15s %-15s %10s\n", "Name", "Department", "Salary"; print "-------------------------------------------"}
     {printf "%-15s %-15s $%9d\n", $1, $2, $3}' employees.txt

Output:

Name            Department           Salary
-------------------------------------------
Alice           Engineering      $    95000
Bob             Marketing        $    75000
Charlie         Engineering      $    98000
David           HR               $    72000
Eve             Sales            $    85000
Frank           Engineering      $   102000

Lab 12: Filter with Multiple Conditions

Task: Find Engineering employees with salary > 95000.

Solution
awk '$2 == "Engineering" && $3 > 95000 {print $1, $3}' employees.txt

Output:

Charlie 98000
Frank 102000

Lab 13: Change Field Separator

Task: Create a colon-separated file and process it.

Solution
# Create colon-separated data
cat > data.txt << 'EOF'
Alice:30:Engineering
Bob:25:Marketing
Charlie:35:Engineering
EOF

# Process with custom separator
awk -F':' '{print $1, "is", $2, "years old"}' data.txt

Output:

Alice is 30 years old
Bob is 25 years old
Charlie is 35 years old

Advanced Labs (14-20): Complex Scenarios

Lab 14: Process /etc/passwd

Task: Extract all regular users (UID >= 1000) with their home directories and shells.

Solution
awk -F':' '$3 >= 1000 && $3 < 65534 {printf "%-15s %-25s %s\n", $1, $6, $7}' /etc/passwd

This shows usernames, home directories, and shells for regular users (excluding nobody which has UID 65534).


Lab 15: Calculate Disk Usage Summary

Task: Use df output to calculate total used disk space.

Solution
df -k | awk 'NR>1 {sum+=$3} END {printf "Total used: %.2f GB\n", sum/1024/1024}'

Explanation:

  • NR>1 skips header
  • $3 is the "Used" column in KB
  • Convert KB to GB by dividing by 1024 twice

Lab 16: Parse Log File by Time

Task: Create a mock syslog and count messages by hour.

Solution
# Create mock log
cat > system.log << 'EOF'
Dec 10 08:15:23 server app: Starting
Dec 10 08:16:45 server app: Ready
Dec 10 09:01:12 server app: Processing
Dec 10 09:15:30 server app: Complete
Dec 10 10:30:45 server app: Error occurred
Dec 10 10:31:00 server app: Recovering
EOF

# Count by hour
awk '{hour=substr($3,1,2); hours[hour]++}
     END {for (h in hours) print h":00 -", hours[h], "messages"}' system.log

Output:

08:00 - 2 messages
09:00 - 2 messages
10:00 - 2 messages

Lab 17: Remove Duplicates While Preserving Order

Task: Create a file with duplicate lines and remove them.

Solution
# Create file with duplicates
cat > duplicates.txt << 'EOF'
apple
banana
apple
cherry
banana
date
apple
EOF

# Remove duplicates
awk '!seen[$0]++' duplicates.txt

Output:

apple
banana
cherry
date

How it works: !seen[$0]++ returns true the first time each line is seen (when seen[$0] is 0), then increments it.
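
If the condensed form feels cryptic, this longer equivalent does the same thing:

awk '{if (seen[$0] == 0) print; seen[$0]++}' duplicates.txt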


Lab 18: Calculate Running Total

Task: Create sales data and show running total.

Solution
# Create sales data
cat > sales.txt << 'EOF'
Monday 1200
Tuesday 850
Wednesday 1500
Thursday 920
Friday 2100
EOF

# Show running total
awk '{sum+=$2; printf "%s %5d (Total: %d)\n", $1, $2, sum}' sales.txt

Output:

Monday  1200 (Total: 1200)
Tuesday   850 (Total: 2050)
Wednesday  1500 (Total: 3550)
Thursday   920 (Total: 4470)
Friday  2100 (Total: 6570)

Lab 19: Process CSV with Quoted Fields

Task: Handle CSV files with comma-separated values and quoted fields.

Solution
# Create CSV with quoted fields
cat > products.csv << 'EOF'
"Laptop",5,1200
"Mouse, Wireless",20,25
"Keyboard",15,75
EOF

# Process with GNU awk's FPAT: define what a *field* looks like,
# so quoted fields may contain commas
awk 'BEGIN {FPAT = "([^,]+)|(\"[^\"]+\")"} {gsub(/"/, "", $1); print $1, "Qty:", $2, "Price:", $3}' products.csv

Output:

Laptop Qty: 5 Price: 1200
Mouse, Wireless Qty: 20 Price: 25
Keyboard Qty: 15 Price: 75

Note: FPAT is a GNU awk (gawk) extension, and even this covers only simple cases; production CSV parsing (escaped quotes, embedded newlines) needs a dedicated parser.


Lab 20: Generate Report from Multiple Metrics

Task: Create a comprehensive system report using ps output.

Solution
ps aux | awk '
BEGIN {
    print "=== Process Analysis Report ==="
    print ""
}
NR > 1 {
    users[$1]++
    mem[$1] += $4
    cpu[$1] += $3
}
END {
    print "Processes by user:"
    for (u in users) {
        printf "  %-15s %3d processes, CPU: %5.1f%%, MEM: %5.1f%%\n",
               u, users[u], cpu[u], mem[u]
    }
    print ""
    print "Total processes:", NR-1
}'

This creates a summary showing process count, CPU, and memory usage per user.


Best Practices

1. Quote Your awk Scripts

# Good
awk '{print $1}' file.txt

# Bad: the shell expands $1 and splits the braces before awk
# ever sees them — typically a syntax error or the wrong program
awk {print $1} file.txt

2. Use BEGIN for Initialization

# Good
awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'

# Works but less clear
awk '{sum+=$1} END {print sum}'

3. Use Meaningful Variable Names

# Good
awk '{total_sales += $3} END {print total_sales}'

# Works but cryptic
awk '{x+=$3} END {print x}'

4. Format Complex Scripts for Readability

# For complex logic, use multi-line format
awk '
    BEGIN { FS=":"; OFS="\t" }
    $3 >= 1000 {
        print $1, $6, $7
    }
' /etc/passwd

5. Test Patterns First

# Test your pattern matching before adding actions
awk '$3 > 1000' file.txt  # See what matches

# Then add the action
awk '$3 > 1000 {print $1, $3}' file.txt

6. Use printf for Formatted Output

# Instead of print for numbers
awk '{printf "%.2f\n", $1}'  # Controls decimal places

7. Specify Field Separator Explicitly

# Explicit is better than implicit
awk -F':' '{print $1}' /etc/passwd

# Even better with BEGIN
awk 'BEGIN {FS=":"} {print $1}' /etc/passwd

Common Pitfalls to Avoid

1. Forgetting Field Separator for Non-Whitespace Data

# Wrong for colon-separated data
awk '{print $1}' /etc/passwd  # Prints entire line!

# Correct
awk -F':' '{print $1}' /etc/passwd

2. Not Handling Empty Lines

# Can cause division by zero
awk '{avg = ($1+$2)/$3; print avg}' data.txt

# Better
awk '$3 != 0 {avg = ($1+$2)/$3; print avg}' data.txt

3. Mixing print and printf

# No newline with printf! Worse, $1 becomes the format string,
# so a stray % in the data would be misinterpreted
awk '{printf $1}' file.txt  # All on one line

# Correct
awk '{printf "%s\n", $1}' file.txt

4. Not Initializing Variables

# May not work as expected
awk '$1 > max {max=$1}' file.txt  # max is undefined initially

# Better
awk 'BEGIN {max=0} $1 > max {max=$1} END {print max}' file.txt

5. Forgetting About NR in END Block

# NR in END is total line count, not current line
awk '{sum+=$1} END {print sum/NR}' file.txt  # Correct for average

6. String vs Numeric Comparison

# String comparison
awk '$1 == "100"'  # Matches string "100"

# Numeric comparison
awk '$1 == 100'    # Matches number 100
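
One more wrinkle: fields read from input that look like numbers are compared numerically, while string constants always compare lexically:

echo "10 9" | awk '{print ($1 > $2)}'   # 1 — numeric comparison of input fields
awk 'BEGIN {print ("10" > "9")}'        # 0 — string comparison: "1" sorts before "9"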

Quick Reference

Basic Syntax

awk 'pattern {action}' file
awk -F':' '{print $1}' file     # Custom field separator
awk -f script.awk file           # Run awk script file

Field Variables

| Variable | Meaning |
|----------|---------|
| $0 | Entire line |
| $1, $2, $3... | First, second, third field |
| $NF | Last field |
| $(NF-1) | Second-to-last field |

Built-in Variables

| Variable | Meaning |
|----------|---------|
| NR | Current record (line) number |
| NF | Number of fields in current record |
| FS | Input field separator (default: whitespace) |
| OFS | Output field separator (default: space) |
| RS | Record separator (default: newline) |
| ORS | Output record separator (default: newline) |

Operators

| Operator | Meaning |
|----------|---------|
| == | Equal |
| != | Not equal |
| <, >, <=, >= | Comparison |
| ~ | Matches regex |
| !~ | Doesn't match regex |
| && | Logical AND |
| || | Logical OR |
| ! | Logical NOT |

Common Patterns

# All lines
awk '{print}' file

# Lines matching pattern
awk '/pattern/ {print}' file

# Lines NOT matching
awk '!/pattern/ {print}' file

# Specific field matches
awk '$1 == "value"' file

# Numeric comparison
awk '$3 > 100' file

# Multiple conditions
awk '$1 == "A" && $2 > 50' file

# Field matches regex
awk '$1 ~ /^A/' file

Common Actions

# Print fields
awk '{print $1, $2}'

# Print with formatting
awk '{printf "%-10s %5d\n", $1, $2}'

# Calculate sum
awk '{sum += $1} END {print sum}'

# Calculate average
awk '{sum += $1} END {print sum/NR}'

# Count matches
awk '/pattern/ {count++} END {print count}'

# Find max
awk 'BEGIN{max=0} $1>max {max=$1} END {print max}'

Key Takeaways

  1. awk is a programming language — Not just a command-line tool
  2. Fields are automatic — awk splits lines into fields for you
  3. Patterns filter lines — Actions run only on matching lines
  4. BEGIN and END are special — Execute before and after main processing
  5. Arrays are powerful — Use associative arrays to aggregate data
  6. Built-in variables — NR, NF, FS, etc. provide essential info
  7. printf for formatting — Better control than print
  8. Combine with pipes — awk works great with other commands
  9. Test incrementally — Build complex awk scripts step by step
  10. Script files for complexity — Use -f for multi-line programs

What's Next?

Congratulations! You've learned awk, one of the most powerful text-processing tools in Linux. You can now:

  • Extract and manipulate fields from structured data
  • Perform calculations on columns
  • Aggregate and summarize data
  • Create formatted reports
  • Process system files like /etc/passwd
  • Analyze logs and command output

In the next post, we'll explore sed (stream editor) for in-place text transformations and substitutions. Combined with grep and awk, sed completes the holy trinity of Linux text processing!

Practice Challenge: Use awk to analyze your system's /etc/passwd file:

awk -F':' 'BEGIN {print "System User Summary"} $3<1000 {sys++} $3>=1000 {usr++} END {print "System users:", sys; print "Regular users:", usr}' /etc/passwd

How many system vs regular users do you have? 📊

Thank you for reading!

Published on December 31, 2025

Owais

Written by Owais

I'm an AIOps Engineer with a passion for AI, Operating Systems, Cloud, and Security—sharing insights that matter in today's tech world.

I completed the UK's Eduqual Level 6 Diploma in AIOps from Al Nafi International College, a globally recognized program that's changing careers worldwide. This diploma is:

  • ✅ Available online in 17+ languages
  • ✅ Includes free student visa guidance for Master's programs in Computer Science fields across the UK, USA, Canada, and more
  • ✅ Comes with job placement support and a 90-day success plan once you land a role
  • ✅ Offers a 1-year internship experience letter while you study—all with no hidden costs

It's not just a diploma—it's a career accelerator.

👉 Start your journey today with a 7-day free trial
