Data analysis is one of the most powerful applications of Python, and pandas is the go-to library for working with structured data. In this comprehensive tutorial, we'll walk through a real terminal session where we explore a CSV dataset, understand its structure, and handle missing values - all explained step by step for absolute beginners.
🎯 What You'll Learn: In this hands-on tutorial, you'll discover:
- How to set up your workspace for pandas data analysis
- Reading CSV files into pandas DataFrames
- Exploring data structure with head() and describe() methods
- Understanding and interpreting data statistics
- Identifying missing values in your dataset
- Cleaning data by removing rows with missing values
- Interpreting terminal outputs and understanding what each result means
- Best practices for initial data exploration
🚀 Setting Up Our Data Analysis Environment
Understanding Our Starting Point
Let's begin by examining our working directory to see what data we have available:
ls
Terminal Output:
data.csv
Command Explanation:
- `ls` lists all files and directories in the current location
- Output Analysis: We can see there's a file called `data.csv` - this is our dataset that we'll be analyzing
- File Type: CSV stands for "Comma-Separated Values" - a common format for storing tabular data (we peek at the raw file below)
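Before involving pandas at all, a few lines of plain Python let you peek at the raw file (a minimal sketch, assuming `data.csv` sits in the current directory, as `ls` showed):

```python
# Print the first few raw lines of the CSV, header row included.
with open('data.csv') as f:
    for _ in range(3):
        print(f.readline().rstrip())
```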
Prerequisites
Before we dive in, make sure you have:
- Python 3.x installed on your system
- The pandas library installed (`pip install pandas`; a quick version check follows this list)
- Basic understanding of Python variables and functions
- Familiarity with terminal/command line operations
- A CSV file to work with (or use our example data structure)
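As a quick sanity check before writing any code, you can confirm that pandas imports and see which version you have:

```python
# If this import fails, install pandas first: pip install pandas
import pandas as pd
print(pd.__version__)
```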
📝 Creating Our First Pandas Script
Setting Up the Analysis File
nano pandas_lab.py
Command Explanation:
- `nano` opens a text editor in the terminal
- `pandas_lab.py` is the name of our Python script file
- The `.py` extension tells us this is a Python file
Let's create our initial script to read and display the data:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
Breaking Down the Pandas Code
Code Line | Purpose | Explanation |
---|---|---|
import pandas as pd | Import library | Brings pandas functionality into our script, 'pd' is a common alias |
df = pd.read_csv('data.csv') | Read CSV file | Creates a DataFrame (df) from the CSV file |
print(df.head()) | Display data | Shows the first 5 rows of the dataset |
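Note that `head()` also accepts a row count if five rows isn't the preview you want:

```python
import pandas as pd

df = pd.read_csv('data.csv')
print(df.head(10))  # first 10 rows instead of the default 5
print(df.tail())    # the last 5 rows, the mirror image of head()
```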
Viewing Our Script Content
cat pandas_lab.py
Terminal Output:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
Command Explanation:
- `cat` displays the contents of a file in the terminal
- This confirms our script was saved correctly
- We can see the exact code that will be executed
🔍 First Data Exploration: Understanding df.head()
Running Our Initial Script
python pandas_lab.py
Terminal Output:
name age salary department experience_years performance_score city join_date
0 Alice Johnson 28.0 65000.0 Engineering 3.0 8.5 New York 2021-03-15
1 Bob Smith 34.0 72000.0 Marketing 7.0 7.2 San Francisco 2019-01-20
2 Carol Davis 29.0 NaN Engineering 4.0 9.1 Boston 2020-07-10
3 David Wilson 45.0 95000.0 Finance 12.0 6.8 Chicago 2015-05-30
4 Emma Brown 31.0 68000.0 Marketing 5.0 NaN Seattle 2018-11-12
Understanding the Output Structure
Column | Data Type | Description | Sample Value |
---|---|---|---|
name | String | Employee full name | Alice Johnson |
age | Float | Employee age in years | 28.0 |
salary | Float | Annual salary in dollars | 65000.0 |
department | String | Work department | Engineering |
experience_years | Float | Years of work experience | 3.0 |
performance_score | Float | Employee performance rating | 8.5 |
city | String | Employee location | New York |
join_date | String | Date employee joined company | 2021-03-15 |
⚠️ Important Observation: Notice the `NaN` values in row 2 (Carol Davis - salary) and row 4 (Emma Brown - performance_score). `NaN` stands for "Not a Number" and represents missing data. It's also why the numeric columns display as floats (e.g., 28.0 instead of 28): pandas stores integer columns that contain NaN as float64.
📋 Viewing the Complete Dataset
Modifying Our Script to See All Data
nano pandas_lab.py
Let's add a line to see the complete dataset:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df)
Checking Our Updated Script
cat pandas_lab.py
Terminal Output:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df)
Running the Updated Script
python pandas_lab.py
Terminal Output (Full Dataset):
name age salary department experience_years performance_score city join_date
0 Alice Johnson 28.0 65000.0 Engineering 3.0 8.5 New York 2021-03-15
1 Bob Smith 34.0 72000.0 Marketing 7.0 7.2 San Francisco 2019-01-20
2 Carol Davis 29.0 NaN Engineering 4.0 9.1 Boston 2020-07-10
3 David Wilson 45.0 95000.0 Finance 12.0 6.8 Chicago 2015-05-30
4 Emma Brown 31.0 68000.0 Marketing 5.0 NaN Seattle 2018-11-12
name age salary department experience_years performance_score city join_date
0 Alice Johnson 28.0 65000.0 Engineering 3.0 8.5 New York 2021-03-15
1 Bob Smith 34.0 72000.0 Marketing 7.0 7.2 San Francisco 2019-01-20
2 Carol Davis 29.0 NaN Engineering 4.0 9.1 Boston 2020-07-10
3 David Wilson 45.0 95000.0 Finance 12.0 6.8 Chicago 2015-05-30
4 Emma Brown 31.0 68000.0 Marketing 5.0 NaN Seattle 2018-11-12
5 Frank Miller 38.0 82000.0 Engineering 9.0 8.9 Austin 2017-02-28
6 Grace Lee NaN 75000.0 Finance 6.0 7.5 Denver 2019-08-14
7 Henry Garcia 27.0 58000.0 Marketing 2.0 8.0 Miami 2022-01-05
8 Ivy Chen 33.0 88000.0 Engineering 8.0 9.3 Portland 2016-09-22
9 Jack Taylor 42.0 91000.0 Finance NaN 7.9 Nashville 2014-12-01
10 Kate Anderson 26.0 55000.0 Marketing 1.0 6.9 Phoenix 2023-03-18
11 Liam Rodriguez 36.0 79000.0 Engineering 7.0 8.7 Dallas 2018-06-25
12 Maya Patel 30.0 71000.0 Finance 5.0 8.2 NaN 2020-02-14
13 Noah Kim 29.0 66000.0 Marketing 3.0 7.8 Los Angeles 2021-04-30
14 Olivia White 35.0 85000.0 Engineering 9.0 9.0 San Diego 2016-10-12
15 Paul Thompson 41.0 89000.0 Finance 11.0 7.4 Houston 2013-08-20
16 Quinn Davis 28.0 NaN Marketing 2.0 8.1 Atlanta 2022-05-15
17 Rachel Green 32.0 77000.0 Engineering 6.0 8.8 Philadelphia 2018-12-03
18 Sam Wilson 39.0 83000.0 Finance 10.0 7.6 Detroit 2015-11-28
19 Tina Lopez 27.0 62000.0 Marketing 3.0 NaN Las Vegas 2021-09-10
Understanding the Complete Dataset
Key Observations:
- Total Records: 20 employees (indexed 0-19)
- Duplicate Display: The data appears twice because we have both `df.head()` and `print(df)` in our script
- Missing Values: We can spot several `NaN` values throughout the dataset:
  - Carol Davis (row 2): Missing salary
  - Emma Brown (row 4): Missing performance_score
  - Grace Lee (row 6): Missing age
  - Jack Taylor (row 9): Missing experience_years
  - Maya Patel (row 12): Missing city
  - Quinn Davis (row 16): Missing salary
  - Tina Lopez (row 19): Missing performance_score
Department | Count | Salary Range |
---|---|---|
Engineering | 7 employees | $65,000 - $88,000 |
Marketing | 7 employees | $55,000 - $72,000 |
Finance | 6 employees | $71,000 - $95,000 |
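This breakdown doesn't have to be tallied by hand; pandas can produce it directly (a quick sketch; note that min and max skip the missing salaries):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# Employees per department (counts every row, with or without a salary).
print(df['department'].value_counts())

# Salary range per department; NaN salaries are ignored by min/max.
print(df.groupby('department')['salary'].agg(['min', 'max']))
```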
📊 Statistical Analysis with describe()
Adding Statistical Summary
nano pandas_lab.py
Let's replace the full data display with a statistical summary:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df.describe())
Viewing the Updated Script
cat pandas_lab.py
Terminal Output:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df.describe())
Running Statistical Analysis
python pandas_lab.py
Terminal Output:
name age salary department experience_years performance_score city join_date
0 Alice Johnson 28.0 65000.0 Engineering 3.0 8.5 New York 2021-03-15
1 Bob Smith 34.0 72000.0 Marketing 7.0 7.2 San Francisco 2019-01-20
2 Carol Davis 29.0 NaN Engineering 4.0 9.1 Boston 2020-07-10
3 David Wilson 45.0 95000.0 Finance 12.0 6.8 Chicago 2015-05-30
4 Emma Brown 31.0 68000.0 Marketing 5.0 NaN Seattle 2018-11-12
age salary experience_years performance_score
count 19.000000 18.000000 19.000000 18.000000
mean 33.157895 75611.111111 5.947368 8.094444
std 5.679037 11783.033110 3.257021 0.770387
min 26.000000 55000.000000 1.000000 6.800000
25% 28.500000 66500.000000 3.000000 7.525000
50% 32.000000 76000.000000 6.000000 8.050000
75% 37.000000 84500.000000 8.500000 8.775000
max 45.000000 95000.000000 12.000000 9.300000
Understanding the Statistical Summary
Statistic | Age | Salary | Experience Years | Performance Score |
---|---|---|---|---|
count | 19 | 18 | 19 | 18 |
mean | 33.16 years | $75,611 | 5.95 years | 8.09/10 |
std | 5.68 years | $11,783 | 3.26 years | 0.77 points |
min | 26 years | $55,000 | 1 year | 6.8/10 |
25% | 28.5 years | $66,500 | 3 years | 7.53/10 |
50% | 32 years | $76,000 | 6 years | 8.05/10 |
75% | 37 years | $84,500 | 8.5 years | 8.78/10 |
max | 45 years | $95,000 | 12 years | 9.3/10 |
Key Insights from the Statistics:
- Count Discrepancies: Notice how salary and performance_score have only 18 valid entries (and age and experience_years have 19) instead of 20, indicating missing values
- Age Distribution: Employees range from 26-45 years, with an average of 33 years
- Salary Range: $55,000 to $95,000, with median at $76,000
- Experience: 1-12 years of experience, median is 6 years
- Performance: Scores range from 6.8 to 9.3 out of 10, with most employees performing well (mean: 8.09)
ℹ️ Statistical Terms Explained (note: describe() summarizes only the numeric columns by default, which is why the text columns are absent above; each statistic is also available as a standalone method, shown after this list):
- count: Number of non-missing values
- mean: Average value
- std: Standard deviation (measure of spread)
- min/max: Smallest and largest values
- 25%, 50%, 75%: Quartiles (percentiles showing data distribution)
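For example (the values in the comments come from the describe() output above):

```python
import pandas as pd

df = pd.read_csv('data.csv')

print(df['age'].mean())             # average age (about 33.16)
print(df['salary'].median())        # 50th percentile of salary (76000.0)
print(df['salary'].quantile(0.75))  # 75th percentile of salary (84500.0)
print(df['age'].std())              # standard deviation of age (about 5.68)
```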
🔍 Investigating Missing Values
Identifying Missing Data
nano pandas_lab.py
Let's modify our script to focus on missing values:
import pandas as pd
df = pd.read_csv('data.csv')
#print(df.head())
#print(df.describe())
print(df.isnull().sum())
Running Missing Values Analysis
python pandas_lab.py
Terminal Output:
name 0
age 1
salary 2
department 0
experience_years 1
performance_score 2
city 1
join_date 0
dtype: int64
Understanding Missing Values Output
Column | Missing Values | Impact | Percentage Missing |
---|---|---|---|
name | 0 | No issues - complete data | 0% |
age | 1 | Minor - age analysis affected | 5% |
salary | 2 | Moderate - compensation analysis affected | 10% |
department | 0 | No issues - complete data | 0% |
experience_years | 1 | Minor - experience analysis affected | 5% |
performance_score | 2 | Moderate - performance analysis affected | 10% |
city | 1 | Minor - location analysis affected | 5% |
join_date | 0 | No issues - complete data | 0% |
Understanding the Method:
- `df.isnull()` creates a DataFrame of True/False values (True where data is missing)
- `.sum()` counts the True values in each column
- Result: the number of missing values per column
⚠️ Data Quality Assessment: Our dataset has missing values in 5 out of 8 columns. This is common in real-world data and requires careful handling to avoid biased analysis.
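The "Percentage Missing" column in the table above is easy to compute yourself, because the mean of a True/False column is the fraction of True values (a small sketch):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# isnull() yields booleans; their mean is the fraction of missing values.
print((df.isnull().mean() * 100).round(1))
```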
🧹 Cleaning Data: Removing Missing Values
Creating a Clean Dataset
nano pandas_lab.py
Let's modify our script to remove rows with missing values:
import pandas as pd
df = pd.read_csv('data.csv')
df_cleaned = df.dropna()
print(df_cleaned.isnull().sum())
#print(df.head())
#print(df.describe())
#print(df.isnull().sum())
Viewing the Final Script
cat pandas_lab.py
Terminal Output:
import pandas as pd
df = pd.read_csv('data.csv')
df_cleaned = df.dropna()
print(df_cleaned.isnull().sum())
#print(df.head())
#print(df.describe())
#print(df.isnull().sum())
Running the Cleaned Data Analysis
python pandas_lab.py
Terminal Output:
name 0
age 0
salary 0
department 0
experience_years 0
performance_score 0
city 0
join_date 0
dtype: int64
Understanding Data Cleaning Results
What Happened:
- `df.dropna()` removed any row that contained at least one missing value
- Result: All columns now show 0 missing values
- New DataFrame: `df_cleaned` contains only complete records
Impact Analysis:
Metric | Original Dataset | Cleaned Dataset | Data Loss |
---|---|---|---|
Total Rows | 20 employees | 13 employees | 35% data loss |
Missing Values | 7 total missing values | 0 missing values | 100% of missing values removed |
Data Quality | Incomplete records | Complete records only | Higher quality, smaller sample |
Each of the 7 missing values falls in a different row, so dropna() removes exactly 7 of the 20 rows, leaving 13.
Removed Employees:
- Carol Davis (missing salary)
- Emma Brown (missing performance_score)
- Grace Lee (missing age)
- Jack Taylor (missing experience_years)
- Maya Patel (missing city)
- Quinn Davis (missing salary)
- Tina Lopez (missing performance_score)
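You can verify this list without reading the table by eye: select the rows that contain at least one missing value, since those are exactly the rows dropna() discards (a minimal check):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# Rows with any NaN are the ones dropna() removes.
dropped = df[df.isnull().any(axis=1)]
print(dropped['name'].tolist())
print(f"Rows before: {len(df)}, after dropna(): {len(df.dropna())}")
```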
ℹ️ Alternative Approaches: Instead of dropping rows, we could also (see the sketch after this list):
- Fill missing values with column averages (`df.fillna(df.mean(numeric_only=True))`)
- Fill with specific values (`fillna(0)` or `fillna('Unknown')`)
- Use forward/backward fill methods (`ffill()` / `bfill()`)
- Apply domain-specific logic for missing data
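Here is what those fill strategies might look like column by column (a sketch only; the right strategy depends on what a missing value means in your domain):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# fillna() returns a new Series, so assign the result back.
df['age'] = df['age'].fillna(df['age'].mean())             # numeric: fill with the mean
df['salary'] = df['salary'].fillna(df['salary'].median())  # numeric: fill with the median
df['city'] = df['city'].fillna('Unknown')                  # text: fill with a placeholder

# Forward fill: carry the previous row's value downward.
df['performance_score'] = df['performance_score'].ffill()

print(df.isnull().sum())  # only experience_years still has a missing value
```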
🎯 Best Practices for Pandas Data Analysis
Essential Commands Summary
Command | Purpose | When to Use |
---|---|---|
pd.read_csv() | Load CSV data | First step in any analysis |
df.head() | Preview first 5 rows | Initial data exploration |
df.describe() | Statistical summary | Understanding data distribution |
df.isnull().sum() | Count missing values | Data quality assessment |
df.dropna() | Remove missing values | Data cleaning |
Data Analysis Workflow
import pandas as pd
# Step 1: Load Data
df = pd.read_csv('your_data.csv')
# Step 2: Initial Exploration
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
# Step 3: Data Types and Info
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so no print() wrapper is needed
# Step 4: Statistical Summary
print("\nStatistical Summary:")
print(df.describe())
# Step 5: Missing Values Analysis
print("\nMissing Values:")
print(df.isnull().sum())
# Step 6: Data Cleaning (if needed)
df_cleaned = df.dropna() # or other cleaning methods
# Step 7: Verify Cleaned Data
print("\nCleaned Data Info:")
print(df_cleaned.isnull().sum())
🚨 Common Mistakes to Avoid
1. Not Checking Data Types
# Don't assume salary is numeric
total_salary = df['salary'].sum()  # May concatenate strings or raise an error if salary was read as text
# Check data types first
print(df.dtypes)
# Convert if needed
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
2. Dropping Too Much Data
# Don't always drop missing values
df_cleaned = df.dropna() # Might lose too much data
# Consider alternatives (fillna returns a copy, so assign the result back)
df['age'] = df['age'].fillna(df['age'].mean())  # Fill with average
df['department'] = df['department'].fillna('Unknown')  # Fill with placeholder
3. Not Saving Your Work
# Save cleaned dataset
df_cleaned.to_csv('cleaned_data.csv', index=False)
# Save analysis results
analysis_summary = df.describe()
analysis_summary.to_csv('analysis_summary.csv')
🎯 Key Takeaways
✅ Remember These Points
- Always Start with Exploration: Use `head()`, `info()`, and `describe()` to understand your data
- Check for Missing Values: Use `isnull().sum()` before any analysis
- Understand Your Data: Look at data types, ranges, and distributions
- Clean Thoughtfully: Don't just drop missing values - consider the impact
- Document Your Process: Comment your code and save intermediate results
- Verify Results: Always check your cleaned data to ensure it makes sense
- Save Your Work: Export cleaned datasets and analysis results
🎉 Congratulations! You've mastered the fundamentals of pandas data analysis. You can now confidently load CSV files, explore data structure, generate statistical summaries, identify missing values, and clean datasets. These skills form the foundation for all data analysis and data science work in Python.
This tutorial demonstrated real terminal commands and pandas operations with detailed explanations of every step and output. Each command was explained to help beginners understand not just what to do, but why and how pandas works for data analysis.