Data analysis is one of the most powerful applications of Python, and pandas is the go-to library for working with structured data. In this comprehensive tutorial, we'll walk through a real terminal session where we explore a CSV dataset, understand its structure, and handle missing values - all explained step by step for absolute beginners.
🎯 What You'll Learn: In this hands-on tutorial, you'll discover:
- How to set up your workspace for pandas data analysis
- Reading CSV files into pandas DataFrames
- Exploring data structure with head() and describe() methods
- Understanding and interpreting data statistics
- Identifying missing values in your dataset
- Cleaning data by removing rows with missing values
- Interpreting terminal outputs and understanding what each result means
- Best practices for initial data exploration
🚀 Setting Up Our Data Analysis Environment
Understanding Our Starting Point
Let's begin by examining our working directory to see what data we have available:
ls
Terminal Output:
data.csv
Command Explanation:
- `ls` lists all files and directories in the current location
- Output Analysis: We can see there's a file called `data.csv` - this is our dataset that we'll be analyzing
- File Type: CSV stands for "Comma-Separated Values" - a common format for storing tabular data (we peek at the raw file below)
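Before involving pandas at all, a few lines of plain Python let you peek at the raw file (a minimal sketch, assuming `data.csv` sits in the current directory, as `ls` showed):

```python
# Print the first few raw lines of the CSV, header row included.
with open('data.csv') as f:
    for _ in range(3):
        print(f.readline().rstrip())
```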
Prerequisites
Before we dive in, make sure you have:
- Python 3.x installed on your system
- The pandas library installed (`pip install pandas`; a quick version check follows this list)
- Basic understanding of Python variables and functions
- Familiarity with terminal/command line operations
- A CSV file to work with (or use our example data structure)
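As a quick sanity check before writing any code, you can confirm that pandas imports and see which version you have:

```python
# If this import fails, install pandas first: pip install pandas
import pandas as pd
print(pd.__version__)
```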
📝 Creating Our First Pandas Script
Setting Up the Analysis File
nano pandas_lab.py
Command Explanation:
- `nano` opens a text editor in the terminal
- `pandas_lab.py` is the name of our Python script file
- The `.py` extension tells us this is a Python file
Let's create our initial script to read and display the data:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
Breaking Down the Pandas Code
Code Line | Purpose | Explanation |
---|---|---|
import pandas as pd | Import library | Brings pandas functionality into our script, 'pd' is a common alias |
df = pd.read_csv('data.csv') | Read CSV file | Creates a DataFrame (df) from the CSV file |
print(df.head()) | Display data | Shows the first 5 rows of the dataset |
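Note that `head()` also accepts a row count if five rows isn't the preview you want:

```python
import pandas as pd

df = pd.read_csv('data.csv')
print(df.head(10))  # first 10 rows instead of the default 5
print(df.tail())    # the last 5 rows, the mirror image of head()
```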
Viewing Our Script Content
cat pandas_lab.py
Terminal Output:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
Command Explanation:
- `cat` displays the contents of a file in the terminal
- This confirms our script was saved correctly
- We can see the exact code that will be executed
🔍 First Data Exploration: Understanding df.head()
Running Our Initial Script
python pandas_lab.py
Terminal Output:
name age salary department experience_years performance_score city join_date
0 Alice Johnson 28.0 65000.0 Engineering 3.0 8.5 New York 2021-03-15
1 Bob Smith 34.0 72000.0 Marketing 7.0 7.2 San Francisco 2019-01-20
2 Carol Davis 29.0 NaN Engineering 4.0 9.1 Boston 2020-07-10
3 David Wilson 45.0 95000.0 Finance 12.0 6.8 Chicago 2015-05-30
4 Emma Brown 31.0 68000.0 Marketing 5.0 NaN Seattle 2018-11-12
Understanding the Output Structure
Column | Data Type | Description | Sample Value |
---|---|---|---|
name | String | Employee full name | Alice Johnson |
age | Float | Employee age in years | 28.0 |
salary | Float | Annual salary in dollars | 65000.0 |
department | String | Work department | Engineering |
experience_years | Float | Years of work experience | 3.0 |
performance_score | Float | Employee performance rating | 8.5 |
city | String | Employee location | New York |
join_date | String | Date employee joined company | 2021-03-15 |
⚠️ Important Observation: Notice the `NaN` values in row 2 (Carol Davis - salary) and row 4 (Emma Brown - performance_score). `NaN` stands for "Not a Number" and represents missing data. It's also why the numeric columns display as floats (e.g., 28.0 instead of 28): pandas stores integer columns that contain NaN as float64.
📋 Viewing the Complete Dataset
Modifying Our Script to See All Data
nano pandas_lab.py
Let's add a line to see the complete dataset:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df)
Checking Our Updated Script
cat pandas_lab.py
Terminal Output:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df)
Running the Updated Script
python pandas_lab.py
Terminal Output (Full Dataset):
name age salary department experience_years performance_score city join_date
0 Alice Johnson 28.0 65000.0 Engineering 3.0 8.5 New York 2021-03-15
1 Bob Smith 34.0 72000.0 Marketing 7.0 7.2 San Francisco 2019-01-20
2 Carol Davis 29.0 NaN Engineering 4.0 9.1 Boston 2020-07-10
3 David Wilson 45.0 95000.0 Finance 12.0 6.8 Chicago 2015-05-30
4 Emma Brown 31.0 68000.0 Marketing 5.0 NaN Seattle 2018-11-12
name age salary department experience_years performance_score city join_date
0 Alice Johnson 28.0 65000.0 Engineering 3.0 8.5 New York 2021-03-15
1 Bob Smith 34.0 72000.0 Marketing 7.0 7.2 San Francisco 2019-01-20
2 Carol Davis 29.0 NaN Engineering 4.0 9.1 Boston 2020-07-10
3 David Wilson 45.0 95000.0 Finance 12.0 6.8 Chicago 2015-05-30
4 Emma Brown 31.0 68000.0 Marketing 5.0 NaN Seattle 2018-11-12
5 Frank Miller 38.0 82000.0 Engineering 9.0 8.9 Austin 2017-02-28
6 Grace Lee NaN 75000.0 Finance 6.0 7.5 Denver 2019-08-14
7 Henry Garcia 27.0 58000.0 Marketing 2.0 8.0 Miami 2022-01-05
8 Ivy Chen 33.0 88000.0 Engineering 8.0 9.3 Portland 2016-09-22
9 Jack Taylor 42.0 91000.0 Finance NaN 7.9 Nashville 2014-12-01
10 Kate Anderson 26.0 55000.0 Marketing 1.0 6.9 Phoenix 2023-03-18
11 Liam Rodriguez 36.0 79000.0 Engineering 7.0 8.7 Dallas 2018-06-25
12 Maya Patel 30.0 71000.0 Finance 5.0 8.2 NaN 2020-02-14
13 Noah Kim 29.0 66000.0 Marketing 3.0 7.8 Los Angeles 2021-04-30
14 Olivia White 35.0 85000.0 Engineering 9.0 9.0 San Diego 2016-10-12
15 Paul Thompson 41.0 89000.0 Finance 11.0 7.4 Houston 2013-08-20
16 Quinn Davis 28.0 NaN Marketing 2.0 8.1 Atlanta 2022-05-15
17 Rachel Green 32.0 77000.0 Engineering 6.0 8.8 Philadelphia 2018-12-03
18 Sam Wilson 39.0 83000.0 Finance 10.0 7.6 Detroit 2015-11-28
19 Tina Lopez 27.0 62000.0 Marketing 3.0 NaN Las Vegas 2021-09-10
Understanding the Complete Dataset
Key Observations:
- Total Records: 20 employees (indexed 0-19)
- Duplicate Display: The data appears twice because we have both `df.head()` and `print(df)` in our script
- Missing Values: We can spot several `NaN` values throughout the dataset:
  - Carol Davis (row 2): Missing salary
  - Emma Brown (row 4): Missing performance_score
  - Grace Lee (row 6): Missing age
  - Jack Taylor (row 9): Missing experience_years
  - Maya Patel (row 12): Missing city
  - Quinn Davis (row 16): Missing salary
  - Tina Lopez (row 19): Missing performance_score
Department | Count | Salary Range |
---|---|---|
Engineering | 7 employees | $65,000 - $88,000 |
Marketing | 7 employees | $55,000 - $72,000 |
Finance | 6 employees | $71,000 - $95,000 |
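This breakdown doesn't have to be tallied by hand; pandas can produce it directly (a quick sketch; note that min and max skip the missing salaries):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# Employees per department (counts every row, with or without a salary).
print(df['department'].value_counts())

# Salary range per department; NaN salaries are ignored by min/max.
print(df.groupby('department')['salary'].agg(['min', 'max']))
```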
📊 Statistical Analysis with describe()
Adding Statistical Summary
nano pandas_lab.py
Let's replace the full data display with a statistical summary:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df.describe())
Viewing the Updated Script
cat pandas_lab.py
Terminal Output:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df.describe())
Running Statistical Analysis
python pandas_lab.py
Terminal Output:
name age salary department experience_years performance_score city join_date
0 Alice Johnson 28.0 65000.0 Engineering 3.0 8.5 New York 2021-03-15
1 Bob Smith 34.0 72000.0 Marketing 7.0 7.2 San Francisco 2019-01-20
2 Carol Davis 29.0 NaN Engineering 4.0 9.1 Boston 2020-07-10
3 David Wilson 45.0 95000.0 Finance 12.0 6.8 Chicago 2015-05-30
4 Emma Brown 31.0 68000.0 Marketing 5.0 NaN Seattle 2018-11-12
age salary experience_years performance_score
count 19.000000 18.000000 19.000000 18.000000
mean 33.157895 75611.111111 5.947368 8.094444
std 5.679037 11783.033110 3.257021 0.770387
min 26.000000 55000.000000 1.000000 6.800000
25% 28.500000 66500.000000 3.000000 7.525000
50% 32.000000 76000.000000 6.000000 8.050000
75% 37.000000 84500.000000 8.500000 8.775000
max 45.000000 95000.000000 12.000000 9.300000
Understanding the Statistical Summary
Statistic | Age | Salary | Experience Years | Performance Score |
---|---|---|---|---|
count | 19 | 18 | 19 | 18 |
mean | 33.16 years | $75,611 | 5.95 years | 8.09/10 |
std | 5.68 years | $11,783 | 3.26 years | 0.77 points |
min | 26 years | $55,000 | 1 year | 6.8/10 |
25% | 28.5 years | $66,500 | 3 years | 7.53/10 |
50% | 32 years | $76,000 | 6 years | 8.05/10 |
75% | 37 years | $84,500 | 8.5 years | 8.78/10 |
max | 45 years | $95,000 | 12 years | 9.3/10 |
Key Insights from the Statistics:
- Count Discrepancies: Notice how salary and performance_score have only 18 valid entries (and age and experience_years have 19) instead of 20, indicating missing values
- Age Distribution: Employees range from 26-45 years, with an average of 33 years
- Salary Range: $55,000 to $95,000, with median at $76,000
- Experience: 1-12 years of experience, median is 6 years
- Performance: Scores range from 6.8 to 9.3 out of 10, with most employees performing well (mean: 8.09)
ℹ️ Statistical Terms Explained (note: describe() summarizes only the numeric columns by default, which is why the text columns are absent above; each statistic is also available as a standalone method, shown after this list):
- count: Number of non-missing values
- mean: Average value
- std: Standard deviation (measure of spread)
- min/max: Smallest and largest values
- 25%, 50%, 75%: Quartiles (percentiles showing data distribution)
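For example (the values in the comments come from the describe() output above):

```python
import pandas as pd

df = pd.read_csv('data.csv')

print(df['age'].mean())             # average age (about 33.16)
print(df['salary'].median())        # 50th percentile of salary (76000.0)
print(df['salary'].quantile(0.75))  # 75th percentile of salary (84500.0)
print(df['age'].std())              # standard deviation of age (about 5.68)
```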
🔍 Investigating Missing Values
Identifying Missing Data
nano pandas_lab.py
Let's modify our script to focus on missing values:
import pandas as pd
df = pd.read_csv('data.csv')
#print(df.head())
#print(df.describe())
print(df.isnull().sum())
Running Missing Values Analysis
python pandas_lab.py
Terminal Output:
name 0
age 1
salary 2
department 0
experience_years 1
performance_score 2
city 1
join_date 0
dtype: int64
Understanding Missing Values Output
Column | Missing Values | Impact | Percentage Missing |
---|---|---|---|
name | 0 | No issues - complete data | 0% |
age | 1 | Minor - age analysis affected | 5% |
salary | 2 | Moderate - compensation analysis affected | 10% |
department | 0 | No issues - complete data | 0% |
experience_years | 1 | Minor - experience analysis affected | 5% |
performance_score | 2 | Moderate - performance analysis affected | 10% |
city | 1 | Minor - location analysis affected | 5% |
join_date | 0 | No issues - complete data | 0% |
Understanding the Method:
- `df.isnull()` creates a DataFrame of True/False values (True where data is missing)
- `.sum()` counts the True values in each column
- Result: the number of missing values per column
⚠️ Data Quality Assessment: Our dataset has missing values in 5 out of 8 columns. This is common in real-world data and requires careful handling to avoid biased analysis.
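The "Percentage Missing" column in the table above is easy to compute yourself, because the mean of a True/False column is the fraction of True values (a small sketch):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# isnull() yields booleans; their mean is the fraction of missing values.
print((df.isnull().mean() * 100).round(1))
```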
🧹 Cleaning Data: Removing Missing Values
Creating a Clean Dataset
nano pandas_lab.py
Let's modify our script to remove rows with missing values:
import pandas as pd
df = pd.read_csv('data.csv')
df_cleaned = df.dropna()
print(df_cleaned.isnull().sum())
#print(df.head())
#print(df.describe())
#print(df.isnull().sum())
Viewing the Final Script
cat pandas_lab.py
Terminal Output:
import pandas as pd
df = pd.read_csv('data.csv')
df_cleaned = df.dropna()
print(df_cleaned.isnull().sum())
#print(df.head())
#print(df.describe())
#print(df.isnull().sum())
Running the Cleaned Data Analysis
python pandas_lab.py
Terminal Output:
name 0
age 0
salary 0
department 0
experience_years 0
performance_score 0
city 0
join_date 0
dtype: int64
Understanding Data Cleaning Results
What Happened:
- `df.dropna()` removed any row that contained at least one missing value
- Result: All columns now show 0 missing values
- New DataFrame: `df_cleaned` contains only complete records
Impact Analysis:
Metric | Original Dataset | Cleaned Dataset | Data Loss |
---|---|---|---|
Total Rows | 20 employees | 13 employees | 35% data loss |
Missing Values | 7 total missing values | 0 missing values | 100% of missing values removed |
Data Quality | Incomplete records | Complete records only | Higher quality, smaller sample |
Each of the 7 missing values falls in a different row, so dropna() removes exactly 7 of the 20 rows, leaving 13.
Removed Employees:
- Carol Davis (missing salary)
- Emma Brown (missing performance_score)
- Grace Lee (missing age)
- Jack Taylor (missing experience_years)
- Maya Patel (missing city)
- Quinn Davis (missing salary)
- Tina Lopez (missing performance_score)
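You can verify this list without reading the table by eye: select the rows that contain at least one missing value, since those are exactly the rows dropna() discards (a minimal check):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# Rows with any NaN are the ones dropna() removes.
dropped = df[df.isnull().any(axis=1)]
print(dropped['name'].tolist())
print(f"Rows before: {len(df)}, after dropna(): {len(df.dropna())}")
```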
ℹ️ Alternative Approaches: Instead of dropping rows, we could also (see the sketch after this list):
- Fill missing values with column averages (`df.fillna(df.mean(numeric_only=True))`)
- Fill with specific values (`fillna(0)` or `fillna('Unknown')`)
- Use forward/backward fill methods (`ffill()` / `bfill()`)
- Apply domain-specific logic for missing data
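Here is what those fill strategies might look like column by column (a sketch only; the right strategy depends on what a missing value means in your domain):

```python
import pandas as pd

df = pd.read_csv('data.csv')

# fillna() returns a new Series, so assign the result back.
df['age'] = df['age'].fillna(df['age'].mean())             # numeric: fill with the mean
df['salary'] = df['salary'].fillna(df['salary'].median())  # numeric: fill with the median
df['city'] = df['city'].fillna('Unknown')                  # text: fill with a placeholder

# Forward fill: carry the previous row's value downward.
df['performance_score'] = df['performance_score'].ffill()

print(df.isnull().sum())  # only experience_years still has a missing value
```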
🎯 Best Practices for Pandas Data Analysis
Essential Commands Summary
Command | Purpose | When to Use |
---|---|---|
pd.read_csv() | Load CSV data | First step in any analysis |
df.head() | Preview first 5 rows | Initial data exploration |
df.describe() | Statistical summary | Understanding data distribution |
df.isnull().sum() | Count missing values | Data quality assessment |
df.dropna() | Remove missing values | Data cleaning |
Data Analysis Workflow
import pandas as pd
# Step 1: Load Data
df = pd.read_csv('your_data.csv')
# Step 2: Initial Exploration
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
# Step 3: Data Types and Info
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so no print() wrapper is needed
# Step 4: Statistical Summary
print("\nStatistical Summary:")
print(df.describe())
# Step 5: Missing Values Analysis
print("\nMissing Values:")
print(df.isnull().sum())
# Step 6: Data Cleaning (if needed)
df_cleaned = df.dropna() # or other cleaning methods
# Step 7: Verify Cleaned Data
print("\nCleaned Data Info:")
print(df_cleaned.isnull().sum())
🚨 Common Mistakes to Avoid
1. Not Checking Data Types
# Don't assume salary is numeric
total_salary = df['salary'].sum()  # May concatenate strings or raise an error if salary was read as text
# Check data types first
print(df.dtypes)
# Convert if needed
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')
2. Dropping Too Much Data
# Don't always drop missing values
df_cleaned = df.dropna() # Might lose too much data
# Consider alternatives (fillna returns a copy, so assign the result back)
df['age'] = df['age'].fillna(df['age'].mean())  # Fill with average
df['department'] = df['department'].fillna('Unknown')  # Fill with placeholder
3. Not Saving Your Work
# Save cleaned dataset
df_cleaned.to_csv('cleaned_data.csv', index=False)
# Save analysis results
analysis_summary = df.describe()
analysis_summary.to_csv('analysis_summary.csv')
🎯 Key Takeaways
✅ Remember These Points
- Always Start with Exploration: Use `head()`, `info()`, and `describe()` to understand your data
- Check for Missing Values: Use `isnull().sum()` before any analysis
- Understand Your Data: Look at data types, ranges, and distributions
- Clean Thoughtfully: Don't just drop missing values - consider the impact
- Document Your Process: Comment your code and save intermediate results
- Verify Results: Always check your cleaned data to ensure it makes sense
- Save Your Work: Export cleaned datasets and analysis results
🎉 Congratulations! You've mastered the fundamentals of pandas data analysis. You can now confidently load CSV files, explore data structure, generate statistical summaries, identify missing values, and clean datasets. These skills form the foundation for all data analysis and data science work in Python.
This tutorial demonstrated real terminal commands and pandas operations with detailed explanations of every step and output. Each command was explained to help beginners understand not just what to do, but why and how pandas works for data analysis.