Pandas Data Analysis for Absolute Beginners: Reading CSV Files, Data Exploration & Missing Values

Master pandas fundamentals with step-by-step terminal demonstrations. Learn to read CSV files, explore data structure, generate statistics, and handle missing values with detailed command explanations and output analysis.

19 min read

Data analysis is one of the most powerful applications of Python, and pandas is the go-to library for working with structured data. In this comprehensive tutorial, we'll walk through a real terminal session where we explore a CSV dataset, understand its structure, and handle missing values - all explained step by step for absolute beginners.


🎯 What You'll Learn: In this hands-on tutorial, you'll discover:

  • How to set up your workspace for pandas data analysis
  • Reading CSV files into pandas DataFrames
  • Exploring data structure with head() and describe() methods
  • Understanding and interpreting data statistics
  • Identifying missing values in your dataset
  • Cleaning data by removing rows with missing values
  • Interpreting terminal outputs and understanding what each result means
  • Best practices for initial data exploration

🚀 Setting Up Our Data Analysis Environment

Understanding Our Starting Point

Let's begin by examining our working directory to see what data we have available:

ls

Terminal Output:

data.csv

Command Explanation:

  • ls lists all files and directories in the current location
  • Output Analysis: We can see there's a file called data.csv - this is our dataset that we'll be analyzing
  • File Type: CSV stands for "Comma-Separated Values" - a plain-text format for storing tabular data, one record per line with commas between columns (a sample is shown below)
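Since a CSV file is just plain text, it helps to see what one looks like inside. Based on the columns and rows this tutorial works with, the beginning of data.csv would look roughly like this (an illustrative reconstruction, not output captured from the session):

name,age,salary,department,experience_years,performance_score,city,join_date
Alice Johnson,28,65000,Engineering,3,8.5,New York,2021-03-15
Bob Smith,34,72000,Marketing,7,7.2,San Francisco,2019-01-20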

Prerequisites

Before we dive in, make sure you have:

  • Python 3.x installed on your system
  • Pandas library installed (pip install pandas) - a quick way to verify the installation is shown after this list
  • Basic understanding of Python variables and functions
  • Familiarity with terminal/command line operations
  • A CSV file to work with (or use our example data structure)
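If you're not sure whether pandas is installed, you can check its version from the terminal. This is an optional sanity check, not part of the recorded session:

python -c "import pandas as pd; print(pd.__version__)"

If a version number is printed, you're ready to go; if you see a ModuleNotFoundError instead, run pip install pandas first.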

๐Ÿ“ Creating Our First Pandas Script

Setting Up the Analysis File

nano pandas_lab.py

Command Explanation:

  • nano opens a text editor in the terminal
  • pandas_lab.py is the name of our Python script file
  • The .py extension tells us this is a Python file

Let's create our initial script to read and display the data:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

Breaking Down the Pandas Code

| Code Line | Purpose | Explanation |
| --- | --- | --- |
| import pandas as pd | Import library | Brings pandas functionality into our script; 'pd' is the conventional alias |
| df = pd.read_csv('data.csv') | Read CSV file | Creates a DataFrame (df) from the CSV file |
| print(df.head()) | Display data | Shows the first 5 rows of the dataset |
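One detail worth knowing before we run the script: head() shows the first 5 rows by default, but it accepts an optional row count. For example (not part of the script above):

print(df.head(10))  # show the first 10 rows instead of the default 5
print(df.tail())    # tail() works the same way for the last rows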

Viewing Our Script Content

cat pandas_lab.py

Terminal Output:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

Command Explanation:

  • cat displays the contents of a file in the terminal
  • This confirms our script was saved correctly
  • We can see the exact code that will be executed

๐Ÿ” First Data Exploration: Understanding df.head()

Running Our Initial Script

python pandas_lab.py

Terminal Output:

            name   age   salary   department  experience_years  performance_score           city   join_date
0  Alice Johnson  28.0  65000.0  Engineering               3.0                8.5       New York  2021-03-15
1      Bob Smith  34.0  72000.0    Marketing               7.0                7.2  San Francisco  2019-01-20
2    Carol Davis  29.0      NaN  Engineering               4.0                9.1         Boston  2020-07-10
3   David Wilson  45.0  95000.0      Finance              12.0                6.8        Chicago  2015-05-30
4     Emma Brown  31.0  68000.0    Marketing               5.0                NaN        Seattle  2018-11-12

Understanding the Output Structure

| Column | Data Type | Description | Sample Value |
| --- | --- | --- | --- |
| name | String | Employee full name | Alice Johnson |
| age | Float | Employee age in years | 28.0 |
| salary | Float | Annual salary in dollars | 65000.0 |
| department | String | Work department | Engineering |
| experience_years | Float | Years of work experience | 3.0 |
| performance_score | Float | Employee performance rating | 8.5 |
| city | String | Employee location | New York |
| join_date | String | Date employee joined the company | 2021-03-15 |

โš ๏ธ Important Observation: Notice the NaN values in row 2 (Carol Davis - salary) and row 4 (Emma Brown - performance_score). NaN stands for "Not a Number" and represents missing data.

📊 Viewing the Complete Dataset

Modifying Our Script to See All Data

nano pandas_lab.py

Let's add a line to see the complete dataset:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

print(df)

Checking Our Updated Script

cat pandas_lab.py

Terminal Output:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

print(df)

Running the Updated Script

python pandas_lab.py

Terminal Output (Full Dataset):

            name   age   salary   department  experience_years  performance_score           city   join_date
0  Alice Johnson  28.0  65000.0  Engineering               3.0                8.5       New York  2021-03-15
1      Bob Smith  34.0  72000.0    Marketing               7.0                7.2  San Francisco  2019-01-20
2    Carol Davis  29.0      NaN  Engineering               4.0                9.1         Boston  2020-07-10
3   David Wilson  45.0  95000.0      Finance              12.0                6.8        Chicago  2015-05-30
4     Emma Brown  31.0  68000.0    Marketing               5.0                NaN        Seattle  2018-11-12
              name   age   salary   department  experience_years  performance_score           city   join_date
0    Alice Johnson  28.0  65000.0  Engineering               3.0                8.5       New York  2021-03-15
1        Bob Smith  34.0  72000.0    Marketing               7.0                7.2  San Francisco  2019-01-20
2      Carol Davis  29.0      NaN  Engineering               4.0                9.1         Boston  2020-07-10
3     David Wilson  45.0  95000.0      Finance              12.0                6.8        Chicago  2015-05-30
4       Emma Brown  31.0  68000.0    Marketing               5.0                NaN        Seattle  2018-11-12
5     Frank Miller  38.0  82000.0  Engineering               9.0                8.9         Austin  2017-02-28
6        Grace Lee   NaN  75000.0      Finance               6.0                7.5         Denver  2019-08-14
7     Henry Garcia  27.0  58000.0    Marketing               2.0                8.0          Miami  2022-01-05
8         Ivy Chen  33.0  88000.0  Engineering               8.0                9.3       Portland  2016-09-22
9      Jack Taylor  42.0  91000.0      Finance               NaN                7.9      Nashville  2014-12-01
10   Kate Anderson  26.0  55000.0    Marketing               1.0                6.9        Phoenix  2023-03-18
11  Liam Rodriguez  36.0  79000.0  Engineering               7.0                8.7         Dallas  2018-06-25
12      Maya Patel  30.0  71000.0      Finance               5.0                8.2            NaN  2020-02-14
13        Noah Kim  29.0  66000.0    Marketing               3.0                7.8    Los Angeles  2021-04-30
14    Olivia White  35.0  85000.0  Engineering               9.0                9.0      San Diego  2016-10-12
15   Paul Thompson  41.0  89000.0      Finance              11.0                7.4        Houston  2013-08-20
16     Quinn Davis  28.0      NaN    Marketing               2.0                8.1        Atlanta  2022-05-15
17    Rachel Green  32.0  77000.0  Engineering               6.0                8.8   Philadelphia  2018-12-03
18      Sam Wilson  39.0  83000.0      Finance              10.0                7.6        Detroit  2015-11-28
19      Tina Lopez  27.0  62000.0    Marketing               3.0                NaN      Las Vegas  2021-09-10

Understanding the Complete Dataset

Key Observations:

  • Total Records: 20 employees (indexed 0-19)
  • Duplicate Display: The first five rows appear twice because the script prints both df.head() and the full df
  • Missing Values: In addition to the two NaN values we spotted with head() (Carol Davis's salary and Emma Brown's performance_score), the full view reveals several more:
    • Grace Lee (row 6): Missing age
    • Jack Taylor (row 9): Missing experience_years
    • Maya Patel (row 12): Missing city
    • Quinn Davis (row 16): Missing salary
    • Tina Lopez (row 19): Missing performance_score

Department breakdown:

| Department | Count | Salary Range |
| --- | --- | --- |
| Engineering | 7 employees | $65,000 - $88,000 |
| Marketing | 7 employees | $55,000 - $72,000 |
| Finance | 6 employees | $71,000 - $95,000 |
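Those department figures are easy to double-check with pandas itself. A quick sketch, not part of the original terminal session:

print(df['department'].value_counts())                         # employees per department
print(df.groupby('department')['salary'].agg(['min', 'max']))  # salary range per department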

📈 Statistical Analysis with describe()

Adding Statistical Summary

nano pandas_lab.py

Let's replace the full data display with statistical summary:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

print(df.describe())

Viewing the Updated Script

cat pandas_lab.py

Terminal Output:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

print(df.describe())

Running Statistical Analysis

python pandas_lab.py

Terminal Output:

            name   age   salary   department  experience_years  performance_score           city   join_date
0  Alice Johnson  28.0  65000.0  Engineering               3.0                8.5       New York  2021-03-15
1      Bob Smith  34.0  72000.0    Marketing               7.0                7.2  San Francisco  2019-01-20
2    Carol Davis  29.0      NaN  Engineering               4.0                9.1         Boston  2020-07-10
3   David Wilson  45.0  95000.0      Finance              12.0                6.8        Chicago  2015-05-30
4     Emma Brown  31.0  68000.0    Marketing               5.0                NaN        Seattle  2018-11-12
             age        salary  experience_years  performance_score
count  19.000000     18.000000         19.000000          18.000000
mean   33.157895  75611.111111          5.947368           8.094444
std     5.679037  11783.033110          3.257021           0.770387
min    26.000000  55000.000000          1.000000           6.800000
25%    28.500000  66500.000000          3.000000           7.525000
50%    32.000000  76000.000000          6.000000           8.050000
75%    37.000000  84500.000000          8.500000           8.775000
max    45.000000  95000.000000         12.000000           9.300000

Understanding the Statistical Summary

| Statistic | Age | Salary | Experience Years | Performance Score |
| --- | --- | --- | --- | --- |
| count | 19 | 18 | 19 | 18 |
| mean | 33.16 years | $75,611 | 5.95 years | 8.09/10 |
| std | 5.68 years | $11,783 | 3.26 years | 0.77 points |
| min | 26 years | $55,000 | 1 year | 6.8/10 |
| 25% | 28.5 years | $66,500 | 3 years | 7.53/10 |
| 50% | 32 years | $76,000 | 6 years | 8.05/10 |
| 75% | 37 years | $84,500 | 8.5 years | 8.78/10 |
| max | 45 years | $95,000 | 12 years | 9.3/10 |

Key Insights from the Statistics:

  1. Count Discrepancies: Notice how salary and performance_score have only 18 valid entries instead of 20, indicating missing values
  2. Age Distribution: Employees range from 26-45 years, with an average of 33 years
  3. Salary Range: $55,000 to $95,000, with median at $76,000
  4. Experience: 1-12 years of experience, median is 6 years
  5. Performance: Scores range from 6.8 to 9.3 out of 10, with most employees performing well (mean: 8.09)

โ„น๏ธ Statistical Terms Explained:

  • count: Number of non-missing values
  • mean: Average value
  • std: Standard deviation (measure of spread)
  • min/max: Smallest and largest values
  • 25%, 50%, 75%: Quartiles (percentiles showing data distribution)
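describe() is simply a convenient bundle of statistics you can also compute one at a time, which is a good way to get comfortable with the quartile rows. A couple of equivalent calls (not from the recorded session):

print(df['age'].quantile(0.25))    # the 25% row for age
print(df['salary'].median())       # the 50% row for salary
print(df.describe(include='all'))  # also summarize text columns like name and department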

๐Ÿ” Investigating Missing Values

Identifying Missing Data

nano pandas_lab.py

Let's modify our script to focus on missing values:

import pandas as pd

df = pd.read_csv('data.csv')

#print(df.head())

#print(df.describe())

print(df.isnull().sum())

Running Missing Values Analysis

python pandas_lab.py

Terminal Output:

name                 0
age                  1
salary               2
department           0
experience_years     1
performance_score    2
city                 1
join_date            0
dtype: int64

Understanding Missing Values Output

| Column | Missing Values | Impact | Percentage Missing |
| --- | --- | --- | --- |
| name | 0 | No issues - complete data | 0% |
| age | 1 | Minor - age analysis affected | 5% |
| salary | 2 | Moderate - compensation analysis affected | 10% |
| department | 0 | No issues - complete data | 0% |
| experience_years | 1 | Minor - experience analysis affected | 5% |
| performance_score | 2 | Moderate - performance analysis affected | 10% |
| city | 1 | Minor - location analysis affected | 5% |
| join_date | 0 | No issues - complete data | 0% |

Understanding the Method:

  • df.isnull() creates a DataFrame of True/False values (True where data is missing)
  • .sum() counts the True values in each column
  • Result: Number of missing values per column
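To turn those counts into the percentages shown in the table above, divide by the total number of rows. A small sketch, not part of the original script:

missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct.round(1))  # salary and performance_score come out to 10.0 here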

โš ๏ธ Data Quality Assessment: Our dataset has missing values in 5 out of 8 columns. This is common in real-world data and requires careful handling to avoid biased analysis.

🧹 Cleaning Data: Removing Missing Values

Creating a Clean Dataset

nano pandas_lab.py

Let's modify our script to remove rows with missing values:

import pandas as pd

df = pd.read_csv('data.csv')

df_cleaned = df.dropna()
print(df_cleaned.isnull().sum())

#print(df.head())

#print(df.describe())

#print(df.isnull().sum())

Viewing the Final Script

cat pandas_lab.py

Terminal Output:

import pandas as pd

df = pd.read_csv('data.csv')

df_cleaned = df.dropna()
print(df_cleaned.isnull().sum())

#print(df.head())

#print(df.describe())

#print(df.isnull().sum())

Running the Cleaned Data Analysis

python pandas_lab.py

Terminal Output:

name                 0
age                  0
salary               0
department           0
experience_years     0
performance_score    0
city                 0
join_date            0
dtype: int64

Understanding Data Cleaning Results

What Happened:

  • df.dropna() removed any row that contained at least one missing value
  • Result: All columns now show 0 missing values
  • New DataFrame: df_cleaned contains only complete records

Impact Analysis:

| Metric | Original Dataset | Cleaned Dataset | Data Loss |
| --- | --- | --- | --- |
| Total Rows | 20 employees | 13 employees | 7 rows removed (35%) |
| Missing Values | 7 missing values | 0 missing values | All rows with missing values removed |
| Data Quality | Incomplete records | Complete records only | Higher quality, smaller sample |

Because each of the 7 missing values falls in a different row, dropna() removes exactly 7 rows, leaving 13 complete records - a 35% reduction. You can confirm the counts yourself as shown below.
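A quick check of the before-and-after row counts, and of the exact rows that get dropped (not part of the recorded session):

print(len(df))                      # 20 rows before cleaning
print(len(df_cleaned))              # 13 rows after dropna()
print(df[df.isnull().any(axis=1)])  # the 7 rows that contain at least one NaN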

Removed Employees:

  • Carol Davis (missing salary)
  • Emma Brown (missing performance_score)
  • Grace Lee (missing age)
  • Jack Taylor (missing experience_years)
  • Maya Patel (missing city)
  • Quinn Davis (missing salary)
  • Tina Lopez (missing performance_score)

โ„น๏ธ Alternative Approaches: Instead of dropping rows, we could also:

  • Fill missing values with averages (fillna(df.mean()))
  • Fill with specific values (fillna(0) or fillna('Unknown'))
  • Use forward/backward fill methods
  • Apply domain-specific logic for missing data
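Here is a minimal sketch of what those alternatives could look like for this dataset (column names assumed from the tutorial; adapt them to your own data):

df_filled = df.copy()
df_filled['age'] = df_filled['age'].fillna(df_filled['age'].mean())      # numeric column: fill with the average
df_filled['city'] = df_filled['city'].fillna('Unknown')                  # text column: fill with a placeholder
df_filled['performance_score'] = df_filled['performance_score'].ffill()  # forward fill: reuse the previous row's value
print(df_filled.isnull().sum())  # salary and experience_years would still need a strategy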

🎯 Best Practices for Pandas Data Analysis

Essential Commands Summary

| Command | Purpose | When to Use |
| --- | --- | --- |
| pd.read_csv() | Load CSV data | First step in any analysis |
| df.head() | Preview first 5 rows | Initial data exploration |
| df.describe() | Statistical summary | Understanding data distribution |
| df.isnull().sum() | Count missing values | Data quality assessment |
| df.dropna() | Remove missing values | Data cleaning |

Data Analysis Workflow

import pandas as pd

# Step 1: Load Data
df = pd.read_csv('your_data.csv')

# Step 2: Initial Exploration
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

# Step 3: Data Types and Info
print("\nDataset Info:")
df.info()  # info() prints its report directly, so no print() wrapper is needed

# Step 4: Statistical Summary
print("\nStatistical Summary:")
print(df.describe())

# Step 5: Missing Values Analysis
print("\nMissing Values:")
print(df.isnull().sum())

# Step 6: Data Cleaning (if needed)
df_cleaned = df.dropna()  # or other cleaning methods

# Step 7: Verify Cleaned Data
print("\nCleaned Data Info:")
print(df_cleaned.isnull().sum())

🚨 Common Mistakes to Avoid

1. Not Checking Data Types

# Don't assume salary is numeric
total_salary = df['salary'].sum()  # Might fail if salary is text
# Check data types first
print(df.dtypes)
# Convert if needed
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

2. Dropping Too Much Data

# Don't always drop missing values
df_cleaned = df.dropna()  # Might lose too much data

# Consider alternatives (assign the result back, otherwise nothing changes)
df['age'] = df['age'].fillna(df['age'].mean())          # Fill with the column average
df['department'] = df['department'].fillna('Unknown')   # Fill with a placeholder

3. Not Saving Your Work

# Save cleaned dataset
df_cleaned.to_csv('cleaned_data.csv', index=False)

# Save analysis results
analysis_summary = df.describe()
analysis_summary.to_csv('analysis_summary.csv')

🎯 Key Takeaways

✅ Remember These Points

  1. Always Start with Exploration: Use head(), info(), and describe() to understand your data
  2. Check for Missing Values: Use isnull().sum() before any analysis
  3. Understand Your Data: Look at data types, ranges, and distributions
  4. Clean Thoughtfully: Don't just drop missing values - consider the impact
  5. Document Your Process: Comment your code and save intermediate results
  6. Verify Results: Always check your cleaned data to ensure it makes sense
  7. Save Your Work: Export cleaned datasets and analysis results

🎉 Congratulations! You've mastered the fundamentals of pandas data analysis. You can now confidently load CSV files, explore data structure, generate statistical summaries, identify missing values, and clean datasets. These skills form the foundation for all data analysis and data science work in Python.


This tutorial demonstrated real terminal commands and pandas operations with detailed explanations of every step and output. Each command was explained to help beginners understand not just what to do, but why and how pandas works for data analysis.

Written by Owais

I'm an AIOps Engineer with a passion for AI, Operating Systems, Cloud, and Security, sharing insights that matter in today's tech world.

