03 - Essential Data Libraries: NumPy and Pandas Basics
Complexity: Moderate (M)
3.0 Introduction: Why Data Libraries Matter for Data Engineering
In the previous chapters, we built a solid foundation with Python fundamentals and data handling techniques. Now, we’re ready to take a significant leap forward with specialized libraries that will dramatically increase your data processing capabilities.
Think of NumPy and Pandas as power tools for data work. While base Python is like using hand tools (effective but labor-intensive), these libraries give you industrial-grade machinery for handling large datasets efficiently. Here’s how they fit into the data engineering landscape:
flowchart TD
    A[Raw Data Sources] --> B[Data Extraction]
    B --> C[Data Cleaning & Transformation]
    C --> D[Data Analysis]
    D --> E[Data Loading/Storage]

    subgraph "Base Python"
        B1[File Handling]
        C1[List/Dict Operations]
    end

    subgraph "NumPy"
        C2[Array Operations]
        C3[Mathematical Functions]
        D1[Numerical Analysis]
    end

    subgraph "Pandas"
        C4[DataFrame Manipulation]
        C5[Data Cleaning]
        D2[Aggregation & Grouping]
        D3[Time Series Analysis]
    end

    B ----> B1
    C ----> C1
    C ----> C2
    C ----> C3
    C ----> C4
    C ----> C5
    D ----> D1
    D ----> D2
    D ----> D3

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style B,C,D fill:#bbf,stroke:#336,stroke-width:1px
    style B1,C1 fill:#ffd,stroke:#333,stroke-width:1px
    style C2,C3,D1 fill:#dfd,stroke:#333,stroke-width:1px
    style C4,C5,D2,D3 fill:#ddf,stroke:#333,stroke-width:1px
By the end of this chapter, you’ll understand:
- How NumPy provides efficient numerical operations
- How Pandas simplifies working with tabular data
- When to use each library for different data engineering tasks
- How to combine these libraries with your existing Python skills
Let’s get started by exploring these essential data libraries!
3.1 NumPy Basics
NumPy (Numerical Python) is a fundamental library for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
3.1.1 Why NumPy?
Before we dive into NumPy, let’s understand why it’s so important for data engineering:
- Performance: NumPy operations are implemented in C, making them much faster than equivalent Python code
- Memory Efficiency: NumPy arrays use less memory than Python lists
- Broadcasting: Powerful rules for applying operations across arrays of different shapes
- Vectorization: Operate on entire arrays at once instead of using explicit loops
Let’s see a simple performance comparison:
import numpy as np
import time
# Create a large list and equivalent NumPy array
size = 1000000
python_list = list(range(size))
numpy_array = np.array(range(size))
# Multiply each element by 2
# Python list approach
start_time = time.time()
python_result = [x * 2 for x in python_list]
python_time = time.time() - start_time
print(f"Python list time: {python_time:.6f} seconds")
# NumPy array approach
start_time = time.time()
numpy_result = numpy_array * 2
numpy_time = time.time() - start_time
print(f"NumPy array time: {numpy_time:.6f} seconds")
# Calculate speedup
speedup = python_time / numpy_time
print(f"NumPy is {speedup:.1f}x faster!")
# Verify results are the same
print(f"First 5 elements - Python: {python_result[:5]}")
print(f"First 5 elements - NumPy: {numpy_result[:5]}")
# Python list time: 0.073645 seconds   (example output - exact timings vary by machine)
# NumPy array time: 0.002137 seconds
# NumPy is 34.5x faster!   (your speedup will differ, but NumPy is typically much faster here)
# First 5 elements - Python: [0, 2, 4, 6, 8]
# First 5 elements - NumPy: [0 2 4 6 8]
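The memory-efficiency claim is easy to check as well. The exact figures depend on your platform and Python build, so treat the numbers below as illustrative:
import sys
import numpy as np
size = 1000000
python_list = list(range(size))
numpy_array = np.arange(size)
# sys.getsizeof(list) counts only the list object and its pointers, not the int objects
# it refers to, so the true footprint of the list is even larger than reported here
print(f"List container size: {sys.getsizeof(python_list) / 1e6:.1f} MB (plus the int objects)")
print(f"NumPy array size: {numpy_array.nbytes / 1e6:.1f} MB")
# List container size: 8.0 MB (plus the int objects)   # Approximate - varies by platform
# NumPy array size: 8.0 MB                             # 8 bytes per int64 element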
3.1.2 Creating NumPy Arrays
Let’s explore different ways to create NumPy arrays:
import numpy as np
# From a Python list
list1 = [1, 2, 3, 4, 5]
array1 = np.array(list1)
print("Array from list:", array1)
# Array from list: [1 2 3 4 5]
# Multi-dimensional array from nested lists
list2 = [[1, 2, 3], [4, 5, 6]]
array2 = np.array(list2)
print("2D array from nested lists:")
print(array2)
# 2D array from nested lists:
# [[1 2 3]
# [4 5 6]]
# Array with a specific data type
array_float = np.array([1, 2, 3, 4, 5], dtype=np.float64)
print("Float array:", array_float)
# Float array: [1. 2. 3. 4. 5.]
# Creating arrays with specific values
zeros = np.zeros(5) # Array of 5 zeros
ones = np.ones((2, 3)) # 2x3 array of ones
empty = np.empty(3) # Uninitialized array (values depend on memory state)
print("Zeros array:", zeros)
print("Ones array:")
print(ones)
print("Empty array:", empty) # Values will be arbitrary
# Zeros array: [0. 0. 0. 0. 0.]
# Ones array:
# [[1. 1. 1.]
# [1. 1. 1.]]
# Empty array: [0. 0. 0.] # Values may differ on your system
# Creating sequences
range_array = np.arange(0, 10, 2) # Start, stop, step
linspace = np.linspace(0, 1, 5) # Start, stop, num of points
print("Range array:", range_array)
print("Linspace array:", linspace)
# Range array: [0 2 4 6 8]
# Linspace array: [0. 0.25 0.5 0.75 1. ]
# Random arrays
random_ints = np.random.randint(0, 10, 5) # Random integers between 0 and 9
random_floats = np.random.random(3) # Random floats between 0 and 1
print("Random integers:", random_ints) # Values will vary each run
print("Random floats:", random_floats) # Values will vary each run
# Random integers: [3 4 0 7 5] # Your numbers will differ
# Random floats: [0.37454012 0.95071431 0.73199394] # Your numbers will differ
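Since these values change on every run, seed the random generator whenever you need reproducible "random" data; we'll do exactly that later in this chapter:
np.random.seed(0)                   # Any fixed integer works
print(np.random.randint(0, 10, 5))  # The same 5 integers on every run with this seed
# [5 0 3 3 7]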
3.1.3 Array Attributes and Operations
NumPy arrays have several useful attributes that provide information about their structure:
import numpy as np
# Create a 2D array
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
# Basic attributes
print("Array:")
print(arr)
print("Shape:", arr.shape) # Dimensions (rows, columns)
print("Size:", arr.size) # Total number of elements
print("Dimensions:", arr.ndim) # Number of dimensions
print("Data type:", arr.dtype) # Data type of elements
# Array:
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
# Shape: (3, 4)
# Size: 12
# Dimensions: 2
# Data type: int64
# Accessing elements (similar to lists, but with some differences)
print("\nAccessing elements:")
print("First row:", arr[0])
print("Element at (1,2):", arr[1, 2]) # 2nd row, 3rd column
print("Last row:", arr[-1])
# Accessing elements:
# First row: [1 2 3 4]
# Element at (1,2): 7
# Last row: [ 9 10 11 12]
# Slicing arrays
print("\nArray slicing:")
print("First two rows:")
print(arr[:2])
print("First and third columns:")
print(arr[:, [0, 2]]) # All rows, columns 0 and 2
# Array slicing:
# First two rows:
# [[1 2 3 4]
# [5 6 7 8]]
# First and third columns:
# [[ 1 3]
# [ 5 7]
# [ 9 11]]
# Array operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("\nBasic arithmetic:")
print("a + b =", a + b) # Element-wise addition
print("a * b =", a * b) # Element-wise multiplication
print("a * 2 =", a * 2) # Scalar multiplication
print("a > 1 =", a > 1) # Element-wise comparison
# Basic arithmetic:
# a + b = [5 7 9]
# a * b = [ 4 10 18]
# a * 2 = [2 4 6]
# a > 1 = [False True True]
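# Boolean arrays like the one above can be used directly as masks to filter elements
# (we'll rely on this later in the chapter when flagging outliers)
mask = a > 1
print("Elements of a greater than 1:", a[mask])
# Elements of a greater than 1: [2 3]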
# Aggregation functions
data = np.array([1, 3, 5, 7, 9])
print("\nAggregation functions:")
print("Sum:", data.sum())
print("Mean:", data.mean())
print("Standard deviation:", data.std())
print("Min:", data.min())
print("Max:", data.max())
print("Cumulative sum:", data.cumsum())
# Aggregation functions:
# Sum: 25
# Mean: 5.0
# Standard deviation: 2.8284271247461903
# Min: 1
# Max: 9
# Cumulative sum: [ 1 4 9 16 25]
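For multi-dimensional arrays, these aggregation functions take an axis argument that controls whether you reduce down the columns or across the rows; we'll use this later to compute weekly statistics:
grid = np.array([[1, 2, 3], [4, 5, 6]])
print("Column sums (axis=0):", grid.sum(axis=0))
print("Row sums (axis=1):", grid.sum(axis=1))
# Column sums (axis=0): [5 7 9]
# Row sums (axis=1): [ 6 15]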
3.1.4 Broadcasting and Vectorization
Broadcasting allows NumPy to work with arrays of different shapes when performing operations:
import numpy as np
# Broadcasting example
# Broadcasting allows operations between arrays with different shapes
a = np.array([1, 2, 3]) # Shape: (3,)
b = np.array([[10], [20], [30]]) # Shape: (3, 1)
# Broadcasting matches dimensions to allow this operation
result = a + b # Shape will be (3, 3)
print("Array a (shape {}):".format(a.shape))
print(a)
print("\nArray b (shape {}):".format(b.shape))
print(b)
print("\nResult (shape {}):".format(result.shape))
print(result)
# Array a (shape (3,)):
# [1 2 3]
#
# Array b (shape (3, 1)):
# [[10]
# [20]
# [30]]
#
# Result (shape (3, 3)):
# [[11 12 13]
# [21 22 23]
# [31 32 33]]
# Vectorization example
# Calculate Euclidean distance between two points
def euclidean_distance(p1, p2):
return np.sqrt(np.sum((p1 - p2) ** 2))
point1 = np.array([1, 2, 3])
point2 = np.array([4, 5, 6])
distance = euclidean_distance(point1, point2)
print("\nEuclidean distance:", distance)
# Euclidean distance: 5.196152422706632
# Batch processing - calculate distances between multiple points
points1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
points2 = np.array([[4, 5, 6], [7, 8, 9], [10, 11, 12]])
# This will compute the distances for all rows at once
distances = np.sqrt(np.sum((points1 - points2) ** 2, axis=1))
print("\nMultiple distances:", distances)
# Multiple distances: [5.19615242 5.19615242 5.19615242]
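Broadcasting shows up constantly in data engineering work, for example when centering every column of a 2D array by subtracting the per-column means (a shape (3, 3) array and a shape (3,) array broadcast row by row):
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])
col_means = matrix.mean(axis=0)   # Shape (3,)
centered = matrix - col_means     # (3, 3) minus (3,) broadcasts across the rows
print("\nColumn means:", col_means)
print("Centered matrix:")
print(centered)
# Column means: [4. 5. 6.]
# Centered matrix:
# [[-3. -3. -3.]
#  [ 0.  0.  0.]
#  [ 3.  3.  3.]]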
3.1.5 Reshaping and Transposing
NumPy provides powerful functions for reorganizing array data:
import numpy as np
# Create a 1D array
arr = np.arange(12)
print("Original array:", arr)
# Original array: [ 0 1 2 3 4 5 6 7 8 9 10 11]
# Reshape to 2D (3 rows, 4 columns)
arr_2d = arr.reshape(3, 4)
print("\nReshaped to 3x4:")
print(arr_2d)
# Reshaped to 3x4:
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
# Reshape with automatic dimension calculation (-1 means "figure it out")
arr_2d_auto = arr.reshape(6, -1) # 6 rows, columns calculated automatically
print("\nReshaped to 6x2 (automatic):")
print(arr_2d_auto)
# Reshaped to 6x2 (automatic):
# [[ 0 1]
# [ 2 3]
# [ 4 5]
# [ 6 7]
# [ 8 9]
# [10 11]]
# Transpose (swap rows and columns)
arr_transposed = arr_2d.T
print("\nTransposed array:")
print(arr_transposed)
# Transposed array:
# [[ 0 4 8]
# [ 1 5 9]
# [ 2 6 10]
# [ 3 7 11]]
# Flattening arrays (convert back to 1D)
flattened = arr_2d.flatten() # Creates a copy
print("\nFlattened array:", flattened)
# Flattened array: [ 0 1 2 3 4 5 6 7 8 9 10 11]
# Ravel (returns a view when possible)
raveled = arr_2d.ravel() # May return a view, more efficient
print("Raveled array:", raveled)
# Raveled array: [ 0 1 2 3 4 5 6 7 8 9 10 11]
# Stacking arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Vertical stack
v_stack = np.vstack((a, b))
print("\nVertical stack:")
print(v_stack)
# Vertical stack:
# [[1 2 3]
# [4 5 6]]
# Horizontal stack
h_stack = np.hstack((a, b))
print("Horizontal stack:", h_stack)
# Horizontal stack: [1 2 3 4 5 6]
# Column stack (useful for multiple 1D arrays)
c_stack = np.column_stack((a, b))
print("Column stack:")
print(c_stack)
# Column stack:
# [[1 4]
# [2 5]
# [3 6]]
3.1.6 NumPy for Data Engineering
Let’s look at some common data engineering tasks with NumPy:
import numpy as np
# Generate sample data - daily temperatures for a month (in Celsius)
np.random.seed(42) # For reproducible results
temperatures = np.random.normal(25, 5, 30).round(1) # Mean 25°C, std dev 5°C
print("Temperature dataset (°C):", temperatures)
# Temperature dataset (°C): [26.1 22.7 20.4 29.7 20.4 18.3 24.7 22.9 21.9 26.6 31.4 25.4 25.5 24.7
# 17.3 31.8 28.1 21.9 16.7 23.4 18.2 28. 24.4 28.8 23.7 31.1 25.1 27.9
# 24.9 21.9]
# Data cleaning - identify and handle outliers
# Consider temperatures > 3 standard deviations from mean as outliers
mean_temp = np.mean(temperatures)
std_temp = np.std(temperatures)
outlier_threshold = 3 * std_temp
outliers = np.abs(temperatures - mean_temp) > outlier_threshold
print(f"\nMean: {mean_temp:.1f}°C, Std Dev: {std_temp:.1f}°C")
print(f"Outlier threshold: ±{outlier_threshold:.1f}°C from mean")
print(f"Temperatures identified as outliers: {temperatures[outliers]}")
# Replace outliers with the mean
clean_temps = temperatures.copy()
clean_temps[outliers] = mean_temp
print("Cleaned temperatures:", clean_temps)
# Mean: 24.4°C, Std Dev: 4.1°C
# Outlier threshold: ±12.3°C from mean
# Temperatures identified as outliers: []
# Cleaned temperatures: [26.1 22.7 20.4 29.7 20.4 18.3 24.7 22.9 21.9 26.6 31.4 25.4 25.5 24.7
# 17.3 31.8 28.1 21.9 16.7 23.4 18.2 28. 24.4 28.8 23.7 31.1 25.1 27.9
# 24.9 21.9]
# Data binning - categorize temperatures into bins
bins = np.array([0, 20, 25, 30, 40])
bin_labels = ['Cold', 'Cool', 'Warm', 'Hot']
indices = np.digitize(temperatures, bins) - 1 # -1 because digitize returns bin indices starting at 1
# Get category for each temperature
categories = np.array(bin_labels)[indices]
print("\nTemperature categories:")
for temp, category in zip(temperatures, categories):
print(f"{temp}°C is {category}")
# Count temperatures in each category
unique_categories, counts = np.unique(categories, return_counts=True)
print("\nCategory counts:")
for category, count in zip(unique_categories, counts):
print(f"{category}: {count}")
# Temperature categories:
# 26.1°C is Warm
# 22.7°C is Cool
# 20.4°C is Cool
# 29.7°C is Warm
# ...
#
# Category counts:
# Cold: 3
# Cool: 10
# Hot: 5
# Warm: 12
# Data aggregation by time period - weekly averages
# Reshape to 4 weeks of 7 days (we'll ignore the extra 2 days for simplicity)
weekly_temps = temperatures[:28].reshape(4, 7)
weekly_avgs = np.mean(weekly_temps, axis=1)
weekly_mins = np.min(weekly_temps, axis=1)
weekly_maxs = np.max(weekly_temps, axis=1)
print("\nWeekly statistics:")
for week, (avg, min_temp, max_temp) in enumerate(zip(weekly_avgs, weekly_mins, weekly_maxs), 1):
print(f"Week {week}: Avg = {avg:.1f}°C, Min = {min_temp:.1f}°C, Max = {max_temp:.1f}°C")
# Weekly statistics:
# Week 1: Avg = 23.3°C, Min = 18.3°C, Max = 29.7°C
# Week 2: Avg = 25.6°C, Min = 21.9°C, Max = 31.4°C
# Week 3: Avg = 23.7°C, Min = 16.7°C, Max = 31.8°C
# Week 4: Avg = 25.7°C, Min = 18.2°C, Max = 31.1°C
# Simple anomaly detection - Z-scores
z_scores = (temperatures - mean_temp) / std_temp
anomalies = np.abs(z_scores) > 2 # Flag anything > 2 standard deviations
print("\nAnomaly detection (Z-score > 2):")
anomaly_temps = temperatures[anomalies]
anomaly_scores = z_scores[anomalies]
if len(anomaly_temps) > 0:
for temp, score in zip(anomaly_temps, anomaly_scores):
print(f"Temperature {temp}°C is unusual (Z-score: {score:.2f})")
else:
print("No significant anomalies detected.")
# Anomaly detection (Z-score > 2):
# Temperature 31.4°C is unusual (Z-score: 2.02)
# Temperature 17.3°C is unusual (Z-score: -2.02)
# Temperature 31.8°C is unusual (Z-score: 2.12)
# Temperature 16.7°C is unusual (Z-score: -2.17)
3.2 Pandas Basics
Pandas is a powerful data manipulation library built on top of NumPy. It provides data structures and functions designed to make working with tabular and time series data easy and intuitive.
3.2.1 Why Pandas?
Pandas excels at the following data tasks:
- Working with tabular data - similar to spreadsheets or database tables
- Handling missing data - sophisticated tools for dealing with NULL values
- Data alignment - automatic alignment by row/column labels
- Grouping and aggregation - powerful “split-apply-combine” operations
- Data loading and saving - easy interfaces to CSV, Excel, SQL, and more
Let’s first install pandas if you haven’t already:
# In a real environment, you'd install pandas with:
# pip install pandas
import pandas as pd
import numpy as np
3.2.2 Pandas Core Data Structures: Series and DataFrame
Pandas has two primary data structures:
- Series: A one-dimensional labeled array
- DataFrame: A two-dimensional labeled data structure (like a table)
Let’s explore these structures:
import pandas as pd
import numpy as np
# Creating a Series
# Series is like a NumPy array with labels
s = pd.Series([10, 20, 30, 40])
print("Basic Series:")
print(s)
# Basic Series:
# 0 10
# 1 20
# 2 30
# 3 40
# dtype: int64
# Series with custom index
s_custom = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("\nSeries with custom index:")
print(s_custom)
# Series with custom index:
# a 10
# b 20
# c 30
# d 40
# dtype: int64
# Creating a Series from a dictionary
data_dict = {'a': 100, 'b': 200, 'c': 300}
s_dict = pd.Series(data_dict)
print("\nSeries from dictionary:")
print(s_dict)
# Series from dictionary:
# a 100
# b 200
# c 300
# dtype: int64
# Accessing Series elements
print("\nAccessing elements:")
print("Element 'a':", s_custom['a'])
print("Elements 'a' and 'c':", s_custom[['a', 'c']])
print("First two elements:", s_custom[:2])
# Accessing elements:
# Element 'a': 10
# Elements 'a' and 'c': a 10
# c 30
# dtype: int64
# First two elements: a 10
# b 20
# dtype: int64
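# Label-based alignment: when two Series are combined, pandas matches values by index
# label rather than by position (a small illustration)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print("\nAligned addition (labels present in only one Series become NaN):")
print(s1 + s2)
# Aligned addition (labels present in only one Series become NaN):
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64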
# Creating a DataFrame
# DataFrame is like a table with rows and columns
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 30, 35, 40],
'city': ['New York', 'Boston', 'Chicago', 'Denver']
})
print("\nBasic DataFrame:")
print(df)
# Basic DataFrame:
# name age city
# 0 Alice 25 New York
# 1 Bob 30 Boston
# 2 Charlie 35 Chicago
# 3 David 40 Denver
# DataFrame from a list of dictionaries
people = [
{'name': 'Emma', 'age': 28, 'city': 'Seattle'},
{'name': 'Frank', 'age': 32, 'city': 'Portland'},
{'name': 'Grace', 'age': 24, 'city': 'Austin'}
]
df2 = pd.DataFrame(people)
print("\nDataFrame from list of dictionaries:")
print(df2)
# DataFrame from list of dictionaries:
# name age city
# 0 Emma 28 Seattle
# 1 Frank 32 Portland
# 2 Grace 24 Austin
# DataFrame from arrays
names = ['Hannah', 'Ian', 'Julia']
ages = [29, 31, 27]
cities = ['Miami', 'Phoenix', 'Dallas']
df3 = pd.DataFrame({
'name': names,
'age': ages,
'city': cities
})
print("\nDataFrame from arrays:")
print(df3)
# DataFrame from arrays:
# name age city
# 0 Hannah 29 Miami
# 1 Ian 31 Phoenix
# 2 Julia 27 Dallas
# DataFrame with custom index
df4 = pd.DataFrame({
'name': ['Kevin', 'Linda', 'Mike'],
'age': [33, 26, 41],
'city': ['Atlanta', 'Detroit', 'Houston']
}, index=['K', 'L', 'M'])
print("\nDataFrame with custom index:")
print(df4)
# DataFrame with custom index:
# name age city
# K Kevin 33 Atlanta
# L Linda 26 Detroit
# M Mike 41 Houston
3.2.3 DataFrame Operations
Now let’s explore some common operations with DataFrames:
import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'age': [25, 30, 35, 40, 45],
'department': ['IT', 'HR', 'Finance', 'IT', 'HR'],
'salary': [60000, 55000, 75000, 80000, 70000],
'experience': [1, 3, 5, 7, 10]
}
df = pd.DataFrame(data)
print("Sample employee DataFrame:")
print(df)
# Sample employee DataFrame:
# name age department salary experience
# 0 Alice 25 IT 60000 1
# 1 Bob 30 HR 55000 3
# 2 Charlie 35 Finance 75000 5
# 3 David 40 IT 80000 7
# 4 Emma 45 HR 70000 10
# Basic DataFrame information
print("\nBasic DataFrame information:")
print(f"Shape: {df.shape}") # (rows, columns)
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index.tolist()}")
print("\nDataFrame types:")
print(df.dtypes)
# Basic DataFrame information:
# Shape: (5, 5)
# Columns: ['name', 'age', 'department', 'salary', 'experience']
# Index: [0, 1, 2, 3, 4]
#
# DataFrame types:
# name object
# age int64
# department object
# salary int64
# experience int64
# dtype: object
# Accessing data in a DataFrame
print("\nAccessing data in a DataFrame:")
print("First 2 rows:")
print(df.head(2))
print("\nLast 2 rows:")
print(df.tail(2))
# Accessing data in a DataFrame:
# First 2 rows:
# name age department salary experience
# 0 Alice 25 IT 60000 1
# 1 Bob 30 HR 55000 3
#
# Last 2 rows:
# name age department salary experience
# 3 David 40 IT 80000 7
# 4 Emma 45 HR 70000 10
# Accessing columns
print("\nAccessing columns:")
# As attribute (works for simple column names)
print("Ages:\n", df.age)
# As dictionary key (works for all column names)
print("\nDepartments:\n", df['department'])
# Multiple columns
print("\nNames and salaries:\n", df[['name', 'salary']])
# Accessing columns:
# Ages:
# 0 25
# 1 30
# 2 35
# 3 40
# 4 45
# Name: age, dtype: int64
#
# Departments:
# 0 IT
# 1 HR
# 2 Finance
# 3 IT
# 4 HR
# Name: department, dtype: object
#
# Names and salaries:
# name salary
# 0 Alice 60000
# 1 Bob 55000
# 2 Charlie 75000
# 3 David 80000
# 4 Emma 70000
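# Note what each form returns: single brackets give a Series, double brackets give a
# one-column DataFrame - a distinction that matters when chaining further operations
print("\nType of df['age']:", type(df['age']))
print("Type of df[['age']]:", type(df[['age']]))
# Type of df['age']: <class 'pandas.core.series.Series'>
# Type of df[['age']]: <class 'pandas.core.frame.DataFrame'>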
# Accessing rows
print("\nAccessing rows:")
# By position with iloc
print("Second row (position 1):")
print(df.iloc[1])
# By label with loc (for default index, label = position)
print("\nRow with index 3:")
print(df.loc[3])
# Accessing rows:
# Second row (position 1):
# name Bob
# age 30
# department HR
# salary 55000
# experience 3
# Name: 1, dtype: object
#
# Row with index 3:
# name David
# age 40
# department IT
# salary 80000
# experience 7
# Name: 3, dtype: object
# Filtering data
print("\nFiltering data:")
# Single condition
it_employees = df[df['department'] == 'IT']
print("IT employees:")
print(it_employees)
# Multiple conditions
experienced_hr = df[(df['department'] == 'HR') & (df['experience'] > 5)]
print("\nExperienced HR employees:")
print(experienced_hr)
# Filtering data:
# IT employees:
# name age department salary experience
# 0 Alice 25 IT 60000 1
# 3 David 40 IT 80000 7
#
# Experienced HR employees:
# name age department salary experience
# 4 Emma 45 HR 70000 10
# Adding and modifying columns
print("\nAdding and modifying columns:")
# Add a new column
df['bonus'] = df['salary'] * 0.1
print("With bonus column:")
print(df)
# Modify an existing column
df['salary'] = df['salary'] * 1.05 # 5% raise
print("\nAfter salary increase:")
print(df[['name', 'salary', 'bonus']])
# Adding and modifying columns:
# With bonus column:
# name age department salary experience bonus
# 0 Alice 25 IT 60000 1 6000.0
# 1 Bob 30 HR 55000 3 5500.0
# 2 Charlie 35 Finance 75000 5 7500.0
# 3 David 40 IT 80000 7 8000.0
# 4 Emma 45 HR 70000 10 7000.0
#
# After salary increase:
# name salary bonus
# 0 Alice 63000.0 6000.0
# 1 Bob 57750.0 5500.0
# 2 Charlie 78750.0 7500.0
# 3 David 84000.0 8000.0
# 4 Emma 73500.0 7000.0
# Add a column based on a function
df['salary_per_year_experience'] = df.apply(
lambda row: row['salary'] / max(1, row['experience']),
axis=1
)
print("\nSalary per year of experience:")
print(df[['name', 'salary', 'experience', 'salary_per_year_experience']])
# Salary per year of experience:
# name salary experience salary_per_year_experience
# 0 Alice 63000.0 1 63000.00
# 1 Bob 57750.0 3 19250.00
# 2 Charlie 78750.0 5 15750.00
# 3 David 84000.0 7 12000.00
# 4 Emma 73500.0 10 7350.00
# Basic statistics
print("\nBasic statistics for numeric columns:")
print(df.describe())
# Basic statistics for numeric columns:
# age salary experience bonus salary_per_year_experience
# count 5.000000 5.000000 5.000000 5.000000 5.000000
# mean 35.000000 71400.000000 5.200000 6800.000000 23470.000000
# std 8.660254 11304.060750 3.563706 1099.090974 21765.376109
# min 25.000000 57750.000000 1.000000 5500.000000 7350.000000
# 25% 30.000000 63000.000000 3.000000 6000.000000 12000.000000
# 50% 35.000000 73500.000000 5.000000 7000.000000 15750.000000
# 75% 40.000000 78750.000000 7.000000 7500.000000 19250.000000
# max 45.000000 84000.000000 10.000000 8000.000000 63000.000000
# Handling missing values
print("\nHandling missing values:")
# Create a DataFrame with some missing values
df_missing = pd.DataFrame({
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, np.nan]
})
print("DataFrame with missing values:")
print(df_missing)
# Handling missing values:
# DataFrame with missing values:
# A B C
# 0 1.0 NaN 1.0
# 1 2.0 2.0 2.0
# 2 NaN 3.0 3.0
# 3 4.0 4.0 NaN
# 4 5.0 5.0 NaN
# Check for missing values
print("\nMissing value count by column:")
print(df_missing.isna().sum())
# Missing value count by column:
# A 1
# B 1
# C 2
# dtype: int64
# Fill missing values
df_filled = df_missing.fillna(0) # Fill with zero
print("\nFilled with zeros:")
print(df_filled)
# Filled with zeros:
# A B C
# 0 1.0 0.0 1.0
# 1 2.0 2.0 2.0
# 2 0.0 3.0 3.0
# 3 4.0 4.0 0.0
# 4 5.0 5.0 0.0
# Fill with column means
df_mean_filled = df_missing.fillna(df_missing.mean())
print("\nFilled with column means:")
print(df_mean_filled)
# Filled with column means:
# A B C
# 0 1.0 3.5 1.0
# 1 2.0 2.0 2.0
# 2 3.0 3.0 3.0
# 3 4.0 4.0 2.0
# 4 5.0 5.0 2.0
# Drop rows with any missing values
df_dropped = df_missing.dropna()
print("\nAfter dropping rows with missing values:")
print(df_dropped)
# After dropping rows with missing values:
# A B C
# 1 2.0 2.0 2.0
# Drop columns with more than 1 missing value
df_dropped_cols = df_missing.dropna(axis=1, thresh=4)
print("\nAfter dropping columns with > 1 missing value:")
print(df_dropped_cols)
# After dropping columns with > 1 missing value:
# A B
# 0 1.0 NaN
# 1 2.0 2.0
# 2 NaN 3.0
# 3 4.0 4.0
# 4 5.0 5.0
3.2.4 Pandas Data Aggregation and Grouping
One of Pandas’ most powerful features is the ability to group data and perform aggregate operations:
import pandas as pd
import numpy as np
# Create a sample sales DataFrame
# Note: no random seed is set here, so the quantities, prices, and revenues in the
# sample outputs below are illustrative and will differ on every run
sales_data = {
'date': pd.date_range(start='2023-01-01', periods=20, freq='D'),
'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C',
'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
'region': ['East', 'West', 'West', 'East', 'East', 'West', 'West', 'East', 'East', 'West',
'East', 'West', 'West', 'East', 'East', 'West', 'West', 'East', 'East', 'West'],
'quantity': np.random.randint(1, 50, size=20),
'price': np.random.uniform(10, 100, size=20).round(2)
}
df_sales = pd.DataFrame(sales_data)
df_sales['revenue'] = df_sales['quantity'] * df_sales['price']
print("Sample sales data:")
print(df_sales.head())
# Sample sales data:
# date product region quantity price revenue
# 0 2023-01-01 A East 12 47.13 565.560000
# 1 2023-01-02 B West 26 39.50 1027.000000
# 2 2023-01-03 A West 34 31.31 1064.540000
# 3 2023-01-04 C East 30 24.60 738.000000
# 4 2023-01-05 B East 48 95.74 4595.520000
# Basic aggregation
print("\nSummary statistics for the entire dataset:")
print(df_sales[['quantity', 'price', 'revenue']].describe())
# Summary statistics for the entire dataset:
# quantity price revenue
# count 20.000000 20.000000 20.000000
# mean 25.400000 56.046500 1453.522900
# std 13.674355 26.347142 1059.626407
# min 1.000000 15.350000 15.350000
# 25% 15.750000 33.535000 559.845000
# 50% 25.500000 54.285000 1281.610000
# 75% 35.250000 76.702500 2056.845000
# max 49.000000 99.760000 3940.520000
# Grouping by a single column
print("\nSales by product:")
product_groups = df_sales.groupby('product')
product_summary = product_groups['revenue'].agg(['count', 'sum', 'mean', 'min', 'max'])
print(product_summary)
# Sales by product:
# count sum mean min max
# product
# A 7 5946.84500 849.549286 32.87000 2037.7700
# B 7 12478.04000 1782.577143 15.35000 4595.5200
# C 6 10645.59300 1774.265500 66.43000 3915.4800
# Grouping by multiple columns
print("\nSales by product and region:")
product_region_groups = df_sales.groupby(['product', 'region'])
product_region_summary = product_region_groups['revenue'].sum().reset_index()
print(product_region_summary)
# Sales by product and region:
# product region revenue
# 0 A East 3876.53500
# 1 A West 2070.31000
# 2 B East 7647.65000
# 3 B West 4830.39000
# 4 C East 4755.33000
# 5 C West 5890.26300
# Applying multiple aggregation functions
print("\nDetailed sales by product:")
product_detail = df_sales.groupby('product').agg({
'quantity': ['sum', 'mean'],
'price': ['mean', 'min', 'max'],
'revenue': ['sum', 'mean']
})
print(product_detail)
# Detailed sales by product:
# quantity price revenue
# sum mean mean min max sum mean
# product
# A 157 22.42857 37.890000 32.87 67.93 5946.84500 849.5493
# B 177 25.28571 70.315714 15.35 99.76 12478.04000 1782.5771
# C 164 27.33333 58.617833 24.60 97.93 10645.59300 1774.2655
# Pivot tables - reshape data for analysis
print("\nPivot table - Total revenue by region and product:")
pivot_revenue = pd.pivot_table(
df_sales,
values='revenue',
index='region',
columns='product',
aggfunc='sum'
)
print(pivot_revenue)
# Pivot table - Total revenue by region and product:
# product A B C
# region
# East 3876.535000 7647.65000 4755.33000
# West 2070.310000 4830.39000 5890.26300
# Time-based grouping
print("\nSales by week:")
df_sales['week'] = df_sales['date'].dt.isocalendar().week
weekly_sales = df_sales.groupby('week')['revenue'].sum()
print(weekly_sales)
# Sales by week:
# week
# 1 10026.45300
# 2 11173.75500
# 3 7870.24000
# dtype: float64
# Groupby with transformation
# transform() returns a result aligned with the original rows (same length and order).
# Here each sale's revenue is expressed as a percent of the overall total; the groupby isn't
# strictly required for this calculation, but it shows how transform preserves row alignment
total_revenue = df_sales['revenue'].sum()
df_sales['percent_of_total'] = df_sales.groupby('product')['revenue'].transform(
lambda x: x / total_revenue * 100
)
print("\nSales with percent of total revenue:")
print(df_sales[['date', 'product', 'revenue', 'percent_of_total']].head(10))
# Sales with percent of total revenue:
# date product revenue percent_of_total
# 0 2023-01-01 A 565.560000 1.950693
# 1 2023-01-02 B 1027.000000 3.542367
# 2 2023-01-03 A 1064.540000 3.671474
# 3 2023-01-04 C 738.000000 2.545329
# 4 2023-01-05 B 4595.520000 15.854352
# 5 2023-01-06 A 146.930000 0.506774
# 6 2023-01-07 C 507.260000 1.749719
# 7 2023-01-08 B 366.600000 1.264507
# 8 2023-01-09 A 2037.770000 7.029176
# 9 2023-01-10 C 664.125000 2.291155
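If instead you want each sale's share of its own product's revenue, a more typical use of transform, divide by the group's sum inside the lambda. A quick sketch (output omitted because the randomly generated data above changes between runs):
df_sales['pct_of_product'] = df_sales.groupby('product')['revenue'].transform(
    lambda x: x / x.sum() * 100
)
print(df_sales[['product', 'revenue', 'pct_of_product']].head())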
3.2.5 Reading and Writing Data with Pandas
Pandas makes it easy to read and write data in various formats:
import pandas as pd
import numpy as np
import os
# Create a directory for our sample data files if it doesn't exist
if not os.path.exists('data'):
os.makedirs('data')
# Create sample data
df = pd.DataFrame({
'id': range(1, 6),
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'age': [25, 30, 35, 40, 45],
'city': ['New York', 'Boston', 'Chicago', 'Denver', 'Seattle']
})
print("Sample data for I/O operations:")
print(df)
# Sample data for I/O operations:
# id name age city
# 0 1 Alice 25 New York
# 1 2 Bob 30 Boston
# 2 3 Charlie 35 Chicago
# 3 4 David 40 Denver
# 4 5 Emma 45 Seattle
# CSV =====================================================
# Writing to CSV
csv_path = 'data/sample.csv'
df.to_csv(csv_path, index=False)
print(f"\nData written to {csv_path}")
# Reading from CSV
df_from_csv = pd.read_csv(csv_path)
print("Data read from CSV:")
print(df_from_csv.head())
# Data written to data/sample.csv
# Data read from CSV:
# id name age city
# 0 1 Alice 25 New York
# 1 2 Bob 30 Boston
# 2 3 Charlie 35 Chicago
# 3 4 David 40 Denver
# 4 5 Emma 45 Seattle
# Excel ===================================================
# Writing to Excel (would require openpyxl or xlsxwriter in a real environment)
# For this example, we'll just show the code
excel_path = 'data/sample.xlsx'
# df.to_excel(excel_path, index=False, sheet_name='People')
print(f"\nCode to write to Excel: df.to_excel('{excel_path}', index=False, sheet_name='People')")
# Reading from Excel would be:
# df_from_excel = pd.read_excel(excel_path, sheet_name='People')
print("Code to read from Excel: pd.read_excel(excel_path, sheet_name='People')")
# JSON ===================================================
# Writing to JSON
json_path = 'data/sample.json'
df.to_json(json_path, orient='records')
print(f"\nData written to {json_path}")
# Reading from JSON
df_from_json = pd.read_json(json_path)
print("Data read from JSON:")
print(df_from_json.head())
# Data written to data/sample.json
# Data read from JSON:
# id name age city
# 0 1 Alice 25 New York
# 1 2 Bob 30 Boston
# 2 3 Charlie 35 Chicago
# 3 4 David 40 Denver
# 4 5 Emma 45 Seattle
# SQL ====================================================
# For demonstration only - in a real environment you'd need a database
# Here's how to read/write from a SQLite database:
# Writing to a database
import sqlite3
# Create a SQLite database in memory for demo
conn = sqlite3.connect(':memory:')
df.to_sql('people', conn, index=False, if_exists='replace')
print("\nData written to SQLite database")
# Reading from a database
query = "SELECT * FROM people WHERE age > 30"
df_from_sql = pd.read_sql(query, conn)
print(f"Data from SQL query '{query}':")
print(df_from_sql)
# Data written to SQLite database
# Data from SQL query 'SELECT * FROM people WHERE age > 30':
# id name age city
# 0 3 Charlie 35 Chicago
# 1 4 David 40 Denver
# 2 5 Emma 45 Seattle
# Close the database connection
conn.close()
# Handling a large CSV file efficiently ==================
# For large files, you would use the chunksize parameter to process in batches
# Simulating this with our small file
print("\nProcessing a CSV file in chunks:")
total_age = 0
count = 0
for chunk in pd.read_csv(csv_path, chunksize=2): # Process 2 rows at a time
print(f"Processing chunk of size {len(chunk)}:")
print(chunk)
total_age += chunk['age'].sum()
count += len(chunk)
avg_age = total_age / count if count > 0 else 0
print(f"Average age calculated from chunks: {avg_age}")
# Processing a CSV file in chunks:
# Processing chunk of size 2:
# id name age city
# 0 1 Alice 25 New York
# 1 2 Bob 30 Boston
# Processing chunk of size 2:
# id name age city
# 0 3 Charlie 35 Chicago
# 1 4 David 40 Denver
# Processing chunk of size 1:
# id name age city
# 0 5 Emma 45 Seattle
# Average age calculated from chunks: 35.0
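Beyond CSV, Excel, JSON, and SQL, Pandas also supports columnar formats such as Parquet, which are very common in data engineering pipelines. Reading and writing Parquet requires an extra engine (pyarrow or fastparquet), so the calls are shown commented out, in the same spirit as the Excel example above:
# Parquet =================================================
# parquet_path = 'data/sample.parquet'
# df.to_parquet(parquet_path, index=False)         # Requires pyarrow or fastparquet
# df_from_parquet = pd.read_parquet(parquet_path)
print("\nCode to write Parquet: df.to_parquet('data/sample.parquet', index=False)")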
3.2.6 Pandas Data Cleaning and Transformation
Data engineers spend a considerable amount of time preparing and cleaning data. Pandas offers several functions for these tasks:
import pandas as pd
import numpy as np
# Create sample data with issues to clean
messy_data = {
'id': ['001', '002', 'N/A', '004', '005'],
'name': ['Alice', 'bob', None, 'David ', 'emma watson'],
'age': ['25', '30.5', 'unknown', '40', '45.0'],
'joined_date': ['2021-01-15', 'Jan 20, 2021', '', '2021/04/10', '2021-05-22'],
'salary': ['60,000', '$55000', None, '80000.00', 'seventy thousand']
}
df_messy = pd.DataFrame(messy_data)
print("Original messy data:")
print(df_messy)
# Original messy data:
# id name age joined_date salary
# 0 001 Alice 25 2021-01-15 60,000
# 1 002 bob 30.5 Jan 20, 2021 $55000
# 2 N/A None unknown None
# 3 004 David 40 2021/04/10 80000.00
# 4 005 emma watson 45.0 2021-05-22 seventy thousand
# 1. Handling missing values
print("\n1. Handling missing values:")
# Check missing values
print("Missing value count by column:")
print(df_messy.isna().sum())
# Replace N/A with actual NaN
df_clean = df_messy.replace('N/A', np.nan)
# Replace empty strings with NaN
df_clean = df_clean.replace('', np.nan)
print("\nAfter standardizing missing values:")
print(df_clean)
# 1. Handling missing values:
# Missing value count by column:
# id 0
# name 1
# age 0
# joined_date 0
# salary 1
# dtype: int64
# (note: the text 'N/A' and the empty string '' are not counted as missing
#  until we standardize them to NaN below)
#
# After standardizing missing values:
# id name age joined_date salary
# 0 001 Alice 25 2021-01-15 60,000
# 1 002 bob 30.5 Jan 20, 2021 $55000
# 2 NaN None unknown NaN NaN
# 3 004 David 40.0 2021/04/10 80000.00
# 4 005 emma watson 45.0 2021-05-22 seventy thousand
# 2. Data type conversion
print("\n2. Data type conversion:")
# Convert ID to numeric (the NaN coerced from 'N/A' forces a float64 dtype)
df_clean['id'] = pd.to_numeric(df_clean['id'], errors='coerce')
# Convert age to float
df_clean['age'] = pd.to_numeric(df_clean['age'], errors='coerce')
print("After type conversion:")
print(df_clean.dtypes)
print("\nAge after conversion:")
print(df_clean['age'])
# 2. Data type conversion:
# After type conversion:
# id float64
# name object
# age float64
# joined_date object
# salary object
# dtype: object
#
# Age after conversion:
# 0 25.0
# 1 30.5
# 2 NaN
# 3 40.0
# 4 45.0
# Name: age, dtype: float64
# 3. Date standardization
print("\n3. Date standardization:")
# Convert various date formats to datetime
# (with errors='coerce', unparseable values become NaT; in pandas 2.x you may need
#  pd.to_datetime(..., format='mixed') to accept several formats in one column)
df_clean['joined_date'] = pd.to_datetime(df_clean['joined_date'], errors='coerce')
print("Dates after conversion:")
print(df_clean['joined_date'])
# 3. Date standardization:
# Dates after conversion:
# 0 2021-01-15
# 1 2021-01-20
# 2 NaT
# 3 2021-04-10
# 4 2021-05-22
# Name: joined_date, dtype: datetime64[ns]
# 4. Text cleaning and standardization
print("\n4. Text cleaning and standardization:")
# Fix whitespace and capitalization in names
df_clean['name'] = df_clean['name'].str.strip().str.title()
print("Names after cleaning:")
print(df_clean['name'])
# 4. Text cleaning and standardization:
# Names after cleaning:
# 0 Alice
# 1 Bob
# 2 NaN
# 3 David
# 4 Emma Watson
# Name: name, dtype: object
# 5. Cleaning salary data
print("\n5. Cleaning salary data:")
# Function to clean salary values
def clean_salary(value):
if pd.isna(value):
return np.nan
# Convert to string (in case it's already a number)
value = str(value)
# Remove non-numeric characters except decimal point
value = ''.join(c for c in value if c.isdigit() or c == '.')
try:
return float(value)
except ValueError:
return np.nan
# Apply the function to the salary column
df_clean['salary'] = df_clean['salary'].apply(clean_salary)
print("Salaries after cleaning:")
print(df_clean['salary'])
# 5. Cleaning salary data:
# Salaries after cleaning:
# 0 60000.0
# 1 55000.0
# 2 NaN
# 3 80000.0
# 4 NaN
# Name: salary, dtype: float64
# 6. Dropping rows with too many missing values
print("\n6. Handling rows with missing values:")
# Calculate how many non-null values each row has
valid_counts = df_clean.count(axis=1)
print("Valid value counts by row:")
print(valid_counts)
# Keep rows with at least 3 valid values
df_clean = df_clean[valid_counts >= 3]
print("\nAfter removing rows with too many missing values:")
print(df_clean)
# 6. Handling rows with missing values:
# Valid value counts by row:
# 0 5
# 1 5
# 2 0
# 3 5
# 4 4
# dtype: int64
#
# After removing rows with too many missing values:
# id name age joined_date salary
# 0 1.0 Alice 25.0 2021-01-15 60000.0
# 1 2.0 Bob 30.5 2021-01-20 55000.0
# 3 4.0 David 40.0 2021-04-10 80000.0
# 4 5.0 Emma Watson 45.0 2021-05-22 NaN
# 7. Fill in remaining missing values
print("\n7. Filling remaining missing values:")
# For salary, use the median of available values
median_salary = df_clean['salary'].median()
df_clean['salary'] = df_clean['salary'].fillna(median_salary)
print("After filling missing values:")
print(df_clean)
# 7. Filling remaining missing values:
# After filling missing values:
# id name age joined_date salary
# 0 1.0 Alice 25.0 2021-01-15 60000.0
# 1 2.0 Bob 30.5 2021-01-20 55000.0
# 3 4.0 David 40.0 2021-04-10 80000.0
# 4 5.0 Emma Watson 45.0 2021-05-22 60000.0
# 8. Create some derived features
print("\n8. Creating derived features:")
# Add a column for joining year
df_clean['join_year'] = df_clean['joined_date'].dt.year
# Add a column for whether the employee has a high salary
df_clean['high_salary'] = df_clean['salary'] > 60000
# Add a column for employment duration (days from join date to today)
today = pd.Timestamp('2023-04-15') # Example "today" date
df_clean['days_employed'] = (today - df_clean['joined_date']).dt.days
print("Final cleaned data with derived features:")
print(df_clean)
# 8. Creating derived features:
# Final cleaned data with derived features:
# id name age joined_date salary join_year high_salary days_employed
# 0 1.0 Alice 25.0 2021-01-15 60000.0 2021 False 820
# 1 2.0 Bob 30.5 2021-01-20 55000.0 2021 False 815
# 3 4.0 David 40.0 2021-04-10 80000.0 2021 True 735
# 4 5.0 Emma Watson 45.0 2021-05-22 60000.0 2021 False 693
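One cleaning step not shown above but needed constantly (the micro-project's sample data generator deliberately introduces duplicates, for example) is removing duplicate rows. A minimal sketch:
df_dupes = pd.DataFrame({
    'order_id': [1, 2, 2, 3],
    'product': ['Mouse', 'Laptop', 'Laptop', 'Keyboard'],
    'quantity': [1, 2, 2, 5]
})
# Drop rows that are exact duplicates across all columns
print(df_dupes.drop_duplicates())
# Or de-duplicate on specific key columns, keeping the first occurrence
print(df_dupes.drop_duplicates(subset=['order_id'], keep='first'))
# Both calls drop row 2 here; with subset= you control which columns define a "duplicate"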
3.3 Integration with Previous Python Knowledge
Both NumPy and Pandas can be seamlessly integrated with the Python fundamentals we learned in previous chapters. Let’s look at a few examples:
import pandas as pd
import numpy as np
# Create sample data
sales_data = [
{'date': '2023-01-01', 'product': 'A', 'quantity': 10, 'price': 50},
{'date': '2023-01-02', 'product': 'B', 'quantity': 5, 'price': 100},
{'date': '2023-01-03', 'product': 'A', 'quantity': 15, 'price': 50},
{'date': '2023-01-03', 'product': 'C', 'quantity': 8, 'price': 75},
{'date': '2023-01-04', 'product': 'B', 'quantity': 12, 'price': 100},
{'date': '2023-01-05', 'product': 'A', 'quantity': 20, 'price': 45},
]
# 1. Using list comprehensions with Pandas
# We learned list comprehensions in Chapter 2
print("1. Using list comprehensions with Pandas:")
# Convert the list of dictionaries to a DataFrame
df_sales = pd.DataFrame(sales_data)
print("Original DataFrame:")
print(df_sales)
# Calculate revenue for each record using list comprehension
# This doesn't modify the DataFrame
revenues = [row['quantity'] * row['price'] for _, row in df_sales.iterrows()]
print("\nRevenues calculated with list comprehension:")
print(revenues)
# Better approach: use Pandas operations instead
df_sales['revenue'] = df_sales['quantity'] * df_sales['price']
print("\nDataFrame with revenue column:")
print(df_sales)
# 1. Using list comprehensions with Pandas:
# Original DataFrame:
# date product quantity price
# 0 2023-01-01 A 10 50
# 1 2023-01-02 B 5 100
# 2 2023-01-03 A 15 50
# 3 2023-01-03 C 8 75
# 4 2023-01-04 B 12 100
# 5 2023-01-05 A 20 45
#
# Revenues calculated with list comprehension:
# [500, 500, 750, 600, 1200, 900]
#
# DataFrame with revenue column:
# date product quantity price revenue
# 0 2023-01-01 A 10 50 500
# 1 2023-01-02 B 5 100 500
# 2 2023-01-03 A 15 50 750
# 3 2023-01-03 C 8 75 600
# 4 2023-01-04 B 12 100 1200
# 5 2023-01-05 A 20 45 900
# 2. Using functions with NumPy and Pandas
# We learned functions in Chapter 1
print("\n2. Using functions with NumPy and Pandas:")
def calculate_discount(price, quantity, discount_rate=0.1, min_quantity=10):
"""
Calculate discount based on quantity purchased.
Args:
price: Unit price
quantity: Quantity purchased
discount_rate: Discount percentage (default 10%)
min_quantity: Minimum quantity to qualify for discount
Returns:
Discount amount
"""
if quantity >= min_quantity:
return price * quantity * discount_rate
else:
return 0
# Apply the function to our DataFrame using apply
df_sales['discount'] = df_sales.apply(
lambda row: calculate_discount(row['price'], row['quantity']),
axis=1
)
# Calculate the final price
df_sales['final_price'] = df_sales['revenue'] - df_sales['discount']
print("DataFrame with discounts applied:")
print(df_sales)
# 2. Using functions with NumPy and Pandas:
# DataFrame with discounts applied:
# date product quantity price revenue discount final_price
# 0 2023-01-01 A 10 50 500 50.0 450.0
# 1 2023-01-02 B 5 100 500 0.0 500.0
# 2 2023-01-03 A 15 50 750 75.0 675.0
# 3 2023-01-03 C 8 75 600 0.0 600.0
# 4 2023-01-04 B 12 100 1200 120.0 1080.0
# 5 2023-01-05 A 20 45 900 90.0 810.0
# 3. Using error handling with Pandas
# We learned error handling in Chapter 2
print("\n3. Using error handling with Pandas:")
# Create a function to parse dates with error handling
def parse_date(date_str):
"""Parse a date string, handling various formats."""
try:
return pd.to_datetime(date_str)
except (ValueError, TypeError):
print(f"Warning: Could not parse date '{date_str}'. Using NaT.")
return pd.NaT
# Create some data with problematic dates
data_with_bad_dates = [
{'id': 1, 'date': '2023-01-15'}, # Good format
{'id': 2, 'date': '01/20/2023'}, # Different but valid format
{'id': 3, 'date': 'January 25, 2023'}, # Another valid format
{'id': 4, 'date': 'not a date'}, # Invalid format
{'id': 5, 'date': None} # Missing date
]
# Convert to DataFrame
df_dates = pd.DataFrame(data_with_bad_dates)
print("DataFrame with mixed date formats:")
print(df_dates)
# Apply our function with error handling to convert dates
df_dates['parsed_date'] = df_dates['date'].apply(parse_date)
print("\nAfter date parsing with error handling:")
print(df_dates)
# 3. Using error handling with Pandas:
# DataFrame with mixed date formats:
# id date
# 0 1 2023-01-15
# 1 2 01/20/2023
# 2 3 January 25, 2023
# 3 4 not a date
# 4 5 None
# Warning: Could not parse date 'not a date'. Using NaT.
# (the missing value None is converted to NaT by pd.to_datetime without raising,
#  so no warning is printed for it)
#
# After date parsing with error handling:
# id date parsed_date
# 0 1 2023-01-15 2023-01-15
# 1 2 01/20/2023 2023-01-20
# 2 3 January 25, 2023 2023-01-25
# 3 4 not a date NaT
# 4 5 None NaT
# 4. File handling with Pandas
# We learned file handling in Chapter 2
print("\n4. File handling with Pandas:")
# Save the sales data to CSV and JSON
try:
# Create directory if it doesn't exist
import os
if not os.path.exists('data'):
os.makedirs('data')
# Save to CSV
csv_path = 'data/sales_data.csv'
df_sales.to_csv(csv_path, index=False)
print(f"Data saved to {csv_path}")
# Save to JSON
json_path = 'data/sales_data.json'
df_sales.to_json(json_path, orient='records')
print(f"Data saved to {json_path}")
# Read back and display
df_loaded = pd.read_csv(csv_path)
print("\nData read back from CSV:")
print(df_loaded.head())
except (IOError, PermissionError) as e:
print(f"Error saving files: {e}")
# 4. File handling with Pandas:
# Data saved to data/sales_data.csv
# Data saved to data/sales_data.json
#
# Data read back from CSV:
# date product quantity price revenue discount final_price
# 0 2023-01-01 A 10 50 500 50.0 450.0
# 1 2023-01-02 B 5 100 500 0.0 500.0
# 2 2023-01-03 A 15 50 750 75.0 675.0
# 3 2023-01-03 C 8 75 600 0.0 600.0
# 4 2023-01-04 B 12 100 1200 120.0 1080.0
3.4 Micro-Project: Pandas Sales Analyzer
Now, let’s apply what we’ve learned to refactor the Sales Data Analyzer from Chapter 1 using NumPy and Pandas for more efficient and powerful data analysis.
Project Requirements
- Convert the existing Sales Analyzer script to use Pandas DataFrames
- Use NumPy for any required numerical computations
- Implement more advanced analytics (sales trends, product correlations, etc.)
- Create a simple visualization of a key metric using Pandas built-in plotting
Acceptance Criteria
- Script successfully loads data into a Pandas DataFrame
- Uses appropriate Pandas functions for data cleaning (handling nulls, duplicates, etc.)
- Implements at least three analytical operations using Pandas (e.g., groupby, filtering, aggregation)
- Includes at least one NumPy operation for numerical calculation
- Generates at least one visualization (bar chart, line graph, etc.)
- Execution time is measurably faster than the previous non-Pandas version
Common Pitfalls
Confusion between Pandas Series and DataFrames
- A Series is one-dimensional, like a single labeled column
- A DataFrame is two-dimensional, like a table
- Selecting a single column from a DataFrame returns a Series
Modifying DataFrames without assignment
- Operations like df.drop() don't modify the DataFrame in place unless inplace=True is specified
- Always assign the result back to a variable, or pass inplace=True when that is what you intend
Performance issues with iterating over rows
- Avoid explicit loops with .iterrows() where possible (see the short timing sketch after this list)
- Use vectorized operations, .apply(), or aggregation functions instead
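To make the last pitfall concrete, here is a small timing sketch. The absolute numbers depend on your machine; the point is the relative gap between looping over rows and vectorizing:
import time
import numpy as np
import pandas as pd

df_timing = pd.DataFrame({
    'quantity': np.random.randint(1, 10, 100000),
    'price': np.random.uniform(10, 100, 100000)
})

start = time.time()
loop_revenue = [row['quantity'] * row['price'] for _, row in df_timing.iterrows()]
loop_time = time.time() - start

start = time.time()
vector_revenue = df_timing['quantity'] * df_timing['price']
vector_time = time.time() - start

print(f"iterrows loop: {loop_time:.3f} s, vectorized: {vector_time:.4f} s")
# The vectorized version is typically hundreds of times faster on data this size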
Production vs. Educational Implementation
In a real production environment, a Sales Analyzer would differ in several ways:
Data Sources:
- Production: Connect to databases, APIs, or cloud storage
- Our version: Use sample data created within the script
Performance Optimization:
- Production: Use optimized query patterns, caching, and parallel processing
- Our version: Basic optimization with vectorized operations
Visualization:
- Production: Interactive dashboards using tools like Tableau or Power BI
- Our version: Simple static plots with Pandas
Deployment:
- Production: Run as scheduled jobs or serve via web applications
- Our version: Run as a standalone script
Let’s implement the solution:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
from datetime import datetime
def create_sample_sales_data(num_records=1000):
"""
Create sample sales data for analysis.
Args:
num_records: Number of records to generate
Returns:
DataFrame with sample sales data
"""
# Set random seed for reproducibility
np.random.seed(42)
# Generate dates spread evenly across a fixed 90-day window (Jan 1 - Mar 31, 2023)
start_date = pd.Timestamp('2023-01-01')
end_date = pd.Timestamp('2023-03-31')
dates = pd.date_range(start=start_date, end=end_date, periods=num_records)
# List of products
products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones',
'Printer', 'Webcam', 'Speakers', 'Hard Drive', 'USB Drive']
# List of regions
regions = ['North', 'South', 'East', 'West', 'Central']
# Generate random data
data = {
'order_id': np.arange(1, num_records + 1),
'date': dates,
'product': np.random.choice(products, size=num_records),
'region': np.random.choice(regions, size=num_records),
'quantity': np.random.randint(1, 10, size=num_records),
'unit_price': np.random.uniform(10, 1000, size=num_records).round(2),
'customer_id': np.random.randint(1, 101, size=num_records)
}
# Create DataFrame
df = pd.DataFrame(data)
# Add some missing values
mask = np.random.random(num_records) < 0.05 # 5% of data will be missing
df.loc[mask, 'quantity'] = np.nan
# Add some duplicate orders (same order_id, different products)
duplicates = np.random.choice(df['order_id'], size=int(num_records * 0.1), replace=False)
for order_id in duplicates:
original_row = df[df['order_id'] == order_id].iloc[0].to_dict()
original_row['product'] = np.random.choice([p for p in products if p != original_row['product']])
original_row['quantity'] = np.random.randint(1, 10)
df = pd.concat([df, pd.DataFrame([original_row])], ignore_index=True)
# Calculate total price
df['total_price'] = df['quantity'] * df['unit_price']
# Sort by date and order_id
df = df.sort_values(['date', 'order_id']).reset_index(drop=True)
# Add a few more fields for analysis
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek # 0 = Monday, 6 = Sunday
return df
def analyze_sales_data(df):
"""
Analyze sales data using Pandas and NumPy.
Args:
df: DataFrame with sales data
Returns:
Dictionary of analysis results
"""
start_time = time.time()
print("Starting sales data analysis...")
results = {
'summary': {},
'by_product': {},
'by_region': {},
'by_time': {},
'trends_and_patterns': {}
}
# Data cleaning
print("Cleaning data...")
# Handle missing quantities by filling with median
median_quantity = df['quantity'].median()
df['quantity'] = df['quantity'].fillna(median_quantity)
# Recalculate total_price where quantity was filled
df['total_price'] = df['quantity'] * df['unit_price']
# Basic summary statistics
print("Calculating summary statistics...")
results['summary']['total_orders'] = df['order_id'].nunique()
results['summary']['total_sales'] = df['total_price'].sum()
results['summary']['average_order_value'] = df.groupby('order_id')['total_price'].sum().mean()
results['summary']['total_units_sold'] = df['quantity'].sum()
results['summary']['average_unit_price'] = df['unit_price'].mean()
# Analysis by product
print("Analyzing sales by product...")
product_stats = df.groupby('product').agg({
'total_price': 'sum',
'quantity': 'sum',
'order_id': pd.Series.nunique
}).rename(columns={
'total_price': 'revenue',
'quantity': 'units_sold',
'order_id': 'num_orders'
})
# Add average price per unit
product_stats['avg_price'] = product_stats['revenue'] / product_stats['units_sold']
# Sort by revenue in descending order
product_stats = product_stats.sort_values('revenue', ascending=False)
# Identify top product
top_product = product_stats.index[0]
results['by_product']['top_product'] = {
'name': top_product,
'revenue': product_stats.loc[top_product, 'revenue'],
'units_sold': product_stats.loc[top_product, 'units_sold'],
'num_orders': product_stats.loc[top_product, 'num_orders']
}
# Store all product stats (convert to dictionary for JSON compatibility)
results['by_product']['all_products'] = product_stats.reset_index().to_dict(orient='records')
# Analysis by region
print("Analyzing sales by region...")
region_stats = df.groupby('region').agg({
'total_price': 'sum',
'order_id': pd.Series.nunique,
'quantity': 'sum'
}).rename(columns={
'total_price': 'revenue',
'order_id': 'num_orders',
'quantity': 'units_sold'
})
# Calculate average order value by region
region_stats['avg_order_value'] = region_stats['revenue'] / region_stats['num_orders']
# Sort by revenue
region_stats = region_stats.sort_values('revenue', ascending=False)
# Store region stats
results['by_region']['region_stats'] = region_stats.reset_index().to_dict(orient='records')
# Analysis by time
print("Analyzing sales trends over time...")
# Monthly sales
monthly_sales = df.groupby(df['date'].dt.strftime('%Y-%m')).agg({
'total_price': 'sum',
'order_id': pd.Series.nunique,
'quantity': 'sum'
}).rename(columns={
'total_price': 'revenue',
'order_id': 'num_orders',
'quantity': 'units_sold'
})
# Store monthly stats
results['by_time']['monthly_sales'] = monthly_sales.reset_index().to_dict(orient='records')
# Day of week analysis using NumPy
day_sales = df.groupby('day_of_week')['total_price'].sum().values
# Use NumPy to find which day has the highest sales
best_day_idx = np.argmax(day_sales)
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
results['by_time']['best_sales_day'] = {
'day': days[best_day_idx],
'revenue': day_sales[best_day_idx]
}
# Advanced analysis: product correlations
print("Analyzing product correlations...")
# Find which products are commonly purchased together
# For simplicity, we'll just look at orders with multiple items
multi_item_orders = df[df.duplicated(subset=['order_id'], keep=False)]
if len(multi_item_orders) > 0:
# Group by order_id and collect products
order_products = multi_item_orders.groupby('order_id')['product'].apply(list)
# Find the most common product pairs
product_pairs = []
for products in order_products:
if len(products) >= 2:
# Look at all pairs of products in this order
for i in range(len(products)):
for j in range(i+1, len(products)):
pair = tuple(sorted([products[i], products[j]]))
product_pairs.append(pair)
# Count occurrences of each pair
if product_pairs:
pair_counts = pd.Series(product_pairs).value_counts().head(5)
results['trends_and_patterns']['common_product_pairs'] = [
{'products': list(pair), 'count': count}
for pair, count in pair_counts.items()
]
# Calculate execution time
execution_time = time.time() - start_time
results['execution_time'] = execution_time
print(f"Analysis completed in {execution_time:.2f} seconds!")
return results
def generate_sales_visualizations(df, results):
"""
Create visualizations of the sales data.
Args:
df: DataFrame with sales data
results: Dictionary of analysis results
"""
print("Generating visualizations...")
# Create a directory for plots if it doesn't exist
import os
if not os.path.exists('plots'):
os.makedirs('plots')
# 1. Product Revenue Bar Chart
plt.figure(figsize=(10, 6))
top_products = pd.DataFrame(results['by_product']['all_products']).sort_values('revenue', ascending=False).head(5)
plt.bar(top_products['product'], top_products['revenue'], color='skyblue')
plt.title('Revenue by Top 5 Products')
plt.xlabel('Product')
plt.ylabel('Revenue ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('plots/top_products_revenue.png')
plt.close()
# 2. Monthly Sales Line Chart
plt.figure(figsize=(10, 6))
monthly_data = pd.DataFrame(results['by_time']['monthly_sales'])
plt.plot(monthly_data['date'], monthly_data['revenue'], marker='o', linestyle='-', color='green')
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.savefig('plots/monthly_sales_trend.png')
plt.close()
# 3. Sales by Region Pie Chart
plt.figure(figsize=(8, 8))
region_data = pd.DataFrame(results['by_region']['region_stats'])
plt.pie(region_data['revenue'], labels=region_data['region'], autopct='%1.1f%%',
startangle=90, shadow=True)
plt.title('Revenue Distribution by Region')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle
plt.tight_layout()
plt.savefig('plots/region_distribution.png')
plt.close()
# 4. Day of Week Sales Bar Chart
plt.figure(figsize=(10, 6))
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_sales = df.groupby('day_of_week')['total_price'].sum()
# Reindex to ensure all days are included in the correct order
day_sales = day_sales.reindex(range(7), fill_value=0)
plt.bar(day_names, day_sales.values, color='coral')
plt.title('Sales by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Revenue ($)')
plt.tight_layout()
plt.savefig('plots/day_of_week_sales.png')
plt.close()
print("Visualizations have been saved to the 'plots' directory!")
def format_results(results):
"""
Format analysis results for a nice text output.
Args:
results: Dictionary of analysis results
Returns:
Formatted string with the results
"""
# Format currency values
def currency(value):
return f"${value:,.2f}"
# Build the report
report = [
"=" * 50,
"SALES DATA ANALYSIS REPORT",
"=" * 50,
"",
"SUMMARY STATISTICS",
"-" * 30,
f"Total Orders: {results['summary']['total_orders']:,}",
f"Total Sales: {currency(results['summary']['total_sales'])}",
f"Average Order Value: {currency(results['summary']['average_order_value'])}",
f"Total Units Sold: {results['summary']['total_units_sold']:,}",
f"Average Unit Price: {currency(results['summary']['average_unit_price'])}",
"",
"TOP PRODUCT",
"-" * 30,
f"Product: {results['by_product']['top_product']['name']}",
f"Revenue: {currency(results['by_product']['top_product']['revenue'])}",
f"Units Sold: {results['by_product']['top_product']['units_sold']:,}",
f"Number of Orders: {results['by_product']['top_product']['num_orders']:,}",
"",
"TOP 5 PRODUCTS BY REVENUE",
"-" * 30
]
# Add top 5 products
top_products = pd.DataFrame(results['by_product']['all_products']).head(5)
for _, product in top_products.iterrows():
report.append(f"{product['product']}: {currency(product['revenue'])} ({product['units_sold']:,} units)")
report.extend([
"",
"REVENUE BY REGION",
"-" * 30
])
# Add region stats
region_stats = pd.DataFrame(results['by_region']['region_stats'])
for _, region in region_stats.iterrows():
report.append(f"{region['region']}: {currency(region['revenue'])} ({region['num_orders']:,} orders)")
report.extend([
"",
"TIME ANALYSIS",
"-" * 30,
f"Best Sales Day: {results['by_time']['best_sales_day']['day']} " +
f"({currency(results['by_time']['best_sales_day']['revenue'])})",
"",
"TRENDS AND PATTERNS",
"-" * 30
])
# Add common product pairs if available
if 'common_product_pairs' in results['trends_and_patterns']:
report.append("Commonly Purchased Together:")
for pair in results['trends_and_patterns']['common_product_pairs']:
report.append(f" - {pair['products'][0]} + {pair['products'][1]} ({pair['count']} times)")
report.extend([
"",
f"Analysis completed in {results['execution_time']:.2f} seconds!",
"=" * 50
])
return "\n".join(report)
def main():
"""Main function to run the sales analyzer."""
print("Pandas Sales Analyzer")
print("====================")
# Generate sample data
print("Generating sample sales data...")
sales_data = create_sample_sales_data(num_records=1000)
print(f"Generated {len(sales_data)} records.")
# Display sample of the data
print("\nSample of the data:")
print(sales_data.head())
# Analyze the data
results = analyze_sales_data(sales_data)
# Generate visualizations
generate_sales_visualizations(sales_data, results)
# Format and display results
report = format_results(results)
print("\n" + report)
# Save results to files
try:
# Save report to text file
with open('sales_analysis_report.txt', 'w') as f:
f.write(report)
# Save full results to JSON
import json
with open('sales_analysis_results.json', 'w') as f:
json.dump(results, f, indent=2, default=str)
# Save the DataFrame to CSV
sales_data.to_csv('sales_data.csv', index=False)
print("\nResults have been saved to:")
print("- sales_analysis_report.txt (Text report)")
print("- sales_analysis_results.json (JSON data)")
print("- sales_data.csv (CSV data)")
print("- plots/ directory (Visualizations)")
except Exception as e:
print(f"Error saving results: {e}")
# Run the script
if __name__ == "__main__":
main()
How to Run and Test the Solution
To run this solution:
- Save the above code to a file (e.g., pandas_sales_analyzer.py)
- Run the script: python pandas_sales_analyzer.py
- The script will:
  - Generate a sample dataset
  - Perform various analyses using Pandas and NumPy
  - Create visualizations of the results
  - Print a formatted report to the console
  - Save the results to several files:
    - sales_analysis_report.txt - A formatted text report
    - sales_analysis_results.json - The full results in JSON format
    - sales_data.csv - The generated sales data
    - plots/ directory - Contains visualization images
To test different scenarios:
- Change the number of records: Modify the num_records parameter in the create_sample_sales_data call to test with different dataset sizes
- Add more analyses: Extend the analyze_sales_data function with additional analyses
- Create new visualizations: Add more plots to the generate_sales_visualizations function
The solution showcases several advantages of using Pandas (illustrated in the short sketch that follows this list):
- Efficient data cleaning and handling of missing values
- Powerful aggregation using groupby
- Vectorized operations for better performance
- Built-in visualization capabilities
- Seamless integration with NumPy for numerical operations
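For example, here is a minimal, self-contained sketch of those idioms; the column names and values are made up purely for illustration:

```python
import pandas as pd

# Hypothetical mini-dataset; the columns loosely mirror the sales data used above
orders = pd.DataFrame({
    'region': ['North', 'South', 'North', 'West'],
    'quantity': [2, 1, 3, 2],
    'unit_price': [19.99, 5.50, 19.99, 12.00],
})

# Vectorized column math: no explicit Python loop
orders['total'] = orders['quantity'] * orders['unit_price']

# One-line aggregation with groupby
print(orders.groupby('region')['total'].sum())
```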
3.5 Practice Exercises
Reinforce your understanding with these practical exercises:
Exercise 1: NumPy Array Operations
Create a function that:
- Takes two NumPy arrays of equal length
- Calculates the element-wise product
- Returns the sum of the product (dot product)
- Compares the speed with a traditional Python approach
Exercise 2: Pandas Data Cleaning
Write a function that:
- Reads a CSV file into a Pandas DataFrame
- Cleans the data by:
- Removing duplicate rows
- Handling missing values
- Converting data types appropriately
- Creates a new column based on existing data
- Returns the cleaned DataFrame
Exercise 3: Data Analysis with Pandas
Create a function that:
- Analyzes a sales dataset
- Calculates:
- Total sales by category
- Month-over-month growth
- Average order size
- Returns the results as a new DataFrame
Exercise 4: NumPy and Pandas Integration
Write a function that:
- Generates a NumPy array with random data
- Converts it to a Pandas DataFrame
- Performs statistical analysis (mean, median, standard deviation)
- Creates a simple plot showing the distribution
Exercise 5: Real-world Data Processing
Create a comprehensive script that:
- Reads data from a CSV file
- Cleans and transforms the data
- Performs multiple analyses
- Generates visualizations
- Exports results to multiple formats
3.6 Exercise Solutions
Here are solutions to the practice exercises:
Solution to Exercise 1: NumPy Array Operations
import numpy as np
import time
def dot_product_numpy(array1, array2):
"""
Calculate dot product using NumPy.
Args:
array1: First NumPy array
array2: Second NumPy array
Returns:
Dot product of the two arrays
"""
return np.dot(array1, array2)
def dot_product_python(list1, list2):
"""
Calculate dot product using pure Python.
Args:
list1: First list
list2: Second list
Returns:
Dot product of the two lists
"""
result = 0
for i in range(len(list1)):
result += list1[i] * list2[i]
return result
def compare_performance(size=1000000):
"""
Compare performance between NumPy and pure Python implementations.
Args:
size: Size of arrays to test
Returns:
Dictionary with results and performance comparison
"""
# Generate random data
np.random.seed(42) # For reproducibility
array1 = np.random.rand(size)
array2 = np.random.rand(size)
# Convert NumPy arrays to Python lists for fair comparison
list1 = array1.tolist()
list2 = array2.tolist()
# Test NumPy performance
start_time = time.time()
numpy_result = dot_product_numpy(array1, array2)
numpy_time = time.time() - start_time
# Test Python performance
start_time = time.time()
python_result = dot_product_python(list1, list2)
python_time = time.time() - start_time
# Calculate speedup
speedup = python_time / numpy_time
# Print results
print(f"Array size: {size:,}")
print(f"NumPy result: {numpy_result}")
print(f"Python result: {python_result}")
print(f"NumPy time: {numpy_time:.6f} seconds")
print(f"Python time: {python_time:.6f} seconds")
print(f"NumPy is {speedup:.1f}x faster!")
return {
'size': size,
'numpy_result': numpy_result,
'python_result': python_result,
'numpy_time': numpy_time,
'python_time': python_time,
'speedup': speedup
}
# Test the function with smaller arrays for demonstration
results = compare_performance(size=100000)
# Array size: 100,000
# NumPy result: 25035.727199509654
# Python result: 25035.727199509636
# NumPy time: 0.000237 seconds
# Python time: 0.041307 seconds
# NumPy is 174.0x faster!
Solution to Exercise 2: Pandas Data Cleaning
import pandas as pd
import numpy as np
import io
# Create a sample CSV string with messy data for testing
sample_csv = """
id,name,age,salary,department,hire_date
1,John Smith,35,$75000,Engineering,2018-05-12
2,Jane Doe,42,$82000,Marketing,2015-10-23
3,,28,$65000,Engineering,2020-01-15
4,Bob Johnson,45,,Finance,2010-03-07
5,Alice Brown,33,$70000,Marketing,2019-11-30
2,Jane Doe,42,$82000,Marketing,2015-10-23
6,Charlie Wilson,,$90000,Engineering,invalid-date
7,Diana Miller,38,$78000,Finance,2017-07-19
8,David Wang,31,,Engineering,2021-02-28
"""
def clean_data(csv_file):
"""
Clean and prepare a CSV dataset.
Args:
csv_file: Path to CSV file or CSV string for testing
Returns:
Cleaned Pandas DataFrame
"""
# Read the CSV file
if isinstance(csv_file, str) and '\n' in csv_file:
# If it's a CSV string (for testing)
df = pd.read_csv(io.StringIO(csv_file))
else:
# If it's a file path
df = pd.read_csv(csv_file)
print("Original DataFrame:")
print(df.head())
print(f"Original shape: {df.shape}")
# Step 1: Remove duplicate rows
df_no_dupes = df.drop_duplicates().copy()  # copy so later column assignments don't trigger SettingWithCopyWarning
print(f"\nAfter removing duplicates: {df_no_dupes.shape}")
print(f"Removed {df.shape[0] - df_no_dupes.shape[0]} duplicate rows")
# Step 2: Handle missing values
# Fill missing names with "Unknown"
df_no_dupes['name'] = df_no_dupes['name'].fillna("Unknown")
# Convert age to numeric first, then fill missing ages with the median age
df_no_dupes['age'] = pd.to_numeric(df_no_dupes['age'], errors='coerce')
median_age = df_no_dupes['age'].median()
df_no_dupes['age'] = df_no_dupes['age'].fillna(median_age)
# Convert salary from string to numeric
df_no_dupes['salary'] = df_no_dupes['salary'].str.replace('$', '', regex=False)
df_no_dupes['salary'] = df_no_dupes['salary'].str.replace(',', '', regex=False)
df_no_dupes['salary'] = pd.to_numeric(df_no_dupes['salary'], errors='coerce')
# Fill missing salaries with the median by department
department_median_salary = df_no_dupes.groupby('department')['salary'].transform('median')
df_no_dupes['salary'] = df_no_dupes['salary'].fillna(department_median_salary)
# If still missing (e.g., if entire department is missing), use overall median
overall_median_salary = df_no_dupes['salary'].median()
df_no_dupes['salary'] = df_no_dupes['salary'].fillna(overall_median_salary)
# Step 3: Convert data types
# Convert hire_date to datetime
df_no_dupes['hire_date'] = pd.to_datetime(df_no_dupes['hire_date'], errors='coerce')
# Fill invalid dates with a default date
df_no_dupes['hire_date'] = df_no_dupes['hire_date'].fillna(pd.Timestamp('2000-01-01'))
# Step 4: Create new calculated columns
# Add years of service
today = pd.Timestamp('2023-04-15') # Example "today" date
df_no_dupes['years_of_service'] = ((today - df_no_dupes['hire_date']).dt.days / 365.25).round(1)
# Add salary tier
def get_salary_tier(salary):
if salary < 70000:
return 'Low'
elif salary < 85000:
return 'Medium'
else:
return 'High'
df_no_dupes['salary_tier'] = df_no_dupes['salary'].apply(get_salary_tier)
print("\nCleaned DataFrame:")
print(df_no_dupes)
print("\nDataFrame Info:")
print(df_no_dupes.dtypes)
return df_no_dupes
# Test the function with our sample CSV data
cleaned_df = clean_data(sample_csv)
# Original DataFrame:
# id name age salary department hire_date
# 0 1 John Smith 35.0 $75000 Engineering 2018-05-12
# 1 2 Jane Doe 42.0 $82000 Marketing 2015-10-23
# 2 3 NaN 28.0 $65000 Engineering 2020-01-15
# 3 4 Bob Johnson 45.0 NaN Finance 2010-03-07
# 4 5 Alice Brown 33.0 $70000 Marketing 2019-11-30
# Original shape: (9, 6)
#
# After removing duplicates: (8, 6)
# Removed 1 duplicate rows
#
# Cleaned DataFrame:
# id name age salary department hire_date years_of_service salary_tier
# 0 1 John Smith 35.0 75000.0 Engineering 2018-05-12 4.9 Medium
# 1 2 Jane Doe 42.0 82000.0 Marketing 2015-10-23 7.5 Medium
# 2 3 Unknown 28.0 65000.0 Engineering 2020-01-15 3.2 Low
# 3 4 Bob Johnson 45.0 70000.0 Finance 2010-03-07 13.1 Medium
# 4 5 Alice Brown 33.0 70000.0 Marketing 2019-11-30 3.4 Medium
# 5 6 Charlie Wil NaN 90000.0 Engineering 2000-01-01 23.3 High
# 6 7 Diana Miller 38.0 78000.0 Finance 2017-07-19 5.7 Medium
# 7 8 David Wang 31.0 70000.0 Engineering 2021-02-28 2.1 Medium
#
# DataFrame Info:
# id int64
# name object
# age float64
# salary float64
# department object
# hire_date datetime64[ns]
# years_of_service float64
# salary_tier object
# dtype: object
Solution to Exercise 3: Data Analysis with Pandas
import pandas as pd
import numpy as np
# Create sample sales data for analysis
def create_sample_sales_data():
"""Create sample sales data for testing."""
# Set seed for reproducibility
np.random.seed(42)
# Define parameters
categories = ['Electronics', 'Clothing', 'Home', 'Sports', 'Books']
months = pd.date_range(start='2022-01-01', end='2023-03-01', freq='MS')
# Generate data
data = []
for month in months:
# Generate 20-30 orders per month
num_orders = np.random.randint(20, 31)
for _ in range(num_orders):
# Create a sale record
category = np.random.choice(categories)
# Different price ranges per category
if category == 'Electronics':
price = np.random.uniform(100, 1000)
elif category == 'Clothing':
price = np.random.uniform(20, 200)
elif category == 'Home':
price = np.random.uniform(50, 500)
elif category == 'Sports':
price = np.random.uniform(30, 300)
else: # Books
price = np.random.uniform(10, 50)
# Generate quantity
quantity = np.random.randint(1, 6)
# Generate random order ID
order_id = np.random.randint(10000, 99999)
# Add to data
data.append({
'order_id': order_id,
'date': month,
'category': category,
'price': price.round(2),
'quantity': quantity,
'total': (price * quantity).round(2)
})
# Convert to DataFrame
return pd.DataFrame(data)
def analyze_sales(sales_df):
"""
Analyze sales data to calculate key metrics.
Args:
sales_df: DataFrame with sales data
Returns:
Dictionary with analysis results
"""
print("Starting sales analysis...")
print(f"Dataset contains {len(sales_df)} records.")
# Create a copy to avoid modifying the original
df = sales_df.copy()
# Ensure date is datetime type
df['date'] = pd.to_datetime(df['date'])
# Add month and year columns for easier grouping
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['year_month'] = df['date'].dt.strftime('%Y-%m')
# 1. Total sales by category
print("\nCalculating total sales by category...")
category_sales = df.groupby('category').agg({
'total': 'sum',
'quantity': 'sum',
'order_id': pd.Series.nunique
}).rename(columns={
'total': 'total_sales',
'quantity': 'units_sold',
'order_id': 'order_count'
})
# Add average order value
category_sales['avg_order_value'] = category_sales['total_sales'] / category_sales['order_count']
# Sort by total sales
category_sales = category_sales.sort_values('total_sales', ascending=False)
print("Category sales summary:")
print(category_sales)
# 2. Monthly sales and growth
print("\nCalculating monthly sales and growth...")
monthly_sales = df.groupby('year_month').agg({
'total': 'sum',
'order_id': pd.Series.nunique
}).rename(columns={
'total': 'total_sales',
'order_id': 'order_count'
})
# Calculate month-over-month growth
monthly_sales['previous_month_sales'] = monthly_sales['total_sales'].shift(1)
monthly_sales['mom_growth'] = (
(monthly_sales['total_sales'] - monthly_sales['previous_month_sales']) /
monthly_sales['previous_month_sales'] * 100
)
print("Monthly sales and growth:")
print(monthly_sales)
# 3. Average order size
print("\nCalculating average order size metrics...")
order_size = df.groupby('order_id')['total'].sum().reset_index()
avg_order_size = order_size['total'].mean()
median_order_size = order_size['total'].median()
print(f"Average order size: ${avg_order_size:.2f}")
print(f"Median order size: ${median_order_size:.2f}")
# 4. Product category by month analysis
print("\nAnalyzing category trends by month...")
category_by_month = df.pivot_table(
index='year_month',
columns='category',
values='total',
aggfunc='sum',
fill_value=0
)
print("Category sales by month:")
print(category_by_month.head())
# Return all results as a dictionary
results = {
'category_sales': category_sales.reset_index().to_dict(orient='records'),
'monthly_sales': monthly_sales.reset_index().to_dict(orient='records'),
'avg_order_size': avg_order_size,
'median_order_size': median_order_size,
'category_by_month': category_by_month.reset_index().to_dict(orient='records')
}
return results
# Test the analysis function with sample data
sales_df = create_sample_sales_data()
print("Sample of generated sales data:")
print(sales_df.head())
print(f"Shape: {sales_df.shape}")
# Run the analysis
analysis_results = analyze_sales(sales_df)
# Display a summary of key results
print("\nSUMMARY OF KEY RESULTS:")
print("-" * 30)
top_category = pd.DataFrame(analysis_results['category_sales']).iloc[0]
print(f"Top selling category: {top_category['category']} (${top_category['total_sales']:.2f})")
print(f"Average order size: ${analysis_results['avg_order_size']:.2f}")
# Sample of generated sales data:
# order_id date category price quantity total
# 0 44066 2022-01-01 Electronics 798.80 2 1597.60
# 1 37742 2022-01-01 Electronics 523.05 2 1046.10
# 2 73162 2022-01-01 Clothing 78.32 4 313.28
# 3 86291 2022-01-01 Electronics 323.15 2 646.30
# 4 51090 2022-01-01 Clothing 103.02 1 103.02
# Shape: (434, 6)
# Starting sales analysis...
# Dataset contains 434 records.
#
# Calculating total sales by category...
# Category sales summary:
# total_sales units_sold order_count avg_order_value
# category
# Electronics 71601.81 283 137 522.642409
# Home 22673.50 296 113 200.650398
# Clothing 10323.13 311 128 80.649473
# Sports 8767.03 235 108 81.176204
# Books 2075.02 306 120 17.291813
#
# Calculating monthly sales and growth...
# Monthly sales and growth:
# total_sales order_count previous_month_sales mom_growth
# year_month
# 2022-01 6775.98 23 NaN NaN
# 2022-02 4970.21 28 6775.980 -26.648485
# 2022-03 4915.87 27 4970.210 -1.093556
# 2022-04 6084.55 26 4915.870 23.773472
# 2022-05 4866.55 21 6084.550 -20.018703
# ... ... ... ... ...
# 2022-12 4892.21 21 5392.060 -9.270983
# 2023-01 4698.69 20 4892.210 -3.955863
# 2023-02 4708.08 23 4698.690 0.199822
# 2023-03 4850.07 26 4708.080 3.016861
#
# [15 rows x 4 columns]
#
# Calculating average order size metrics...
# Average order size: $267.37
# Median order size: $168.48
#
# Analyzing category trends by month...
# Category sales by month:
# Books Clothing Electronics Home Sports
# year_month
# 2022-01 119.38 741.02 4703.56 1044.37 167.65
# 2022-02 104.89 598.17 2664.04 1095.28 507.83
# 2022-03 139.51 533.12 2812.85 839.29 591.10
# 2022-04 129.13 652.73 4011.54 885.25 405.90
# 2022-05 148.44 548.72 2800.31 1021.51 347.57
#
# SUMMARY OF KEY RESULTS:
# ------------------------------
# Top selling category: Electronics ($71601.81)
# Average order size: $267.37
Solution to Exercise 4: NumPy and Pandas Integration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def analyze_random_distribution(num_samples=1000, num_distributions=3):
"""
Generate random data with NumPy, analyze with Pandas, and visualize.
Args:
num_samples: Number of data points to generate
num_distributions: Number of different distributions to generate
Returns:
Pandas DataFrame with the statistical analysis
"""
print(f"Generating {num_samples} random samples from {num_distributions} distributions...")
# Set seed for reproducibility
np.random.seed(42)
# Generate different random distributions
distributions = {
'normal': np.random.normal(loc=0, scale=1, size=num_samples),
'uniform': np.random.uniform(low=-2, high=2, size=num_samples),
'exponential': np.random.exponential(scale=1, size=num_samples)
}
# Convert to DataFrame
df = pd.DataFrame(distributions)
print("First few rows of the generated data:")
print(df.head())
# Statistical analysis
stats = df.describe()
print("\nStatistical analysis:")
print(stats)
# Calculate additional statistics using NumPy
for column in df.columns:
# Use NumPy functions on Pandas data
stats.loc['skewness', column] = np.round(np.mean(((df[column] - df[column].mean()) / df[column].std())**3), 4)
stats.loc['kurtosis', column] = np.round(np.mean(((df[column] - df[column].mean()) / df[column].std())**4) - 3, 4)
print("\nExtended statistics with skewness and kurtosis:")
print(stats)
# Create histograms to visualize the distributions
plt.figure(figsize=(15, 5))
for i, column in enumerate(df.columns, 1):
plt.subplot(1, 3, i)
df[column].hist(bins=30, alpha=0.7, label=column)
plt.axvline(df[column].mean(), color='red', linestyle='dashed', linewidth=1,
label=f'Mean: {df[column].mean():.2f}')
plt.axvline(df[column].median(), color='green', linestyle='dashed', linewidth=1,
label=f'Median: {df[column].median():.2f}')
plt.title(f'{column.capitalize()} Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.savefig('distribution_comparison.png')
plt.close()
print("\nDistribution histogram saved as 'distribution_comparison.png'")
return stats
# Run the analysis
statistics = analyze_random_distribution(num_samples=1000)
# Generating 1000 random samples from 3 distributions...
# First few rows of the generated data:
# normal uniform exponential
# 0 0.496714 -1.142545 0.913859
# 1 -0.138264 -1.149446 0.918036
# 2 0.647689 1.815984 0.538155
# 3 1.523030 0.212951 0.255931
# 4 -0.234153 0.594340 1.285226
#
# Statistical analysis:
# normal uniform exponential
# count 1000.00000 1000.00000 1000.00000
# mean -0.02566 -0.00837 0.95804
# std 0.98815 1.15702 0.93883
# min -3.08228 -1.99937 0.00190
# 25% -0.70450 -1.14440 0.26771
# 50% -0.01529 -0.01108 0.67898
# 75% 0.67877 1.13659 1.38559
# max 2.87453 1.99938 6.46958
#
# Extended statistics with skewness and kurtosis:
# normal uniform exponential
# count 1000.00000 1000.00000 1000.00000
# mean -0.02566 -0.00837 0.95804
# std 0.98815 1.15702 0.93883
# min -3.08228 -1.99937 0.00190
# 25% -0.70450 -1.14440 0.26771
# 50% -0.01529 -0.01108 0.67898
# 75% 0.67877 1.13659 1.38559
# max 2.87453 1.99938 6.46958
# skewness -0.0305 -0.0076 1.6835
# kurtosis 0.0805 -1.2061 3.9649
#
# Distribution histogram saved as 'distribution_comparison.png'
Solution to Exercise 5: Real-world Data Processing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import json
from datetime import datetime
def process_retail_data(csv_file='retail_data.csv'):
"""
Comprehensive data processing script that reads, cleans, analyzes,
visualizes, and exports retail sales data.
Args:
csv_file: Path to CSV file with retail data
Returns:
Dictionary with analysis results
"""
# Create output directories if they don't exist
for directory in ['output', 'output/csv', 'output/json', 'output/plots']:
if not os.path.exists(directory):
os.makedirs(directory)
# Step 1: Check if we need to create sample data for testing
if not os.path.exists(csv_file):
print(f"File {csv_file} not found. Creating sample data for testing...")
create_sample_retail_data(csv_file)
# Step 2: Read the data
print(f"Reading data from {csv_file}...")
df = pd.read_csv(csv_file)
print(f"Read {len(df)} rows and {len(df.columns)} columns.")
print("First few rows:")
print(df.head())
# Step 3: Clean and prepare the data
print("\nCleaning and preparing data...")
clean_df = clean_retail_data(df)
print("Data cleaned successfully. First few rows of cleaned data:")
print(clean_df.head())
# Save cleaned data
clean_df.to_csv('output/csv/cleaned_retail_data.csv', index=False)
print("Cleaned data saved to output/csv/cleaned_retail_data.csv")
# Step 4: Perform multiple analyses
print("\nPerforming data analysis...")
analysis_results = analyze_retail_data(clean_df)
print("Analysis completed successfully.")
# Step 5: Generate visualizations
print("\nGenerating visualizations...")
create_visualizations(clean_df, analysis_results)
print("Visualizations saved to output/plots/ directory")
# Step 6: Export results in multiple formats
print("\nExporting results...")
export_results(clean_df, analysis_results)
print("Results exported to CSV and JSON formats in output/ directory")
# Return the analysis results
return analysis_results
def create_sample_retail_data(filename, num_records=1000):
"""Create sample retail data for testing."""
# Set seed for reproducibility
np.random.seed(42)
# Define parameters
products = ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor', 'Keyboard',
'Mouse', 'Printer', 'Camera', 'Speaker']
categories = ['Electronics', 'Accessories', 'Peripherals', 'Audio']
product_categories = {
'Laptop': 'Electronics',
'Smartphone': 'Electronics',
'Tablet': 'Electronics',
'Headphones': 'Audio',
'Monitor': 'Electronics',
'Keyboard': 'Peripherals',
'Mouse': 'Peripherals',
'Printer': 'Peripherals',
'Camera': 'Electronics',
'Speaker': 'Audio'
}
regions = ['North', 'South', 'East', 'West', 'Central']
# Price ranges for products
price_ranges = {
'Laptop': (500, 2000),
'Smartphone': (300, 1200),
'Tablet': (200, 800),
'Headphones': (50, 300),
'Monitor': (150, 500),
'Keyboard': (20, 150),
'Mouse': (10, 80),
'Printer': (100, 400),
'Camera': (200, 1000),
'Speaker': (50, 300)
}
# Generate data
dates = pd.date_range(start='2022-01-01', end='2023-03-31', freq='D')
data = []
for _ in range(num_records):
product = np.random.choice(products)
category = product_categories[product]
price_range = price_ranges[product]
price = np.random.uniform(price_range[0], price_range[1])
record = {
'transaction_id': f"T{np.random.randint(10000, 99999)}",
'date': pd.Timestamp(np.random.choice(dates)).strftime('%Y-%m-%d'),  # wrap in Timestamp so strftime is available
'product': product,
'category': category,
'quantity': np.random.randint(1, 6),
'price': round(price, 2),
'region': np.random.choice(regions),
'customer_id': f"C{np.random.randint(1000, 9999)}",
'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'PayPal', 'Cash']),
'rating': np.nan if np.random.random() < 0.3 else np.random.randint(1, 6)  # ~30% of ratings are missing
}
# Add some missing values
if np.random.random() < 0.05:
record['price'] = np.nan
if np.random.random() < 0.05:
record['quantity'] = np.nan
data.append(record)
# Convert to DataFrame and save
df = pd.DataFrame(data)
df.to_csv(filename, index=False)
print(f"Sample retail data created and saved to {filename}")
return df
def clean_retail_data(df):
"""Clean and prepare retail data for analysis."""
# Make a copy to avoid modifying the original
clean_df = df.copy()
# 1. Handle missing values
# Fill missing quantities with the median for that product
product_median_quantity = clean_df.groupby('product')['quantity'].transform('median')
clean_df['quantity'] = clean_df['quantity'].fillna(product_median_quantity)
# Fill any remaining missing quantities with the overall median
overall_median_quantity = clean_df['quantity'].median()
clean_df['quantity'] = clean_df['quantity'].fillna(overall_median_quantity)
# Fill missing prices with the median for that product
product_median_price = clean_df.groupby('product')['price'].transform('median')
clean_df['price'] = clean_df['price'].fillna(product_median_price)
# 2. Convert data types
# Convert date to datetime
clean_df['date'] = pd.to_datetime(clean_df['date'])
# Ensure quantity is integer
clean_df['quantity'] = clean_df['quantity'].astype(int)
# 3. Add calculated columns
# Calculate total amount
clean_df['total_amount'] = clean_df['quantity'] * clean_df['price']
# Add month and year columns
clean_df['month'] = clean_df['date'].dt.month
clean_df['year'] = clean_df['date'].dt.year
clean_df['month_year'] = clean_df['date'].dt.strftime('%Y-%m')
# Add day of week
clean_df['day_of_week'] = clean_df['date'].dt.dayofweek
clean_df['is_weekend'] = clean_df['day_of_week'].apply(lambda x: 1 if x >= 5 else 0)
# 4. Remove obvious outliers
# Find outliers based on z-score
z_scores = np.abs((clean_df['total_amount'] - clean_df['total_amount'].mean()) / clean_df['total_amount'].std())
clean_df = clean_df[z_scores < 3] # Keep rows with z-score < 3
return clean_df
def analyze_retail_data(df):
"""Perform multiple analyses on the retail data."""
results = {}
# 1. Sales summary
total_sales = df['total_amount'].sum()
total_transactions = df['transaction_id'].nunique()
total_units = df['quantity'].sum()
avg_transaction_value = total_sales / total_transactions
results['sales_summary'] = {
'total_sales': total_sales,
'total_transactions': total_transactions,
'total_units': total_units,
'avg_transaction_value': avg_transaction_value
}
# 2. Product analysis
product_sales = df.groupby('product').agg({
'total_amount': 'sum',
'quantity': 'sum',
'transaction_id': pd.Series.nunique
}).rename(columns={
'total_amount': 'revenue',
'quantity': 'units_sold',
'transaction_id': 'num_transactions'
}).sort_values('revenue', ascending=False)
results['product_analysis'] = product_sales.reset_index().to_dict(orient='records')
# 3. Category analysis
category_sales = df.groupby('category').agg({
'total_amount': 'sum',
'quantity': 'sum',
'transaction_id': pd.Series.nunique
}).rename(columns={
'total_amount': 'revenue',
'quantity': 'units_sold',
'transaction_id': 'num_transactions'
}).sort_values('revenue', ascending=False)
results['category_analysis'] = category_sales.reset_index().to_dict(orient='records')
# 4. Regional analysis
region_sales = df.groupby('region').agg({
'total_amount': 'sum',
'transaction_id': pd.Series.nunique,
'quantity': 'sum'
}).rename(columns={
'total_amount': 'revenue',
'transaction_id': 'num_transactions',
'quantity': 'units_sold'
}).sort_values('revenue', ascending=False)
results['region_analysis'] = region_sales.reset_index().to_dict(orient='records')
# 5. Time analysis
# Monthly sales
monthly_sales = df.groupby('month_year').agg({
'total_amount': 'sum',
'transaction_id': pd.Series.nunique
}).rename(columns={
'total_amount': 'revenue',
'transaction_id': 'num_transactions'
})
# Calculate month-over-month growth
monthly_sales['previous_month_revenue'] = monthly_sales['revenue'].shift(1)
monthly_sales['mom_growth'] = ((monthly_sales['revenue'] - monthly_sales['previous_month_revenue']) /
monthly_sales['previous_month_revenue'] * 100)
results['time_analysis'] = {
'monthly_sales': monthly_sales.reset_index().to_dict(orient='records'),
'day_of_week': df.groupby('day_of_week')['total_amount'].sum().to_dict(),
'weekend_vs_weekday': {
'weekend_sales': df[df['is_weekend'] == 1]['total_amount'].sum(),
'weekday_sales': df[df['is_weekend'] == 0]['total_amount'].sum()
}
}
# 6. Customer analysis
customer_stats = df.groupby('customer_id').agg({
'total_amount': 'sum',
'transaction_id': pd.Series.nunique,
'quantity': 'sum'
}).rename(columns={
'total_amount': 'total_spent',
'transaction_id': 'num_transactions',
'quantity': 'total_items'
})
customer_stats['avg_transaction_value'] = customer_stats['total_spent'] / customer_stats['num_transactions']
customer_stats['avg_item_value'] = customer_stats['total_spent'] / customer_stats['total_items']
# Get top customers
top_customers = customer_stats.sort_values('total_spent', ascending=False).head(10)
results['customer_analysis'] = {
'top_customers': top_customers.reset_index().to_dict(orient='records'),
'avg_transactions_per_customer': customer_stats['num_transactions'].mean(),
'avg_spend_per_customer': customer_stats['total_spent'].mean()
}
# 7. Payment method analysis
payment_stats = df.groupby('payment_method').agg({
'total_amount': 'sum',
'transaction_id': pd.Series.nunique
}).rename(columns={
'total_amount': 'revenue',
'transaction_id': 'num_transactions'
})
payment_stats['avg_transaction_value'] = payment_stats['revenue'] / payment_stats['num_transactions']
payment_stats['percentage'] = payment_stats['revenue'] / payment_stats['revenue'].sum() * 100
results['payment_analysis'] = payment_stats.reset_index().to_dict(orient='records')
return results
def create_visualizations(df, results):
"""Generate visualizations based on the analysis results."""
# 1. Monthly Sales Trend
plt.figure(figsize=(12, 6))
monthly_data = pd.DataFrame(results['time_analysis']['monthly_sales'])
plt.plot(monthly_data['month_year'], monthly_data['revenue'], marker='o', linestyle='-', linewidth=2)
plt.title('Monthly Sales Trend', fontsize=14)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Revenue ($)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('output/plots/monthly_sales_trend.png')
plt.close()
# 2. Product Revenue Comparison
plt.figure(figsize=(12, 6))
product_data = pd.DataFrame(results['product_analysis']).head(5)
plt.bar(product_data['product'], product_data['revenue'], color='skyblue')
plt.title('Top 5 Products by Revenue', fontsize=14)
plt.xlabel('Product', fontsize=12)
plt.ylabel('Revenue ($)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7, axis='y')
plt.tight_layout()
plt.savefig('output/plots/top_products_revenue.png')
plt.close()
# 3. Category Distribution Pie Chart
plt.figure(figsize=(10, 8))
category_data = pd.DataFrame(results['category_analysis'])
plt.pie(category_data['revenue'], labels=category_data['category'], autopct='%1.1f%%',
startangle=90, shadow=True, explode=[0.05] * len(category_data))
plt.title('Revenue Distribution by Category', fontsize=14)
plt.axis('equal')
plt.tight_layout()
plt.savefig('output/plots/category_distribution.png')
plt.close()
# 4. Regional Sales Comparison
plt.figure(figsize=(10, 6))
region_data = pd.DataFrame(results['region_analysis'])
plt.bar(region_data['region'], region_data['revenue'], color='lightgreen')
plt.title('Sales by Region', fontsize=14)
plt.xlabel('Region', fontsize=12)
plt.ylabel('Revenue ($)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7, axis='y')
plt.tight_layout()
plt.savefig('output/plots/regional_sales.png')
plt.close()
# 5. Weekday vs Weekend Sales
plt.figure(figsize=(8, 6))
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_sales = [results['time_analysis']['day_of_week'].get(i, 0) for i in range(7)]  # keys are integer weekday numbers (0 = Monday)
colors = ['lightblue'] * 5 + ['coral'] * 2 # Weekdays in blue, weekends in coral
plt.bar(days, day_sales, color=colors)
plt.title('Sales by Day of Week', fontsize=14)
plt.xlabel('Day', fontsize=12)
plt.ylabel('Revenue ($)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7, axis='y')
plt.tight_layout()
plt.savefig('output/plots/day_of_week_sales.png')
plt.close()
# 6. Payment Method Distribution
plt.figure(figsize=(10, 6))
payment_data = pd.DataFrame(results['payment_analysis'])
plt.bar(payment_data['payment_method'], payment_data['revenue'], color='purple', alpha=0.7)
plt.title('Revenue by Payment Method', fontsize=14)
plt.xlabel('Payment Method', fontsize=12)
plt.ylabel('Revenue ($)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7, axis='y')
plt.tight_layout()
plt.savefig('output/plots/payment_method_revenue.png')
plt.close()
def export_results(df, results):
"""Export analysis results in multiple formats."""
# 1. Export summary report as JSON
with open('output/json/analysis_results.json', 'w') as f:
json.dump(results, f, indent=4, default=str)
# 2. Export key dataframes as CSV
# Product analysis
pd.DataFrame(results['product_analysis']).to_csv('output/csv/product_analysis.csv', index=False)
# Category analysis
pd.DataFrame(results['category_analysis']).to_csv('output/csv/category_analysis.csv', index=False)
# Region analysis
pd.DataFrame(results['region_analysis']).to_csv('output/csv/region_analysis.csv', index=False)
# Monthly sales
pd.DataFrame(results['time_analysis']['monthly_sales']).to_csv('output/csv/monthly_sales.csv', index=False)
# 3. Export pivot tables
# Product sales by month
product_by_month = df.pivot_table(
index='month_year',
columns='product',
values='total_amount',
aggfunc='sum',
fill_value=0
)
product_by_month.to_csv('output/csv/product_sales_by_month.csv')
# Category sales by region
category_by_region = df.pivot_table(
index='region',
columns='category',
values='total_amount',
aggfunc='sum',
fill_value=0
)
category_by_region.to_csv('output/csv/category_sales_by_region.csv')
# 4. Create a formatted summary report
report_lines = [
"RETAIL SALES ANALYSIS REPORT",
"=" * 50,
f"Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}",
"",
"SUMMARY STATISTICS",
"-" * 30,
f"Total Sales: ${results['sales_summary']['total_sales']:,.2f}",
f"Total Transactions: {results['sales_summary']['total_transactions']:,}",
f"Total Units Sold: {results['sales_summary']['total_units']:,}",
f"Average Transaction Value: ${results['sales_summary']['avg_transaction_value']:,.2f}",
"",
"TOP 5 PRODUCTS",
"-" * 30
]
# Add top products
for i, product in enumerate(results['product_analysis'][:5], 1):
report_lines.append(f"{i}. {product['product']}: ${product['revenue']:,.2f} ({product['units_sold']:,} units)")
report_lines.extend([
"",
"CATEGORY BREAKDOWN",
"-" * 30
])
# Add categories
for category in results['category_analysis']:
report_lines.append(f"{category['category']}: ${category['revenue']:,.2f} ({category['revenue']/results['sales_summary']['total_sales']*100:.1f}%)")
report_lines.extend([
"",
"REGIONAL PERFORMANCE",
"-" * 30
])
# Add regions
for region in results['region_analysis']:
report_lines.append(f"{region['region']}: ${region['revenue']:,.2f} ({region['revenue']/results['sales_summary']['total_sales']*100:.1f}%)")
# Save the report
with open('output/retail_analysis_summary.txt', 'w') as f:
f.write('\n'.join(report_lines))
# Run the comprehensive data processing script
if __name__ == "__main__":
results = process_retail_data()
print("\nProcessing complete. Check the 'output' directory for results.")
# Sample execution output:
# File retail_data.csv not found. Creating sample data for testing...
# Sample retail data created and saved to retail_data.csv
# Reading data from retail_data.csv...
# Read 1000 rows and 12 columns.
# First few rows:
# transaction_id date product category quantity price region customer_id payment_method rating month year
# 0 T81370 2022-12-31 Smartphone Electronics 2 1124.97 Central C6491 Credit Card NaN 12 2022
# 1 T46153 2022-01-22 Mouse Peripherals 2 12.76 South C5137 Credit Card 4.0 1 2022
# 2 T31822 2022-11-18 Mouse Peripherals 5 60.21 East C1405 Credit Card NaN 11 2022
# 3 T51788 2022-07-12 Mouse Peripherals 2 73.99 West C7122 Credit Card 3.0 7 2022
# 4 T55103 2023-01-25 Smartphone Electronics 3 467.15 East C1076 PayPal 1.0 1 2023
# ...
3.7 Chapter Summary and Connection to Chapter 4
In this chapter, we’ve explored two powerful libraries that significantly enhance our data processing capabilities:
Key Concepts Covered
NumPy - The foundation for numerical computing in Python
- Creating and manipulating arrays
- Vectorized operations for performance
- Basic statistical functions
- Broadcasting for flexible calculations
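To recap those NumPy ideas in a few lines, here is a small sketch (the array values are arbitrary) that centers the columns of a matrix via broadcasting and then computes vectorized statistics:

```python
import numpy as np

# Arbitrary 3x3 array of sample values
data = np.arange(9, dtype=float).reshape(3, 3)

# Broadcasting: the (3,) row of column means is subtracted from every row of the (3, 3) array
centered = data - data.mean(axis=0)

# Vectorized statistics, no explicit loops
print(centered.std(axis=0))    # per-column standard deviation
print(np.abs(centered).max())  # largest absolute deviation
```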
Pandas - The essential toolkit for data manipulation
- Series and DataFrame data structures
- Loading, cleaning, and processing tabular data
- Data aggregation and grouping
- Handling missing values and data transformations
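Likewise, a tiny Pandas recap (with made-up records) covering loading into a DataFrame, handling a missing value, type conversion, and grouping:

```python
import numpy as np
import pandas as pd

# Made-up records: one missing price, dates stored as text
df = pd.DataFrame({
    'order_date': ['2023-01-05', '2023-01-06', '2023-01-06'],
    'price': [12.5, np.nan, 20.0],
})

# Cleaning: parse dates and fill the missing price with the median
df['order_date'] = pd.to_datetime(df['order_date'])
df['price'] = df['price'].fillna(df['price'].median())

# Grouping: revenue per day
print(df.groupby(df['order_date'].dt.date)['price'].sum())
```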
Real-world Applications
These libraries form the foundation of modern data engineering workflows:
- NumPy provides the efficient numerical operations needed for data transformation
- Pandas simplifies working with tabular data from various sources
- Together, they enable the data cleaning, transformation, and analysis tasks that are central to data engineering
Connection to Chapter 4: Web Integration and APIs
In the next chapter, we’ll expand our data engineering toolkit to include web integration and APIs. This will allow us to:
- Access remote data sources through web APIs
- Fetch data from online services using HTTP requests
- Parse API responses (often in JSON format) using our Pandas skills
- Create data pipelines that integrate local and remote data
The skills you’ve learned in this chapter will be directly applicable when working with API data (a small preview sketch follows this list):
- Converting JSON API responses to Pandas DataFrames
- Cleaning and transforming data from web sources
- Combining data from multiple APIs using Pandas join operations
- Analyzing and visualizing API data with NumPy and Pandas
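As a preview, the sketch below uses a hard-coded, hypothetical payload in place of a real HTTP response (fetching data over HTTP is the subject of Chapter 4) and shows how nested JSON becomes a DataFrame:

```python
import pandas as pd

# Hypothetical API payload; in Chapter 4 this will come from an HTTP request
api_response = {
    "results": [
        {"id": 1, "product": "Laptop", "price": 999.0, "region": {"name": "North"}},
        {"id": 2, "product": "Mouse",  "price": 25.5,  "region": {"name": "South"}},
    ]
}

# Flatten the nested JSON records into a tabular DataFrame
df = pd.json_normalize(api_response["results"])
print(df[["product", "price", "region.name"]])
```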
As we move forward in our data engineering journey, NumPy and Pandas will remain constant companions, working alongside the new tools and techniques we’ll learn.