09 - Python Testing for Data Engineering
Complexity: Moderate (M)
9.0 Introduction: Why Testing Matters for Data Engineering
In data engineering, your code often serves as the foundation for critical business decisions. Incorrect data transformations, failed pipelines, or inaccurate calculations can lead to significant problems downstream. Testing ensures that your data pipelines function correctly, maintain data integrity, and remain reliable over time.
Let’s visualize how testing fits into the data engineering lifecycle:
flowchart TD
    A[Raw Data Sources] --> B[Data Extraction]
    B --> C[Data Transformation]
    C --> D[Data Loading]
    D --> E[Data Consumption]
    T1[Unit Tests] -.-> B
    T1 -.-> C
    T1 -.-> D
    T2[Integration Tests] -.->|Validates| F((Pipeline Flow))
    F -.-> B
    F -.-> C
    F -.-> D
    T3[Data Validation Tests] -.-> G((Data Quality))
    G -.-> A
    G -.-> B
    G -.-> C
    G -.-> D
    classDef process fill:#d0e0ff,stroke:#333,stroke-width:1px
    classDef test fill:#ffddaa,stroke:#333,stroke-width:1px
    classDef concern fill:#ddffdd,stroke:#333,stroke-width:1px
    class A,B,C,D,E process
    class T1,T2,T3 test
    class F,G concern
This chapter builds directly on the Python fundamentals and code quality practices we’ve covered in previous chapters. We’ll use the type annotations from Chapter 7 and the code quality tools from Chapter 8 to build a robust testing framework for data engineering applications.
By the end of this chapter, you’ll be able to:
- Create basic unit tests for data functions
- Set up test fixtures for consistent test data
- Use mocking to isolate code from external dependencies
- Test numerical accuracy for data transformations
- Build a comprehensive testing framework for data validation
Good testing practices are essential for production-quality data engineering code. Let’s dive in!
9.1 Test Fundamentals with pytest
Pytest is a popular Python testing framework that makes it easy to write simple, scalable tests. Its plain assert syntax, fixtures, and parametrization make it particularly well suited to data engineering.
9.1.1 Installing pytest
First, let’s install pytest:
# This would be run in your command line
# pip install pytest
# You can verify the installation with:
import pytest
print(f"pytest version: {pytest.__version__}")
# pytest version: 7.3.1
9.1.2 Writing Your First Test
Let’s create a simple function and write a test for it:
# Save this as data_functions.py
from typing import List, Union, Dict, Any
def calculate_average(numbers: List[Union[int, float]]) -> float:
"""Calculate the average of a list of numbers."""
if not numbers:
return 0.0
return sum(numbers) / len(numbers)
# Example usage
if __name__ == "__main__":
sample_data = [10, 15, 20, 25, 30]
avg = calculate_average(sample_data)
print(f"The average of {sample_data} is {avg}")
# The average of [10, 15, 20, 25, 30] is 20.0
Now, let’s write a simple test for this function:
# Save this as test_data_functions.py
from typing import List, Union
import pytest
from data_functions import calculate_average
def test_calculate_average() -> None:
"""Test that the calculate_average function works correctly."""
# Test with whole numbers
test_data: List[int] = [10, 20, 30, 40, 50]
result = calculate_average(test_data)
print(f"Testing with {test_data}, got {result}")
assert result == 30.0
# Test with decimal numbers
test_data_decimal: List[float] = [1.5, 2.5, 3.5]
result_decimal = calculate_average(test_data_decimal)
print(f"Testing with {test_data_decimal}, got {result_decimal}")
assert result_decimal == 2.5
# Test with empty list
empty_list: List[float] = []
result_empty = calculate_average(empty_list)
print(f"Testing with {empty_list}, got {result_empty}")
assert result_empty == 0.0
# This won't be executed when running with pytest,
# but lets us run the test directly if needed
if __name__ == "__main__":
test_calculate_average()
print("All tests passed!")
# Testing with [10, 20, 30, 40, 50], got 30.0
# Testing with [1.5, 2.5, 3.5], got 2.5
# Testing with [], got 0.0
# All tests passed!
9.1.3 Running Tests with pytest
To run the tests, you would use the pytest command in your terminal:
# From the command line
pytest test_data_functions.py -v
The -v flag provides verbose output. When you run this command, pytest will discover and run all functions that start with test_ in the specified file.
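A passing run produces output along these lines (abridged; exact formatting varies by pytest version):
# ============================= test session starts =============================
# collected 1 item
#
# test_data_functions.py::test_calculate_average PASSED                   [100%]
#
# ============================== 1 passed in 0.01s ==============================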
9.1.4 Understanding Test Results
Let’s explore what happens when tests fail:
# Save this as test_failures.py
from typing import List
import pytest
def add_numbers(a: int, b: int) -> int:
"""Add two numbers together."""
# Let's introduce a bug to see test failure
return a - b # Bug: subtraction instead of addition
def test_add_numbers() -> None:
"""Test the add_numbers function."""
result = add_numbers(5, 3)
print(f"5 + 3 = {result}")
assert result == 8, f"Expected 8 but got {result}"
def test_add_negative_numbers() -> None:
"""Test adding negative numbers."""
result = add_numbers(-2, -3)
print(f"-2 + -3 = {result}")
assert result == -5, f"Expected -5 but got {result}"
if __name__ == "__main__":
# Running directly will show us the print outputs
test_add_numbers()
# 5 + 3 = 2
# AssertionError: Expected 8 but got 2
When running with pytest, it will provide detailed reports of which tests failed and why.
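For example, running the buggy test above yields a report along these lines, with pytest's assertion rewriting showing the actual values involved (abridged; formatting varies by version):
# test_failures.py::test_add_numbers FAILED
#
#     def test_add_numbers() -> None:
#         """Test the add_numbers function."""
#         result = add_numbers(5, 3)
# >       assert result == 8, f"Expected 8 but got {result}"
# E       AssertionError: Expected 8 but got 2
# E       assert 2 == 8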
9.1.5 Organizing Multiple Tests
As your tests grow, it’s helpful to group them:
# Save this as test_calculations.py
from typing import List, Dict, Any, Union
import pytest
# Functions to test
def calculate_sum(numbers: List[Union[int, float]]) -> Union[int, float]:
"""Calculate the sum of a list of numbers."""
return sum(numbers)
def calculate_product(numbers: List[Union[int, float]]) -> Union[int, float]:
"""Calculate the product of a list of numbers."""
    if not numbers:
        return 0  # This module's convention; note the mathematical empty product is 1
result = 1
for num in numbers:
result *= num
return result
# Test class for grouping related tests
class TestCalculations:
"""Tests for calculation functions."""
def test_sum_positive_numbers(self) -> None:
"""Test sum with positive numbers."""
data = [1, 2, 3, 4, 5]
result = calculate_sum(data)
print(f"Sum of {data} = {result}")
assert result == 15
def test_sum_negative_numbers(self) -> None:
"""Test sum with negative numbers."""
data = [-1, -2, -3]
result = calculate_sum(data)
print(f"Sum of {data} = {result}")
assert result == -6
def test_product_positive_numbers(self) -> None:
"""Test product with positive numbers."""
data = [2, 3, 4]
result = calculate_product(data)
print(f"Product of {data} = {result}")
assert result == 24
def test_product_with_zero(self) -> None:
"""Test product with zero in the list."""
data = [2, 0, 4]
result = calculate_product(data)
print(f"Product of {data} = {result}")
assert result == 0
def test_product_empty_list(self) -> None:
"""Test product with empty list."""
data: List[int] = []
result = calculate_product(data)
print(f"Product of {data} = {result}")
assert result == 0
if __name__ == "__main__":
# Running the test class directly for demonstration
test = TestCalculations()
test.test_sum_positive_numbers()
test.test_sum_negative_numbers()
test.test_product_positive_numbers()
test.test_product_with_zero()
test.test_product_empty_list()
print("All tests passed!")
# Sum of [1, 2, 3, 4, 5] = 15
# Sum of [-1, -2, -3] = -6
# Product of [2, 3, 4] = 24
# Product of [2, 0, 4] = 0
# Product of [] = 0
# All tests passed!
9.1.6 Assertions in pytest
Unlike unittest, pytest needs no special assertion methods: plain assert statements are rewritten by pytest to produce detailed failure messages. The same statement handles equality, membership, type, and exception checks:
# Save as test_assertions.py
from typing import Dict, List, Any
import pytest
def test_different_assertions() -> None:
"""Demonstrate different pytest assertion methods."""
# Equal
result = 5 + 5
print(f"Testing 5 + 5 = {result} equals 10")
assert result == 10
# Not equal
value = "hello"
print(f"Testing '{value}' is not equal to 'world'")
assert value != "world"
# True/False
condition = 5 > 3
print(f"Testing 5 > 3 is {condition}")
assert condition is True
# In (membership)
items = [1, 2, 3]
print(f"Testing 2 is in {items}")
assert 2 in items
# Instance check
data_dict: Dict[str, int] = {"a": 1, "b": 2}
print(f"Testing {data_dict} is a dictionary")
assert isinstance(data_dict, dict)
# Exception check - using a context manager
def divide_by_zero() -> float:
return 1 / 0
print("Testing that dividing by zero raises ZeroDivisionError")
with pytest.raises(ZeroDivisionError):
divide_by_zero()
if __name__ == "__main__":
test_different_assertions()
print("All assertions passed!")
# Testing 5 + 5 = 10 equals 10
# Testing 'hello' is not equal to 'world'
# Testing 5 > 3 is True
# Testing 2 is in [1, 2, 3]
# Testing {'a': 1, 'b': 2} is a dictionary
# Testing that dividing by zero raises ZeroDivisionError
# All assertions passed!
9.2 Unit Testing Data Functions
Data engineering involves a lot of transformation functions that convert, clean, and manipulate data. Unit testing these functions ensures they work correctly for all expected inputs.
9.2.1 Testing Data Transformation Functions
Let’s create some typical data transformation functions and write tests for them:
# Save as data_transformations.py
from typing import Dict, List, Any, Union, Optional
def clean_numeric_column(values: List[Any]) -> List[Optional[float]]:
"""Convert a list of values to floats, replacing non-numeric values with None."""
cleaned = []
for val in values:
try:
if val is None or val == "":
cleaned.append(None)
else:
cleaned.append(float(val))
except (ValueError, TypeError):
cleaned.append(None)
return cleaned
def calculate_growth_rate(original: float, new: float) -> float:
"""Calculate the growth rate from original to new value."""
if original <= 0:
raise ValueError("Original value must be positive for growth rate calculation")
return (new - original) / original * 100.0
def categorize_values(values: List[float], thresholds: Dict[str, float]) -> List[str]:
"""Categorize values based on thresholds."""
result = []
for val in values:
for category, threshold in sorted(thresholds.items(),
key=lambda x: x[1], reverse=True):
if val >= threshold:
result.append(category)
break
else: # This else belongs to the for loop (executes if no break)
result.append("below_threshold")
return result
# Example usage
if __name__ == "__main__":
# Clean numeric column
raw_data = ["10.5", "", "not a number", "20", None, "30.5"]
cleaned = clean_numeric_column(raw_data)
print(f"Raw data: {raw_data}")
print(f"Cleaned: {cleaned}")
# Growth rate
old_value = 100
new_value = 120
growth = calculate_growth_rate(old_value, new_value)
print(f"Growth from {old_value} to {new_value}: {growth}%")
# Categorize values
values = [15, 35, 75, 5, 50]
thresholds = {"high": 70, "medium": 30, "low": 10}
categories = categorize_values(values, thresholds)
print(f"Values: {values}")
print(f"Categories: {categories}")
# Raw data: ['10.5', '', 'not a number', '20', None, '30.5']
# Cleaned: [10.5, None, None, 20.0, None, 30.5]
# Growth from 100 to 120: 20.0%
# Values: [15, 35, 75, 5, 50]
# Categories: ['low', 'medium', 'high', 'below_threshold', 'medium']
Now, let’s write unit tests for these functions:
# Save as test_data_transformations.py
from typing import Dict, List, Any, Optional
import pytest
from data_transformations import (
clean_numeric_column,
calculate_growth_rate,
categorize_values
)
class TestCleanNumericColumn:
"""Tests for the clean_numeric_column function."""
def test_valid_numbers(self) -> None:
"""Test with valid numeric strings."""
data = ["10", "20.5", "30"]
result = clean_numeric_column(data)
print(f"Cleaning {data} gives {result}")
assert result == [10.0, 20.5, 30.0]
def test_invalid_inputs(self) -> None:
"""Test with invalid inputs."""
data = ["not a number", "", None]
result = clean_numeric_column(data)
print(f"Cleaning {data} gives {result}")
assert result == [None, None, None]
def test_mixed_inputs(self) -> None:
"""Test with a mix of valid and invalid inputs."""
data = ["10.5", "not a number", "20", None]
result = clean_numeric_column(data)
print(f"Cleaning {data} gives {result}")
assert result == [10.5, None, 20.0, None]
class TestCalculateGrowthRate:
"""Tests for the calculate_growth_rate function."""
def test_positive_growth(self) -> None:
"""Test with positive growth."""
original, new = 100, 150
result = calculate_growth_rate(original, new)
print(f"Growth from {original} to {new}: {result}%")
assert result == 50.0
def test_negative_growth(self) -> None:
"""Test with negative growth (decline)."""
original, new = 100, 80
result = calculate_growth_rate(original, new)
print(f"Growth from {original} to {new}: {result}%")
assert result == -20.0
def test_zero_growth(self) -> None:
"""Test with no growth."""
original, new = 100, 100
result = calculate_growth_rate(original, new)
print(f"Growth from {original} to {new}: {result}%")
assert result == 0.0
def test_invalid_original(self) -> None:
"""Test with invalid original value."""
original, new = 0, 100
print(f"Testing growth from {original} to {new} raises ValueError")
with pytest.raises(ValueError):
calculate_growth_rate(original, new)
class TestCategorizeValues:
"""Tests for the categorize_values function."""
def test_basic_categorization(self) -> None:
"""Test basic categorization with distinct values."""
values: List[float] = [10, 40, 80]
thresholds: Dict[str, float] = {"high": 70, "medium": 30, "low": 10}
result = categorize_values(values, thresholds)
print(f"Categorizing {values} with thresholds {thresholds}")
print(f"Results: {result}")
assert result == ["low", "medium", "high"]
def test_below_all_thresholds(self) -> None:
"""Test values below all thresholds."""
values: List[float] = [5, 8, 9]
thresholds: Dict[str, float] = {"high": 70, "medium": 30, "low": 10}
result = categorize_values(values, thresholds)
print(f"Categorizing {values} with thresholds {thresholds}")
print(f"Results: {result}")
assert result == ["below_threshold", "below_threshold", "below_threshold"]
def test_edge_cases(self) -> None:
"""Test edge cases (values exactly at thresholds)."""
values: List[float] = [10, 30, 70]
thresholds: Dict[str, float] = {"high": 70, "medium": 30, "low": 10}
result = categorize_values(values, thresholds)
print(f"Categorizing {values} with thresholds {thresholds}")
print(f"Results: {result}")
assert result == ["low", "medium", "high"]
if __name__ == "__main__":
# Run tests manually (pytest would run them automatically)
# Clean numeric column tests
clean_tests = TestCleanNumericColumn()
clean_tests.test_valid_numbers()
clean_tests.test_invalid_inputs()
clean_tests.test_mixed_inputs()
# Growth rate tests
growth_tests = TestCalculateGrowthRate()
growth_tests.test_positive_growth()
growth_tests.test_negative_growth()
growth_tests.test_zero_growth()
    # test_invalid_original uses pytest.raises internally, so it returns
    # normally when the expected ValueError is raised
    growth_tests.test_invalid_original()
    print("test_invalid_original correctly caught the ValueError")
# Categorize values tests
cat_tests = TestCategorizeValues()
cat_tests.test_basic_categorization()
cat_tests.test_below_all_thresholds()
cat_tests.test_edge_cases()
print("All tests passed!")
# Cleaning ['10', '20.5', '30'] gives [10.0, 20.5, 30.0]
# Cleaning ['not a number', '', None] gives [None, None, None]
# Cleaning ['10.5', 'not a number', '20', None] gives [10.5, None, 20.0, None]
# Growth from 100 to 150: 50.0%
# Growth from 100 to 80: -20.0%
# Growth from 100 to 100: 0.0%
# Testing growth from 0 to 100 raises ValueError
# test_invalid_original correctly caught the ValueError
# Categorizing [10, 40, 80] with thresholds {'high': 70, 'medium': 30, 'low': 10}
# Results: ['low', 'medium', 'high']
# Categorizing [5, 8, 9] with thresholds {'high': 70, 'medium': 30, 'low': 10}
# Results: ['below_threshold', 'below_threshold', 'below_threshold']
# Categorizing [10, 30, 70] with thresholds {'high': 70, 'medium': 30, 'low': 10}
# Results: ['low', 'medium', 'high']
# All tests passed!
9.2.2 Testing for Expected Exceptions
In data engineering, validating that your code correctly handles invalid inputs is crucial. Pytest makes it easy to test for expected exceptions:
# Save as test_exceptions.py
from typing import List, Any
import pytest
def parse_int_safely(value: Any) -> int:
"""Parse a value to integer, raising a ValueError with a custom message if invalid."""
if value is None:
raise ValueError("Cannot parse None to int")
try:
return int(value)
except (ValueError, TypeError):
raise ValueError(f"Cannot parse '{value}' to int")
def test_parse_int_valid() -> None:
"""Test parsing valid integers."""
assert parse_int_safely(5) == 5
assert parse_int_safely("10") == 10
print("Valid integer parsing works!")
def test_parse_int_none() -> None:
"""Test that passing None raises the correct exception."""
print("Testing that parse_int_safely(None) raises ValueError")
with pytest.raises(ValueError) as excinfo:
parse_int_safely(None)
# Check that the error message is correct
assert "Cannot parse None to int" in str(excinfo.value)
print(f"Caught expected error: {excinfo.value}")
def test_parse_int_invalid() -> None:
"""Test that passing invalid values raises the correct exception."""
invalid_values = ["not a number", {"a": 1}, [1, 2, 3]]
for val in invalid_values:
print(f"Testing that parse_int_safely('{val}') raises ValueError")
with pytest.raises(ValueError) as excinfo:
parse_int_safely(val)
# Verify the error message contains the value
assert str(val) in str(excinfo.value)
print(f"Caught expected error: {excinfo.value}")
if __name__ == "__main__":
test_parse_int_valid()
test_parse_int_none()
test_parse_int_invalid()
print("All exception tests passed!")
# Valid integer parsing works!
# Testing that parse_int_safely(None) raises ValueError
# Caught expected error: Cannot parse None to int
# Testing that parse_int_safely('not a number') raises ValueError
# Caught expected error: Cannot parse 'not a number' to int
# Testing that parse_int_safely('{'a': 1}') raises ValueError
# Caught expected error: Cannot parse '{'a': 1}' to int
# Testing that parse_int_safely('[1, 2, 3]') raises ValueError
# Caught expected error: Cannot parse '[1, 2, 3]' to int
# All exception tests passed!
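pytest.raises also accepts a match parameter, a regular expression tested against the string of the raised exception, which often replaces manual excinfo checks. A minimal sketch reusing parse_int_safely from the file above:
def test_parse_int_none_with_match() -> None:
    """Same check as test_parse_int_none, using the match shortcut."""
    with pytest.raises(ValueError, match="Cannot parse None"):
        parse_int_safely(None)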
9.3 Test Fixtures and Factories
When testing data engineering code, you often need consistent test data. Pytest fixtures provide a way to set up and share test data across multiple tests.
9.3.1 Basic Test Fixtures
Fixtures are functions decorated with @pytest.fixture that provide test data or setup:
# Save as test_fixtures.py
from typing import Dict, List, Any
import pytest
@pytest.fixture
def sample_data() -> List[Dict[str, Any]]:
"""Fixture providing sample customer data for tests."""
return [
{"id": 1, "name": "Alice", "age": 30, "purchases": 150},
{"id": 2, "name": "Bob", "age": 25, "purchases": 200},
{"id": 3, "name": "Charlie", "age": 35, "purchases": 100},
{"id": 4, "name": "Diana", "age": 40, "purchases": 300},
{"id": 5, "name": "Eve", "age": 22, "purchases": 250}
]
# Functions to test
def get_customer_by_id(customers: List[Dict[str, Any]], customer_id: int) -> Dict[str, Any]:
"""Find a customer by their ID."""
for customer in customers:
if customer["id"] == customer_id:
return customer
raise ValueError(f"No customer with ID {customer_id}")
def calculate_total_purchases(customers: List[Dict[str, Any]]) -> float:
"""Calculate the total purchases across all customers."""
return sum(customer["purchases"] for customer in customers)
def get_customers_by_age_range(customers: List[Dict[str, Any]],
min_age: int, max_age: int) -> List[Dict[str, Any]]:
"""Get customers within a specific age range."""
return [c for c in customers if min_age <= c["age"] <= max_age]
# Tests using the fixture
def test_get_customer_by_id(sample_data: List[Dict[str, Any]]) -> None:
"""Test finding a customer by ID."""
customer = get_customer_by_id(sample_data, 3)
print(f"Found customer: {customer}")
assert customer["name"] == "Charlie"
assert customer["age"] == 35
def test_calculate_total_purchases(sample_data: List[Dict[str, Any]]) -> None:
"""Test calculating total purchases."""
total = calculate_total_purchases(sample_data)
print(f"Total purchases: {total}")
expected_total = sum(c["purchases"] for c in sample_data)
assert total == expected_total
assert total == 1000 # 150 + 200 + 100 + 300 + 250
def test_get_customers_by_age_range(sample_data: List[Dict[str, Any]]) -> None:
"""Test filtering customers by age range."""
young_customers = get_customers_by_age_range(sample_data, 20, 30)
print(f"Customers aged 20-30: {[c['name'] for c in young_customers]}")
assert len(young_customers) == 2
assert young_customers[0]["name"] == "Bob"
assert young_customers[1]["name"] == "Eve"
if __name__ == "__main__":
    # Fixtures cannot be called directly; for standalone execution we
    # bypass the decorator via the wrapped function (works on modern pytest)
    data = sample_data.__wrapped__()
test_get_customer_by_id(data)
test_calculate_total_purchases(data)
test_get_customers_by_age_range(data)
print("All fixture tests passed!")
# Found customer: {'id': 3, 'name': 'Charlie', 'age': 35, 'purchases': 100}
# Total purchases: 1000
# Customers aged 20-30: ['Bob', 'Eve']
# All fixture tests passed!
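When several test files need the same fixture, the usual home for it is a conftest.py in the test directory: pytest discovers fixtures there automatically, with no import needed. A sketch of moving sample_data there:
# Save as conftest.py (same directory as the test files)
from typing import Dict, List, Any
import pytest

@pytest.fixture
def sample_data() -> List[Dict[str, Any]]:
    """Shared customer data, available to every test in this directory."""
    return [
        {"id": 1, "name": "Alice", "age": 30, "purchases": 150},
        {"id": 2, "name": "Bob", "age": 25, "purchases": 200},
    ]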
9.3.2 Fixture Scope and Clean-up
Fixtures can have different scopes (function, class, module, session) and can include clean-up code:
# Save as test_fixture_scope.py
from typing import Dict, List, Any, Iterator
import pytest
import os
@pytest.fixture(scope="module")
def temp_data_file() -> Iterator[str]:
"""
Create a temporary data file for testing.
This fixture has module scope, so it runs once per test module.
"""
filename = "temp_test_data.txt"
print(f"\nCreating temporary file: {filename}")
# Setup - create the file
with open(filename, "w") as f:
f.write("line 1\n")
f.write("line 2\n")
f.write("line 3\n")
# Return the filename to the tests
yield filename
# Teardown - clean up after all tests using this fixture are done
print(f"Cleaning up: Removing {filename}")
if os.path.exists(filename):
os.remove(filename)
def count_lines_in_file(filename: str) -> int:
"""Count the number of lines in a file."""
with open(filename, "r") as f:
return len(f.readlines())
def append_to_file(filename: str, content: str) -> None:
"""Append content to a file."""
with open(filename, "a") as f:
f.write(content)
def test_count_lines(temp_data_file: str) -> None:
"""Test counting lines in the temp file."""
count = count_lines_in_file(temp_data_file)
print(f"File has {count} lines")
assert count == 3
def test_append_to_file(temp_data_file: str) -> None:
"""Test appending to the temp file."""
original_count = count_lines_in_file(temp_data_file)
append_to_file(temp_data_file, "line 4\n")
new_count = count_lines_in_file(temp_data_file)
print(f"File had {original_count} lines, now has {new_count} lines")
assert new_count == original_count + 1
if __name__ == "__main__":
    # Manual test execution: bypass the fixture decorator and drive the
    # underlying generator by hand (works on modern pytest)
    gen = temp_data_file.__wrapped__()
    file = next(gen)  # Run the setup code and get the yielded filename
    try:
        test_count_lines(file)
        test_append_to_file(file)
        print("All fixture scope tests passed!")
    finally:
        # Exhaust the generator so the teardown code runs
        try:
            next(gen)
        except StopIteration:
            pass
# Creating temporary file: temp_test_data.txt
# File has 3 lines
# File had 3 lines, now has 4 lines
# All fixture scope tests passed!
# Cleaning up: Removing temp_test_data.txt
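The module-scoped fixture above is created once per test file. The other scopes are declared the same way; a short sketch (assuming the usual typing imports, with a hypothetical session-wide config):
@pytest.fixture  # default scope="function": re-created for every test
def fresh_record() -> Dict[str, Any]:
    return {"id": 1, "name": "Alice"}

@pytest.fixture(scope="session")
def shared_config() -> Dict[str, str]:
    # Created once for the entire test session and shared by all tests
    return {"env": "test", "region": "local"}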
9.3.3 Parametrized Fixtures
The most common way to generate multiple test cases is @pytest.mark.parametrize, which runs a single test function once per input set; fixtures themselves can also be parametrized, as shown at the end of this section:
# Save as test_parametrized.py
from typing import List, Dict, Any, Tuple
import pytest
# Function to test
def get_stats(numbers: List[float]) -> Dict[str, float]:
"""Calculate basic statistics for a list of numbers."""
if not numbers:
return {"count": 0, "sum": 0, "avg": 0, "min": 0, "max": 0}
return {
"count": len(numbers),
"sum": sum(numbers),
"avg": sum(numbers) / len(numbers),
"min": min(numbers),
"max": max(numbers)
}
# Test with parametrization
@pytest.mark.parametrize("test_input,expected", [
([1, 2, 3, 4, 5], {"count": 5, "sum": 15, "avg": 3.0, "min": 1, "max": 5}),
([10], {"count": 1, "sum": 10, "avg": 10.0, "min": 10, "max": 10}),
([], {"count": 0, "sum": 0, "avg": 0, "min": 0, "max": 0})
])
def test_get_stats(test_input: List[float], expected: Dict[str, float]) -> None:
"""Test statistics calculation with different inputs."""
result = get_stats(test_input)
print(f"Stats for {test_input}: {result}")
assert result == expected
if __name__ == "__main__":
# Test with each set of parameters
test_cases = [
([1, 2, 3, 4, 5], {"count": 5, "sum": 15, "avg": 3.0, "min": 1, "max": 5}),
([10], {"count": 1, "sum": 10, "avg": 10.0, "min": 10, "max": 10}),
([], {"count": 0, "sum": 0, "avg": 0, "min": 0, "max": 0})
]
for test_input, expected in test_cases:
test_get_stats(test_input, expected)
print("All parametrized tests passed!")
# Stats for [1, 2, 3, 4, 5]: {'count': 5, 'sum': 15, 'avg': 3.0, 'min': 1, 'max': 5}
# Stats for [10]: {'count': 1, 'sum': 10, 'avg': 10.0, 'min': 10, 'max': 10}
# Stats for []: {'count': 0, 'sum': 0, 'avg': 0, 'min': 0, 'max': 0}
# All parametrized tests passed!
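@pytest.mark.parametrize parametrizes the test itself. A fixture can also be parametrized by passing params to @pytest.fixture and reading request.param; every test that uses the fixture then runs once per parameter. A minimal sketch against get_stats from above:
@pytest.fixture(params=[[1, 2, 3, 4, 5], [10], []])
def number_list(request: pytest.FixtureRequest) -> List[float]:
    """Provides each params entry in turn to the tests that use it."""
    return request.param

def test_stats_count_matches_length(number_list: List[float]) -> None:
    assert get_stats(number_list)["count"] == len(number_list)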
9.3.4 Factory Fixtures for Dynamic Test Data
Sometimes you need to generate test data dynamically based on test parameters:
# Save as test_factory_fixtures.py
from typing import List, Dict, Any, Callable
import pytest
import random
import string
# Define a factory fixture
@pytest.fixture
def generate_customer_data() -> Callable[[int], List[Dict[str, Any]]]:
"""
Factory fixture that returns a function to generate customer data.
This demonstrates how to create parameterized test data.
"""
def _generate(num_customers: int) -> List[Dict[str, Any]]:
"""Generate a specified number of random customer records."""
customers = []
for i in range(1, num_customers + 1):
# Generate random customer data
name_length = random.randint(5, 10)
random_name = ''.join(random.choice(string.ascii_letters)
for _ in range(name_length))
customers.append({
"id": i,
"name": random_name,
"age": random.randint(18, 70),
"purchases": random.randint(50, 500)
})
return customers
return _generate
# Function to test
def get_high_value_customers(customers: List[Dict[str, Any]],
threshold: float = 200.0) -> List[Dict[str, Any]]:
"""Get customers with purchases above the threshold."""
return [c for c in customers if c["purchases"] >= threshold]
# Use the factory fixture in tests
def test_high_value_few_customers(generate_customer_data: Callable[[int], List[Dict[str, Any]]]) -> None:
"""Test with a small number of customers."""
# Generate 5 random customers
customers = generate_customer_data(5)
print(f"Generated {len(customers)} random customers")
# Count how many are high value based on our definition
high_value_count = sum(1 for c in customers if c["purchases"] >= 200)
# Test our function
high_value = get_high_value_customers(customers)
print(f"Found {len(high_value)} high value customers")
# Verify the result
assert len(high_value) == high_value_count
for customer in high_value:
assert customer["purchases"] >= 200
def test_high_value_many_customers(generate_customer_data: Callable[[int], List[Dict[str, Any]]]) -> None:
"""Test with a larger number of customers."""
# Generate 20 random customers
customers = generate_customer_data(20)
print(f"Generated {len(customers)} random customers")
# Count high value customers
high_value_count = sum(1 for c in customers if c["purchases"] >= 200)
# Test with different threshold
threshold = 300
high_value = get_high_value_customers(customers, threshold)
high_value_expected = sum(1 for c in customers if c["purchases"] >= threshold)
print(f"Found {len(high_value)} customers with purchases >= {threshold}")
assert len(high_value) == high_value_expected
if __name__ == "__main__":
    # For direct execution, bypass the fixture decorator to obtain the
    # factory function (works on modern pytest)
    factory = generate_customer_data.__wrapped__()
# Set random seed for reproducibility in example
random.seed(42)
test_high_value_few_customers(factory)
test_high_value_many_customers(factory)
print("All factory fixture tests passed!")
# Generated 5 random customers
# Found 3 high value customers
# Generated 20 random customers
# Found 3 customers with purchases >= 300
# All factory fixture tests passed!
9.4 Mock Objects for External Dependencies
When testing data engineering code, you often need to isolate your tests from external dependencies like databases, APIs, or file systems. Mocking allows you to replace these dependencies with controlled objects.
9.4.1 Basic Mocking with monkeypatch
First, let’s create a module that depends on external systems:
# Save as data_loader.py
from typing import Dict, List, Any, Optional
import os
import csv
import json
import requests
def load_from_csv(filename: str) -> List[Dict[str, Any]]:
"""Load data from a CSV file."""
if not os.path.exists(filename):
raise FileNotFoundError(f"File not found: {filename}")
with open(filename, 'r', newline='') as f:
reader = csv.DictReader(f)
return list(reader)
def load_from_api(url: str, api_key: Optional[str] = None) -> Dict[str, Any]:
"""Load data from an API endpoint."""
headers = {}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
response = requests.get(url, headers=headers)
response.raise_for_status() # Raise exception for HTTP errors
return response.json()
def save_to_json(data: Dict[str, Any], filename: str) -> None:
"""Save data to a JSON file."""
with open(filename, 'w') as f:
json.dump(data, f, indent=2)
def process_customer_data(source_type: str, source: str) -> Dict[str, Any]:
"""
Process customer data from either CSV file or API.
Returns summary statistics.
"""
customers = []
if source_type == "csv":
customers = load_from_csv(source)
elif source_type == "api":
api_data = load_from_api(source)
customers = api_data.get("customers", [])
else:
raise ValueError(f"Unsupported source type: {source_type}")
# Process the data
total_customers = len(customers)
if total_customers == 0:
return {"total": 0, "avg_age": 0, "avg_purchases": 0}
# Extract and convert numeric fields
ages = []
purchases = []
for customer in customers:
try:
age = float(customer.get("age", 0))
purchase = float(customer.get("purchases", 0))
ages.append(age)
purchases.append(purchase)
except (ValueError, TypeError):
# Skip records with invalid numeric data
pass
# Calculate statistics
avg_age = sum(ages) / len(ages) if ages else 0
avg_purchases = sum(purchases) / len(purchases) if purchases else 0
return {
"total": total_customers,
"avg_age": avg_age,
"avg_purchases": avg_purchases
}
# Example usage (commented out to avoid actual execution)
# if __name__ == "__main__":
# # This would actually try to access external resources
# result = process_customer_data("csv", "customers.csv")
# print(result)
#
# result = process_customer_data("api", "https://api.example.com/customers")
# print(result)
Now, let’s test this module using mocks to avoid actual external dependencies:
# Save as test_data_loader.py
from typing import Dict, List, Any, Optional, Callable
import pytest
from unittest.mock import patch, mock_open, MagicMock
import json
import data_loader
def test_process_customer_data_csv(monkeypatch: pytest.MonkeyPatch) -> None:
"""Test processing CSV data with mocked file."""
# Create mock CSV data
mock_csv_data = [
{"name": "Alice", "age": "30", "purchases": "150"},
{"name": "Bob", "age": "25", "purchases": "200"},
{"name": "Charlie", "age": "35", "purchases": "100"}
]
# Mock the load_from_csv function
def mock_load_from_csv(filename: str) -> List[Dict[str, Any]]:
print(f"Mocked load_from_csv called with {filename}")
return mock_csv_data
# Apply the mock
monkeypatch.setattr(data_loader, "load_from_csv", mock_load_from_csv)
# Call the function we're testing
result = data_loader.process_customer_data("csv", "fake_file.csv")
print(f"Process result: {result}")
# Verify results
assert result["total"] == 3
assert result["avg_age"] == (30 + 25 + 35) / 3
assert result["avg_purchases"] == (150 + 200 + 100) / 3
def test_process_customer_data_api(monkeypatch: pytest.MonkeyPatch) -> None:
"""Test processing API data with mocked requests."""
# Create mock API response
mock_api_data = {
"customers": [
{"name": "Diana", "age": "40", "purchases": "300"},
{"name": "Eve", "age": "22", "purchases": "250"}
]
}
# Mock the load_from_api function
def mock_load_from_api(url: str, api_key: Optional[str] = None) -> Dict[str, Any]:
print(f"Mocked load_from_api called with {url}")
return mock_api_data
# Apply the mock
monkeypatch.setattr(data_loader, "load_from_api", mock_load_from_api)
# Call the function we're testing
result = data_loader.process_customer_data("api", "https://fake-api.example.com")
print(f"Process result: {result}")
# Verify results
assert result["total"] == 2
assert result["avg_age"] == (40 + 22) / 2
assert result["avg_purchases"] == (300 + 250) / 2
def test_invalid_source_type() -> None:
"""Test that an invalid source type raises ValueError."""
with pytest.raises(ValueError) as excinfo:
data_loader.process_customer_data("invalid_type", "some_source")
print(f"Caught expected error: {excinfo.value}")
assert "Unsupported source type" in str(excinfo.value)
if __name__ == "__main__":
    # Run tests manually; MonkeyPatch.context() undoes the patches afterwards
    with pytest.MonkeyPatch.context() as mp:
        test_process_customer_data_csv(mp)
    with pytest.MonkeyPatch.context() as mp:
        test_process_customer_data_api(mp)
    test_invalid_source_type()
    print("All mock tests passed!")
# Mocked load_from_csv called with fake_file.csv
# Process result: {'total': 3, 'avg_age': 30.0, 'avg_purchases': 150.0}
# Mocked load_from_api called with https://fake-api.example.com
# Process result: {'total': 2, 'avg_age': 31.0, 'avg_purchases': 275.0}
# Caught expected error: Unsupported source type: invalid_type
# All mock tests passed!
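Beyond module attributes, monkeypatch can patch environment variables and dictionary entries, which is handy for pipeline configuration. A small sketch (the variable name is hypothetical):
import os
import pytest

def test_api_key_from_env(monkeypatch: pytest.MonkeyPatch) -> None:
    """setenv patches os.environ for the duration of the test only."""
    monkeypatch.setenv("CUSTOMER_API_KEY", "test-key-123")
    assert os.environ["CUSTOMER_API_KEY"] == "test-key-123"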
9.4.2 Mocking with patch Decorator
The patch decorator provides another way to mock objects:
# Save as test_patch_decorator.py
from typing import Dict, List, Union, Protocol
from unittest.mock import patch, MagicMock
import data_loader
# Define a protocol for the mock_load_csv parameter instead of using Any
# A Protocol defines an interface that the mock will satisfy at runtime
class CSVLoaderProtocol(Protocol):
"""Protocol defining the interface for the CSV loader function mock."""
def __call__(self, filename: str) -> List[Dict[str, str]]: ...
def assert_called_once_with(self, filename: str) -> None: ...
@patch('data_loader.load_from_csv')
def test_csv_with_patch_decorator(mock_load_csv: CSVLoaderProtocol) -> None:
"""Test using the patch decorator to mock CSV loading."""
# Configure the mock
# We use MagicMock at runtime, but typehint with our Protocol
mock_load_csv.return_value = [ # type: ignore
{"name": "Alice", "age": "30", "purchases": "150"},
{"name": "Bob", "age": "25", "purchases": "200"}
]
# Call the function under test
result = data_loader.process_customer_data("csv", "test.csv")
# Verify the mock was called with the right arguments
mock_load_csv.assert_called_once_with("test.csv")
print(f"Mock called with: {mock_load_csv.call_args}")
# Check the result
print(f"Process result: {result}")
assert result["total"] == 2
assert result["avg_age"] == 27.5 # (30 + 25) / 2
assert result["avg_purchases"] == 175.0 # (150 + 200) / 2
# Response protocol for mocking HTTP responses
class HTTPResponseProtocol(Protocol):
"""Protocol defining the interface for HTTP response mocks."""
def json(self) -> Dict[str, Union[List[Dict[str, str]], str, int]]: ...
def raise_for_status(self) -> None: ...
# Requests get function protocol
class RequestsGetProtocol(Protocol):
"""Protocol defining the interface for requests.get mock."""
def __call__(self, url: str, headers: Dict[str, str]) -> HTTPResponseProtocol: ...
def assert_called_once_with(self, url: str, headers: Dict[str, str]) -> None: ...
@patch('requests.get')
def test_api_with_patch_decorator(mock_get: RequestsGetProtocol) -> None:
"""Test using the patch decorator to mock requests.get."""
# Configure the mock response
mock_response = MagicMock()
mock_response.json.return_value = {
"customers": [
{"name": "Charlie", "age": "35", "purchases": "100"}
]
}
mock_response.raise_for_status = MagicMock() # No-op mock
mock_get.return_value = mock_response # type: ignore
# Call the function under test
result = data_loader.load_from_api("https://api.example.com")
# Verify the mock was called correctly
mock_get.assert_called_once_with("https://api.example.com", headers={})
print(f"Mock called with: {mock_get.call_args}")
# Check the result
print(f"API result: {result}")
assert "customers" in result
assert len(result["customers"]) == 1
assert result["customers"][0]["name"] == "Charlie"
if __name__ == "__main__":
    # The @patch decorator applies the mocks and injects them when the
    # functions are called directly, so no arguments are needed
    test_csv_with_patch_decorator()
    test_api_with_patch_decorator()
    print("All patch decorator tests passed!")
# Mock called with: call('test.csv')
# Process result: {'total': 2, 'avg_age': 27.5, 'avg_purchases': 175.0}
# Mock called with: call('https://api.example.com', headers={})
# API result: {'customers': [{'name': 'Charlie', 'age': '35', 'purchases': '100'}]}
# All patch decorator tests passed!
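patch also works as a context manager, which keeps the mock's scope explicit and avoids the decorator's argument injection. A sketch of the same CSV scenario:
from unittest.mock import patch
import data_loader

def test_csv_with_patch_context_manager() -> None:
    """Same scenario as above, with patch as a context manager."""
    with patch('data_loader.load_from_csv') as mock_load_csv:
        mock_load_csv.return_value = [
            {"name": "Alice", "age": "30", "purchases": "150"}
        ]
        result = data_loader.process_customer_data("csv", "test.csv")
        mock_load_csv.assert_called_once_with("test.csv")
    # The patch is undone here; assertions on the result still work
    assert result["total"] == 1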
9.4.3 Mocking File Operations
File operations are common in data engineering. Here’s how to mock them:
# Save as test_mock_files.py
from typing import Dict, List, Any, Optional, Callable
import pytest
from unittest.mock import patch, mock_open
import json
import data_loader
@patch('builtins.open', new_callable=mock_open, read_data='name,age,purchases\nAlice,30,150\nBob,25,200')
def test_load_from_csv_with_mock_open(mock_file: Any) -> None:
"""Test loading CSV using mock_open."""
# Call the function under test
result = data_loader.load_from_csv("fake.csv")
# Verify the mock was called correctly
mock_file.assert_called_once_with('fake.csv', 'r', newline='')
print(f"Mock file opened with: {mock_file.call_args}")
# Check the result
print(f"CSV result: {result}")
assert len(result) == 2
assert result[0]['name'] == 'Alice'
assert result[1]['name'] == 'Bob'
@patch('json.dump')
@patch('builtins.open', new_callable=mock_open)
def test_save_to_json(mock_file: Any, mock_json_dump: Any) -> None:
"""Test saving to JSON using mock_open and mocked json.dump."""
# Data to save
test_data = {"key": "value", "numbers": [1, 2, 3]}
# Call the function under test
data_loader.save_to_json(test_data, "output.json")
# Verify the mock was called correctly
mock_file.assert_called_once_with('output.json', 'w')
mock_json_dump.assert_called_once()
# Get the arguments passed to json.dump
args, kwargs = mock_json_dump.call_args
print(f"json.dump called with data: {args[0]}")
print(f"json.dump called with file: {args[1]}")
print(f"json.dump called with kwargs: {kwargs}")
# Check that the right data was passed
assert args[0] == test_data
assert 'indent' in kwargs
assert kwargs['indent'] == 2
if __name__ == "__main__":
    # The @patch decorators inject the mocks automatically, even when
    # the tests are called directly
    test_load_from_csv_with_mock_open()
    test_save_to_json()
    print("All file mock tests passed!")
# Mock file opened with: call('fake.csv', 'r', newline='')
# CSV result: [{'name': 'Alice', 'age': '30', 'purchases': '150'}, {'name': 'Bob', 'age': '25', 'purchases': '200'}]
# json.dump called with data: {'key': 'value', 'numbers': [1, 2, 3]}
# json.dump called with file: <MagicMock name='open()' id='...'>
# json.dump called with kwargs: {'indent': 2}
# All file mock tests passed!
9.5 Testing Numerical Accuracy
Data engineering often involves numerical calculations. Testing them requires special considerations for floating-point accuracy.
9.5.1 Comparing Floating Point Numbers
Direct equality comparison can be problematic with floating-point numbers:
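# A quick illustration of the problem: many decimal fractions have no
# exact binary representation, so arithmetic introduces tiny errors
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)  # 0.30000000000000004
pytest.approx, shown below, absorbs these tiny differences: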
# Save as test_numerical_accuracy.py
from typing import Dict, List, Union, Optional, Tuple, TypeVar, cast
import pytest
import math
def calculate_statistics(values: List[float]) -> Dict[str, float]:
"""Calculate statistics for a list of values."""
n = len(values)
if n == 0:
return {"mean": 0.0, "std_dev": 0.0, "median": 0.0}
# Calculate mean
mean = sum(values) / n
# Calculate standard deviation
variance = sum((x - mean) ** 2 for x in values) / n
std_dev = math.sqrt(variance)
# Calculate median
sorted_values = sorted(values)
if n % 2 == 0:
# Even number of values
median = (sorted_values[n//2 - 1] + sorted_values[n//2]) / 2
else:
# Odd number of values
median = sorted_values[n//2]
return {"mean": mean, "std_dev": std_dev, "median": median}
def test_statistics_exact_comparison() -> None:
"""
Test statistics calculation with exact comparisons.
This might fail due to floating-point precision issues.
"""
values = [1.0, 2.0, 3.0, 4.0, 5.0]
result = calculate_statistics(values)
print(f"Statistics for {values}: {result}")
# These exact comparisons might be risky for floating-point values
# Mean is an exact integer here, so direct comparison works
expected_mean = 3.0
expected_std_dev = math.sqrt(2.0) # This will be an irrational number
expected_median = 3.0 # Also exact in this case
assert result["mean"] == expected_mean
# This might fail due to floating-point precision, so we don't assert it directly:
# assert result["std_dev"] == expected_std_dev
assert result["median"] == expected_median
def test_statistics_approx_comparison() -> None:
"""
Test statistics calculation with approximate comparisons.
    pytest.approx is the safer choice when comparing floating-point
    values, since it tolerates tiny precision differences.
"""
values = [1.0, 2.0, 3.0, 4.0, 5.0]
result = calculate_statistics(values)
print(f"Statistics for {values}: {result}")
# Better approach using pytest.approx for floating-point comparison
expected_mean = 3.0
expected_std_dev = math.sqrt(2.0) # approximately 1.414
expected_median = 3.0
# pytest.approx allows for small floating-point differences
assert result["mean"] == pytest.approx(expected_mean)
assert result["std_dev"] == pytest.approx(expected_std_dev)
assert result["median"] == pytest.approx(expected_median)
# You can also specify tolerance explicitly when needed
assert result["std_dev"] == pytest.approx(1.414, abs=1e-3)
print(f"Expected std_dev: {expected_std_dev}, actual: {result['std_dev']}")
print(f"Difference: {abs(expected_std_dev - result['std_dev'])}")
if __name__ == "__main__":
test_statistics_exact_comparison()
test_statistics_approx_comparison()
print("All numerical tests passed!")
# Statistics for [1.0, 2.0, 3.0, 4.0, 5.0]: {'mean': 3.0, 'std_dev': 1.4142135623730951, 'median': 3.0}
# Statistics for [1.0, 2.0, 3.0, 4.0, 5.0]: {'mean': 3.0, 'std_dev': 1.4142135623730951, 'median': 3.0}
# Expected std_dev: 1.4142135623730951, actual: 1.4142135623730951
# Difference: 0.0
# All numerical tests passed!
9.5.2 Testing with NumPy and Pandas
For more complex numerical calculations, NumPy and Pandas provide useful tools:
# Save as test_numpy_calculations.py
from typing import List, Dict, Union, Optional, TypeVar, Callable, Protocol, Generic
import pytest
import numpy as np
import pandas as pd
# Define a constrained type variable for numeric types.
# Restricting the function to int and float inputs enforces type
# safety for numeric operations, unlike Any.
NumericType = TypeVar('NumericType', int, float)
def calculate_correlation(x: List[NumericType], y: List[NumericType]) -> float:
"""
Calculate the Pearson correlation coefficient between two variables.
    A constrained TypeVar ensures type safety for numeric operations.
"""
if len(x) != len(y) or len(x) == 0:
raise ValueError("Input lists must have the same non-zero length")
# Convert to NumPy arrays for calculation
# NumPy operations are safe with our NumericType bound
x_array = np.array(x)
y_array = np.array(y)
# Calculate correlation
correlation = np.corrcoef(x_array, y_array)[0, 1]
return correlation
def test_perfect_positive_correlation() -> None:
"""Test correlation calculation with perfect positive correlation."""
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10] # y = 2*x
result = calculate_correlation(x, y)
print(f"Correlation between {x} and {y}: {result}")
# Perfect positive correlation should be 1.0
# We use pytest.approx to account for tiny floating-point differences
assert result == pytest.approx(1.0)
def test_perfect_negative_correlation() -> None:
"""Test correlation calculation with perfect negative correlation."""
x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2] # y = -2*x + 12
result = calculate_correlation(x, y)
print(f"Correlation between {x} and {y}: {result}")
# Perfect negative correlation should be -1.0
assert result == pytest.approx(-1.0)
def test_no_correlation() -> None:
"""Test correlation calculation with uncorrelated data."""
# Using a contrived example where x and y are independent
x = [1, 2, 3, 4, 5]
y = [5, 2, 8, 1, 7]
result = calculate_correlation(x, y)
print(f"Correlation between {x} and {y}: {result}")
# Calculate the expected correlation using pandas for comparison
# This serves as a reference implementation to validate our function
df = pd.DataFrame({'x': x, 'y': y})
expected = df.corr().loc['x', 'y']
print(f"Expected correlation from pandas: {expected}")
# Check that our calculation matches pandas
assert result == pytest.approx(expected)
def test_invalid_inputs() -> None:
"""Test that invalid inputs raise appropriate errors."""
# Different length lists
with pytest.raises(ValueError) as excinfo:
calculate_correlation([1, 2, 3], [4, 5])
print(f"Caught expected error: {excinfo.value}")
# Empty lists
with pytest.raises(ValueError) as excinfo:
calculate_correlation([], [])
print(f"Caught expected error: {excinfo.value}")
if __name__ == "__main__":
test_perfect_positive_correlation()
test_perfect_negative_correlation()
test_no_correlation()
test_invalid_inputs()
print("All NumPy calculation tests passed!")
# Correlation between [1, 2, 3, 4, 5] and [2, 4, 6, 8, 10]: 1.0
# Correlation between [1, 2, 3, 4, 5] and [10, 8, 6, 4, 2]: -1.0
# Correlation between [1, 2, 3, 4, 5] and [5, 2, 8, 1, 7]: -0.25742899889268337
# Expected correlation from pandas: -0.25742900276422214
# Caught expected error: Input lists must have the same non-zero length
# Caught expected error: Input lists must have the same non-zero length
# All NumPy calculation tests passed!
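For array-valued results, NumPy's testing helpers compare element-wise with tolerances, which is more convenient than looping with pytest.approx. A minimal sketch (the normalization example is illustrative):
import numpy as np

def test_normalized_array_sums_to_one() -> None:
    values = np.array([1.0, 2.0, 3.0, 4.0])
    normalized = values / values.sum()
    # Element-wise comparison with relative tolerance
    np.testing.assert_allclose(normalized.sum(), 1.0, rtol=1e-9)
    np.testing.assert_allclose(normalized * values.sum(), values)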
9.6 Micro-Project: Data Validation Testing Framework
Now let’s apply what we’ve learned to create a comprehensive test suite for a data validation module.
Project Requirements
In this micro-project, we’ll create a Python module that validates data against various quality checks, along with a comprehensive test suite to ensure its reliability. The module should:
- Validate various data types (string, numeric, date)
- Check value ranges for numeric fields
- Verify required fields are present
- Validate formats (like email, phone numbers)
Acceptance Criteria
- Test suite includes at least 3 unit tests per validation function
- Integration tests cover at least 3 multi-validation scenarios
- Property-based tests demonstrate testing with many generated inputs
- All tests pass consistently and run in under 30 seconds
- Test coverage reports show at least 90% coverage (see the coverage command sketch after this list)
- Test suite includes fixtures for reusable test data
- Tests handle both valid and invalid inputs appropriately
- README documents the testing approach and how to run the tests
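The coverage criterion can be checked with the pytest-cov plugin (a separate install; these are its standard flags):
# From the command line
# pip install pytest-cov
# pytest test_data_validator.py --cov=data_validator --cov-report=term-missing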
Common Pitfalls to Avoid
Tests Coupled to Implementation Details:
- Problem: Tests that break when implementation changes, even if behavior stays the same
- Solution: Focus on testing behavior and outcomes rather than implementation specifics
Slow-Running Tests:
- Problem: Tests that take too long to run, discouraging frequent testing
- Solution: Use mocks and in-memory fixtures instead of external dependencies
Brittle Tests:
- Problem: Tests that break with incidental changes, such as reordered fields or reworded messages
- Solution: Assert on stable outcomes (counts, types, ranges) and business requirements rather than incidental details (see the sketch after this list)
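As a concrete contrast, reusing the monkeypatch pattern and the data_loader module from section 9.4 (the inline lambda is a stand-in loader), prefer assertions on the observable result over assertions about which helper was called:
import pytest
import data_loader

def test_behavior_not_implementation(monkeypatch: pytest.MonkeyPatch) -> None:
    """Assert on the returned summary, not on internal call patterns."""
    monkeypatch.setattr(
        data_loader, "load_from_csv",
        lambda filename: [{"name": "Alice", "age": "30", "purchases": "150"}],
    )
    result = data_loader.process_customer_data("csv", "any.csv")
    # Behavior-focused assertion: survives refactors of the internals,
    # unlike asserting that load_from_csv itself was called
    assert result["total"] == 1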
How This Differs from Production-Grade Solutions
While our micro-project demonstrates key testing concepts for data validation, production-grade validation frameworks would typically have:
More Extensive Validation Rules:
- Production: Complex domain-specific validations, schema enforcement
- Micro-project: Basic type and format validations
Performance Optimization:
- Production: Optimized for large datasets with parallel processing
- Micro-project: Focused on correctness rather than performance
Integration with Data Pipelines:
- Production: Hooks into data pipeline frameworks like Airflow
- Micro-project: Standalone validation
Reporting and Monitoring:
- Production: Detailed reports, metrics, alerts for validation failures
- Micro-project: Simple pass/fail validation
Implementation
First, let’s create our data validation module:
# Save as data_validator.py
from typing import Any, Dict, List, Optional, Union, Callable, Pattern
import re
from datetime import datetime
class ValidationError(Exception):
"""Custom exception for validation errors."""
pass
def validate_required(data: Dict[str, Any], field: str) -> None:
"""
Validate that a required field is present and not None or empty.
Args:
data: Dictionary containing the data to validate
field: Name of the field to check
Raises:
ValidationError: If the field is missing, None, or empty
"""
if field not in data:
raise ValidationError(f"Required field '{field}' is missing")
value = data[field]
if value is None:
raise ValidationError(f"Required field '{field}' cannot be None")
# For strings, check if empty
if isinstance(value, str) and value.strip() == "":
raise ValidationError(f"Required field '{field}' cannot be empty")
# For lists or dicts, check if empty
if (isinstance(value, (list, dict, tuple)) and len(value) == 0):
raise ValidationError(f"Required field '{field}' cannot be empty")
def validate_type(data: Dict[str, Any], field: str,
expected_type: Union[type, List[type]]) -> None:
"""
Validate that a field is of the expected type.
Args:
data: Dictionary containing the data to validate
field: Name of the field to check
expected_type: Type or list of types that the field should match
Raises:
ValidationError: If the field is not of the expected type
"""
if field not in data:
return # Skip validation for missing fields
value = data[field]
if value is None:
return # Skip validation for None values
# Handle multiple allowed types
if isinstance(expected_type, list):
if not any(isinstance(value, t) for t in expected_type):
type_names = [t.__name__ for t in expected_type]
raise ValidationError(
f"Field '{field}' with value '{value}' is not of expected types: {type_names}"
)
else:
if not isinstance(value, expected_type):
raise ValidationError(
f"Field '{field}' with value '{value}' is not of type {expected_type.__name__}"
)
def validate_numeric_range(data: Dict[str, Any], field: str,
min_value: Optional[float] = None,
max_value: Optional[float] = None) -> None:
"""
Validate that a numeric field is within the specified range.
Args:
data: Dictionary containing the data to validate
field: Name of the field to check
min_value: Minimum allowed value (inclusive)
max_value: Maximum allowed value (inclusive)
Raises:
ValidationError: If the field is not within the specified range
"""
if field not in data:
return # Skip validation for missing fields
value = data[field]
if value is None:
return # Skip validation for None values
# Try to convert to float if it's not already a number
if not isinstance(value, (int, float)):
try:
value = float(value)
except (ValueError, TypeError):
raise ValidationError(
f"Field '{field}' with value '{value}' cannot be converted to a number"
)
# Check range
if min_value is not None and value < min_value:
raise ValidationError(
f"Field '{field}' with value {value} is less than minimum {min_value}"
)
if max_value is not None and value > max_value:
raise ValidationError(
f"Field '{field}' with value {value} is greater than maximum {max_value}"
)
def validate_string_pattern(data: Dict[str, Any], field: str,
pattern: Union[str, Pattern]) -> None:
"""
Validate that a string field matches the specified regex pattern.
Args:
data: Dictionary containing the data to validate
field: Name of the field to check
pattern: Regular expression pattern the field should match
Raises:
ValidationError: If the field doesn't match the pattern
"""
if field not in data:
return # Skip validation for missing fields
value = data[field]
if value is None:
return # Skip validation for None values
# Convert to string if it's not already
if not isinstance(value, str):
value = str(value)
# Compile pattern if it's a string
if isinstance(pattern, str):
pattern = re.compile(pattern)
# Check match
if not pattern.match(value):
raise ValidationError(
f"Field '{field}' with value '{value}' does not match the required pattern"
)
def validate_date_format(data: Dict[str, Any], field: str,
format_str: str = "%Y-%m-%d") -> None:
"""
Validate that a string field contains a date in the specified format.
Args:
data: Dictionary containing the data to validate
field: Name of the field to check
format_str: Date format string (default: YYYY-MM-DD)
Raises:
ValidationError: If the field doesn't contain a valid date
"""
if field not in data:
return # Skip validation for missing fields
value = data[field]
if value is None:
return # Skip validation for None values
# Convert to string if it's not already
if not isinstance(value, str):
value = str(value)
# Try to parse date
try:
datetime.strptime(value, format_str)
except ValueError:
raise ValidationError(
f"Field '{field}' with value '{value}' is not a valid date in format '{format_str}'"
)
def validate_email(data: Dict[str, Any], field: str) -> None:
"""
Validate that a field contains a valid email address.
Args:
data: Dictionary containing the data to validate
field: Name of the field to check
Raises:
ValidationError: If the field doesn't contain a valid email
"""
# Simple regex for email validation
email_pattern = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")
validate_string_pattern(data, field, email_pattern)
def validate_one_of(data: Dict[str, Any], field: str,
valid_values: List[Any]) -> None:
"""
Validate that a field's value is one of the specified values.
Args:
data: Dictionary containing the data to validate
field: Name of the field to check
valid_values: List of allowed values
Raises:
ValidationError: If the field's value is not in valid_values
"""
if field not in data:
return # Skip validation for missing fields
value = data[field]
if value is None:
return # Skip validation for None values
if value not in valid_values:
raise ValidationError(
f"Field '{field}' with value '{value}' is not one of {valid_values}"
)
def validate_record(data: Dict[str, Any],
validations: List[Dict[str, Any]]) -> List[str]:
"""
Apply multiple validations to a data record.
Args:
data: Dictionary containing the data to validate
validations: List of validation dictionaries
Returns:
List of error messages (empty if all validations pass)
"""
errors = []
for validation in validations:
try:
            # Extract validation parameters; read the field first so the
            # generic error handler below can always reference it
            field = validation.get("field", "<unknown>")
            validation_type = validation["type"]
# Apply the appropriate validation
if validation_type == "required":
validate_required(data, field)
elif validation_type == "type":
validate_type(data, field, validation["expected_type"])
elif validation_type == "range":
validate_numeric_range(
data, field,
validation.get("min_value"),
validation.get("max_value")
)
elif validation_type == "pattern":
validate_string_pattern(data, field, validation["pattern"])
elif validation_type == "date":
validate_date_format(
data, field,
validation.get("format", "%Y-%m-%d")
)
elif validation_type == "email":
validate_email(data, field)
elif validation_type == "one_of":
validate_one_of(data, field, validation["valid_values"])
else:
errors.append(f"Unknown validation type: {validation_type}")
except ValidationError as e:
errors.append(str(e))
except Exception as e:
errors.append(f"Validation error for field '{field}': {str(e)}")
return errors
def validate_dataset(data: List[Dict[str, Any]],
validations: List[Dict[str, Any]]) -> Dict[str, Any]:
"""
Validate a dataset (list of records) against a set of validations.
Args:
data: List of dictionaries containing the data to validate
validations: List of validation dictionaries
Returns:
Dictionary with validation results
"""
results = {
"total_records": len(data),
"valid_records": 0,
"invalid_records": 0,
"errors": []
}
for i, record in enumerate(data):
errors = validate_record(record, validations)
if errors:
results["invalid_records"] += 1
results["errors"].append({
"record_index": i,
"errors": errors
})
else:
results["valid_records"] += 1
return results
# Example usage
if __name__ == "__main__":
# Sample data
customer_data = [
{
"id": 1,
"name": "Alice Smith",
"email": "alice@example.com",
"age": 30,
"registration_date": "2023-01-15",
"status": "active"
},
{
"id": 2,
"name": "Bob Johnson",
"email": "not-an-email",
"age": 17,
"registration_date": "2023-02-20",
"status": "pending"
}
]
# Define validations
validations = [
{"type": "required", "field": "name"},
{"type": "email", "field": "email"},
{"type": "range", "field": "age", "min_value": 18},
{"type": "date", "field": "registration_date"},
{"type": "one_of", "field": "status", "valid_values": ["active", "inactive", "pending"]}
]
# Validate the dataset
results = validate_dataset(customer_data, validations)
print("Validation Results:")
print(f"Total records: {results['total_records']}")
print(f"Valid records: {results['valid_records']}")
print(f"Invalid records: {results['invalid_records']}")
if results['errors']:
print("\nErrors:")
for error_info in results['errors']:
print(f"Record {error_info['record_index']}:")
for error in error_info['errors']:
print(f" - {error}")
# Validation Results:
# Total records: 2
# Valid records: 1
# Invalid records: 1
#
# Errors:
# Record 1:
# - Field 'email' with value 'not-an-email' does not match the required pattern
# - Field 'age' with value 17 is less than minimum 18
Now, let’s create a comprehensive test suite for this validation module:
# Save as test_data_validator.py
from typing import Dict, List, Any, Callable, Optional, Type
import pytest
import re
from datetime import datetime
from hypothesis import given, strategies as st
import data_validator
#
# Unit Tests for Individual Validation Functions
#
class TestValidateRequired:
"""Tests for the validate_required function."""
def test_valid_values(self) -> None:
"""Test with valid values."""
# Test with different valid values
tests = [
{"name": "Alice"},
{"age": 30},
{"items": [1, 2, 3]},
{"config": {"key": "value"}}
]
for data in tests:
field = list(data.keys())[0]
print(f"Testing required field '{field}' with value '{data[field]}'")
# Should not raise an exception
data_validator.validate_required(data, field)
print("All valid required tests passed")
def test_missing_field(self) -> None:
"""Test with missing field."""
data = {"name": "Alice"}
field = "age"
print(f"Testing missing required field '{field}'")
with pytest.raises(data_validator.ValidationError) as excinfo:
data_validator.validate_required(data, field)
error_msg = str(excinfo.value)
print(f"Got expected error: {error_msg}")
assert "missing" in error_msg
def test_none_value(self) -> None:
"""Test with None value."""
data = {"name": None}
field = "name"
print(f"Testing None value for required field '{field}'")
with pytest.raises(data_validator.ValidationError) as excinfo:
data_validator.validate_required(data, field)
error_msg = str(excinfo.value)
print(f"Got expected error: {error_msg}")
assert "None" in error_msg
def test_empty_values(self) -> None:
"""Test with empty values of different types."""
tests = [
({"name": ""}, "name"),
({"name": " "}, "name"),
({"items": []}, "items"),
({"config": {}}, "config"),
({"values": ()}, "values")
]
for data, field in tests:
print(f"Testing empty value for required field '{field}': {data[field]}")
with pytest.raises(data_validator.ValidationError) as excinfo:
data_validator.validate_required(data, field)
error_msg = str(excinfo.value)
print(f"Got expected error: {error_msg}")
assert "empty" in error_msg
class TestValidateType:
"""Tests for the validate_type function."""
def test_valid_types(self) -> None:
"""Test with valid types."""
tests = [
({"name": "Alice"}, "name", str),
({"age": 30}, "age", int),
({"price": 10.5}, "price", float),
({"price": 10}, "price", [int, float]),
({"active": True}, "active", bool),
({"items": [1, 2, 3]}, "items", list),
({"config": {"key": "value"}}, "config", dict)
]
for data, field, expected_type in tests:
type_name = expected_type.__name__ if isinstance(expected_type, type) else str(expected_type)
print(f"Testing field '{field}' is type {type_name}")
# Should not raise an exception
data_validator.validate_type(data, field, expected_type)
print("All valid type tests passed")
def test_invalid_types(self) -> None:
"""Test with invalid types."""
tests = [
({"name": 123}, "name", str),
({"age": "30"}, "age", int),
({"price": "10.5"}, "price", float),
({"active": "true"}, "active", bool),
({"items": "not a list"}, "items", list),
({"config": "not a dict"}, "config", dict)
]
for data, field, expected_type in tests:
type_name = expected_type.__name__
print(f"Testing field '{field}' with value '{data[field]}' is not type {type_name}")
with pytest.raises(data_validator.ValidationError) as excinfo:
data_validator.validate_type(data, field, expected_type)
error_msg = str(excinfo.value)
print(f"Got expected error: {error_msg}")
assert type_name in error_msg
def test_multiple_types(self) -> None:
"""Test with multiple allowed types."""
# Valid cases
tests_valid = [
({"value": 10}, "value", [int, float, str]),
({"value": 10.5}, "value", [int, float, str]),
({"value": "10"}, "value", [int, float, str])
]
for data, field, expected_types in tests_valid:
print(f"Testing field '{field}' with value '{data[field]}' matches one of {expected_types}")
# Should not raise an exception
data_validator.validate_type(data, field, expected_types)
# Invalid case
data = {"value": True}
field = "value"
expected_types = [int, float, str]
print(f"Testing field '{field}' with value '{data[field]}' doesn't match any of {expected_types}")
with pytest.raises(data_validator.ValidationError) as excinfo:
data_validator.validate_type(data, field, expected_types)
error_msg = str(excinfo.value)
print(f"Got expected error: {error_msg}")
assert "not of expected types" in error_msg
def test_missing_or_none(self) -> None:
"""Test with missing field or None value."""
# Missing field
data = {"name": "Alice"}
field = "age"
expected_type = int
print(f"Testing missing field '{field}' with expected type {expected_type.__name__}")
# Should not raise an exception
data_validator.validate_type(data, field, expected_type)
# None value
data = {"age": None}
field = "age"
print(f"Testing None value for field '{field}' with expected type {expected_type.__name__}")
# Should not raise an exception
data_validator.validate_type(data, field, expected_type)
print("Missing field and None value tests passed")
# Additional test classes for other validation functions would follow a similar pattern
#
# Integration Tests for Multiple Validations
#
class TestValidateRecord:
"""Integration tests for the validate_record function."""
def test_valid_record(self) -> None:
"""Test validating a completely valid record."""
data = {
"id": 1,
"name": "Alice Smith",
"email": "alice@example.com",
"age": 30,
"registration_date": "2023-01-15",
"status": "active"
}
validations = [
{"type": "required", "field": "name"},
{"type": "email", "field": "email"},
{"type": "range", "field": "age", "min_value": 18, "max_value": 120},
{"type": "date", "field": "registration_date"},
{"type": "one_of", "field": "status", "valid_values": ["active", "inactive", "pending"]}
]
print(f"Testing valid record: {data}")
errors = data_validator.validate_record(data, validations)
print(f"Validation errors: {errors}")
assert len(errors) == 0
def test_invalid_record(self) -> None:
"""Test validating a record with multiple validation errors."""
data = {
"id": 2,
"name": "", # Empty required field
"email": "not-an-email", # Invalid email
"age": 16, # Below minimum age
"registration_date": "01/15/2023", # Wrong date format
"status": "disabled" # Not in valid values
}
validations = [
{"type": "required", "field": "name"},
{"type": "email", "field": "email"},
{"type": "range", "field": "age", "min_value": 18, "max_value": 120},
{"type": "date", "field": "registration_date"},
{"type": "one_of", "field": "status", "valid_values": ["active", "inactive", "pending"]}
]
print(f"Testing invalid record: {data}")
errors = data_validator.validate_record(data, validations)
print(f"Validation errors: {errors}")
# Should have 5 validation errors
assert len(errors) == 5
# Check specific error messages
error_texts = " ".join(errors)
assert "required" in error_texts
assert "email" in error_texts
assert "minimum" in error_texts
assert "date" in error_texts
assert "not one of" in error_texts
def test_partial_validation(self) -> None:
"""Test validating a record with some fields valid and some invalid."""
data = {
"id": 3,
"name": "Charlie Brown", # Valid
"email": "charlie@example.com", # Valid
"age": 150, # Above maximum age
"registration_date": "2023-01-15", # Valid
"status": "inactive" # Valid
}
validations = [
{"type": "required", "field": "name"},
{"type": "email", "field": "email"},
{"type": "range", "field": "age", "min_value": 18, "max_value": 120},
{"type": "date", "field": "registration_date"},
{"type": "one_of", "field": "status", "valid_values": ["active", "inactive", "pending"]}
]
print(f"Testing partially valid record: {data}")
errors = data_validator.validate_record(data, validations)
print(f"Validation errors: {errors}")
# Should have 1 validation error
assert len(errors) == 1
assert "maximum" in errors[0]
#
# Property-based Testing with Hypothesis
#
class TestPropertyBased:
"""Property-based tests using Hypothesis."""
@given(st.text())
def test_email_validation_property(self, email: str) -> None:
"""
Test email validation with randomly generated strings.
This demonstrates property-based testing.
"""
data = {"email": email}
# Basic email regex for comparison
email_pattern = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")
expected_valid = bool(email and email_pattern.match(email))
# Determine if validation passes or fails
try:
data_validator.validate_email(data, "email")
actually_valid = True
except data_validator.ValidationError:
actually_valid = False
# Print the first few examples for demonstration
if hasattr(self, 'example_count'):
self.example_count += 1
else:
self.example_count = 1
if self.example_count <= 5:
print(f"Testing email: '{email}', expected valid: {expected_valid}, actually valid: {actually_valid}")
# The validation should match our expected validity
assert expected_valid == actually_valid
@given(
st.integers().map(str),
st.floats(allow_nan=False, allow_infinity=False).map(str),
st.text()
)
def test_numeric_conversion_property(self, int_str: str, float_str: str, non_numeric: str) -> None:
"""
Test numeric range validation with different string inputs.
This tests the type conversion behavior.
"""
        # Skip strings that are empty or actually numeric: Hypothesis can
        # generate text such as "1.5" or "-3" that converts cleanly, which
        # would wrongly fail the "non-numeric" assertion below
        if not non_numeric:
            return
        try:
            float(non_numeric)
            return  # this "non-numeric" sample is convertible; skip it
        except ValueError:
            pass
# Integer string should be convertible
int_data = {"value": int_str}
try:
data_validator.validate_numeric_range(int_data, "value", 0, 1000)
int_valid = True
except data_validator.ValidationError as e:
# If it fails, it should be because of range, not conversion
int_valid = "cannot be converted" not in str(e)
# Float string should be convertible
float_data = {"value": float_str}
try:
data_validator.validate_numeric_range(float_data, "value", 0, 1000)
float_valid = True
except data_validator.ValidationError as e:
# If it fails, it should be because of range, not conversion
float_valid = "cannot be converted" not in str(e)
# Non-numeric string should not be convertible
non_numeric_data = {"value": non_numeric}
try:
data_validator.validate_numeric_range(non_numeric_data, "value", 0, 1000)
non_numeric_valid = True
except data_validator.ValidationError as e:
# It should fail because of conversion, not range
non_numeric_valid = "cannot be converted" in str(e)
# Print a few examples
if hasattr(self, 'numeric_count'):
self.numeric_count += 1
else:
self.numeric_count = 1
if self.numeric_count <= 3:
print(f"Testing numeric conversion:")
print(f" int_str: '{int_str}', convertible: {int_valid}")
print(f" float_str: '{float_str}', convertible: {float_valid}")
print(f" non_numeric: '{non_numeric}', not convertible: {non_numeric_valid}")
# Assertions for conversion behavior
assert int_valid # Integer strings should be convertible
assert float_valid # Float strings should be convertible
assert non_numeric_valid # Non-numeric strings should not be convertible
#
# Fixtures for Reusable Test Data
#
@pytest.fixture
def sample_customer_data() -> List[Dict[str, Any]]:
"""Fixture providing sample customer data for tests."""
return [
{
"id": 1,
"name": "Alice Smith",
"email": "alice@example.com",
"age": 30,
"registration_date": "2023-01-15",
"status": "active"
},
{
"id": 2,
"name": "Bob Johnson",
"email": "bob@example.com",
"age": 25,
"registration_date": "2023-02-20",
"status": "pending"
},
{
"id": 3,
"name": "", # Invalid name
"email": "not-an-email", # Invalid email
"age": 16, # Invalid age
"registration_date": "01/15/2023", # Invalid date format
"status": "disabled" # Invalid status
}
]
@pytest.fixture
def standard_validations() -> List[Dict[str, Any]]:
"""Fixture providing standard validation rules."""
return [
{"type": "required", "field": "name"},
{"type": "email", "field": "email"},
{"type": "range", "field": "age", "min_value": 18, "max_value": 120},
{"type": "date", "field": "registration_date"},
{"type": "one_of", "field": "status", "valid_values": ["active", "inactive", "pending"]}
]
def test_validate_dataset_with_fixtures(
sample_customer_data: List[Dict[str, Any]],
standard_validations: List[Dict[str, Any]]
) -> None:
"""
Integration test using fixtures to validate a complete dataset.
This tests the end-to-end validation process.
"""
results = data_validator.validate_dataset(sample_customer_data, standard_validations)
print("Dataset validation results:")
print(f"Total records: {results['total_records']}")
print(f"Valid records: {results['valid_records']}")
print(f"Invalid records: {results['invalid_records']}")
# Check the overall results
assert results["total_records"] == 3
assert results["valid_records"] == 2
assert results["invalid_records"] == 1
# Check the detailed errors
assert len(results["errors"]) == 1
error_record = results["errors"][0]
assert error_record["record_index"] == 2 # Third record (index 2) has errors
# Should have 5 validation errors on this record
assert len(error_record["errors"]) == 5
# Print the errors for demonstration
for error in error_record["errors"]:
print(f" - {error}")
if __name__ == "__main__":
# Run specific test functions manually for demonstration
# Unit tests
print("\n=== Testing validate_required ===")
required_tests = TestValidateRequired()
required_tests.test_valid_values()
required_tests.test_missing_field()
required_tests.test_none_value()
required_tests.test_empty_values()
print("\n=== Testing validate_type ===")
type_tests = TestValidateType()
type_tests.test_valid_types()
type_tests.test_invalid_types()
type_tests.test_multiple_types()
type_tests.test_missing_or_none()
print("\n=== Testing validate_record (Integration) ===")
record_tests = TestValidateRecord()
record_tests.test_valid_record()
record_tests.test_invalid_record()
record_tests.test_partial_validation()
print("\n=== Testing with Fixtures ===")
    # pytest fixtures cannot be called directly in modern pytest; for this
    # demonstration run we reach the undecorated functions via __wrapped__
    sample_data = sample_customer_data.__wrapped__()
    validations = standard_validations.__wrapped__()
test_validate_dataset_with_fixtures(sample_data, validations)
print("\n=== Property-based Testing ===")
prop_tests = TestPropertyBased()
# Only run a few examples for demonstration
import random
random.seed(42) # For reproducible output
    # Run a few email examples; @given-wrapped tests cannot be called with
    # explicit arguments, so invoke the undecorated test through Hypothesis's
    # inner_test handle
    inner_email_test = TestPropertyBased.test_email_validation_property.hypothesis.inner_test
    for _ in range(5):
        email = ''.join(random.choice('abcdefghijklmnopqrstuvwxyz@.') for _ in range(10))
        inner_email_test(prop_tests, email)
print("\nAll tests completed successfully!")
# Output:
# === Testing validate_required ===
# Testing required field 'name' with value 'Alice'
# Testing required field 'age' with value 30
# Testing required field 'items' with value [1, 2, 3]
# Testing required field 'config' with value {'key': 'value'}
# All valid required tests passed
# Testing missing required field 'age'
# Got expected error: Required field 'age' is missing
# Testing None value for required field 'name'
# Got expected error: Required field 'name' cannot be None
# Testing empty value for required field 'name':
# Got expected error: Required field 'name' cannot be empty
# Testing empty value for required field 'name':
# Got expected error: Required field 'name' cannot be empty
# Testing empty value for required field 'items': []
# Got expected error: Required field 'items' cannot be empty
# Testing empty value for required field 'config': {}
# Got expected error: Required field 'config' cannot be empty
# Testing empty value for required field 'values': ()
# Got expected error: Required field 'values' cannot be empty
#
# === Testing validate_type ===
# Testing field 'name' is type str
# Testing field 'age' is type int
# Testing field 'price' is type float
# Testing field 'price' is type [<class 'int'>, <class 'float'>]
# Testing field 'active' is type bool
# Testing field 'items' is type list
# Testing field 'config' is type dict
# All valid type tests passed
# Testing field 'name' with value '123' is not type str
# Got expected error: Field 'name' with value '123' is not of type str
# Testing field 'age' with value '30' is not type int
# Got expected error: Field 'age' with value '30' is not of type int
# Testing field 'price' with value '10.5' is not type float
# Got expected error: Field 'price' with value '10.5' is not of type float
# Testing field 'active' with value 'true' is not type bool
# Got expected error: Field 'active' with value 'true' is not of type bool
# Testing field 'items' with value 'not a list' is not type list
# Got expected error: Field 'items' with value 'not a list' is not of type list
# Testing field 'config' with value 'not a dict' is not type dict
# Got expected error: Field 'config' with value 'not a dict' is not of type dict
# Testing field 'value' with value '10' matches one of [<class 'int'>, <class 'float'>, <class 'str'>]
# Testing field 'value' with value '10.5' matches one of [<class 'int'>, <class 'float'>, <class 'str'>]
# Testing field 'value' with value '10' matches one of [<class 'int'>, <class 'float'>, <class 'str'>]
# Testing field 'value' with value 'True' doesn't match any of [<class 'int'>, <class 'float'>, <class 'str'>]
# Got expected error: Field 'value' with value 'True' is not of expected types: ['int', 'float', 'str']
# Testing missing field 'age' with expected type int
# Testing None value for field 'age' with expected type int
# Missing field and None value tests passed
#
# === Testing validate_record (Integration) ===
# Testing valid record: {'id': 1, 'name': 'Alice Smith', 'email': 'alice@example.com', 'age': 30, 'registration_date': '2023-01-15', 'status': 'active'}
# Validation errors: []
# Testing invalid record: {'id': 2, 'name': '', 'email': 'not-an-email', 'age': 16, 'registration_date': '01/15/2023', 'status': 'disabled'}
# Validation errors: ["Required field 'name' cannot be empty", "Field 'email' with value 'not-an-email' does not match the required pattern", "Field 'age' with value 16 is less than minimum 18", "Field 'registration_date' with value '01/15/2023' is not a valid date in format '%Y-%m-%d'", "Field 'status' with value 'disabled' is not one of ['active', 'inactive', 'pending']"]
# Testing partially valid record: {'id': 3, 'name': 'Charlie Brown', 'email': 'charlie@example.com', 'age': 150, 'registration_date': '2023-01-15', 'status': 'inactive'}
# Validation errors: ["Field 'age' with value 150 is greater than maximum 120"]
#
# === Testing with Fixtures ===
# Dataset validation results:
# Total records: 3
# Valid records: 2
# Invalid records: 1
# - Required field 'name' cannot be empty
# - Field 'email' with value 'not-an-email' does not match the required pattern
# - Field 'age' with value 16 is less than minimum 18
# - Field 'registration_date' with value '01/15/2023' is not a valid date in format '%Y-%m-%d'
# - Field 'status' with value 'disabled' is not one of ['active', 'inactive', 'pending']
#
# === Property-based Testing ===
# Testing email: 'yyixrdb@jl', expected valid: False, actually valid: False
# Testing email: 'bpybawj.zc', expected valid: False, actually valid: False
# Testing email: 'ywfxepbw@.', expected valid: False, actually valid: False
# Testing email: 'gw@uxeibk.', expected valid: False, actually valid: False
# Testing email: 'n@prlj.csq', expected valid: True, actually valid: True
#
# All tests completed successfully!
README for Data Validation Testing Framework
Overview
This project demonstrates a comprehensive testing approach for a data validation framework. The validation framework provides functions to validate data against various quality criteria including type validation, range checking, format validation, and more.
Components
data_validator.py: The main validation module with functions to validate:
- Required fields
- Data types
- Numeric ranges
- String patterns
- Date formats
- Email formats
- Enumerated values
- Complete records and datasets
test_data_validator.py: Comprehensive test suite including:
- Unit tests for individual validation functions
- Integration tests for multi-validation scenarios
- Property-based tests using Hypothesis
- Fixtures for reusable test data
Testing Approaches Demonstrated
Unit Testing
Each validation function is thoroughly tested with multiple test cases for both valid and invalid inputs.
Integration Testing
The validate_record and validate_dataset functions are tested with combinations of validation rules.
Property-Based Testing
Using the Hypothesis library to generate numerous test cases automatically, ensuring robust validation.
Test Fixtures
Reusable test data and validation rules are provided as fixtures to simplify test setup.
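In larger projects, fixtures like these usually move into a conftest.py in the test directory, where pytest discovers them automatically so every test file can use them without imports. A minimal sketch, reusing rules shown above:
# Save as conftest.py (pytest discovers fixtures here automatically)
from typing import Any, Dict, List
import pytest

@pytest.fixture
def standard_validations() -> List[Dict[str, Any]]:
    """Shared validation rules available to every test file in this directory."""
    return [
        {"type": "required", "field": "name"},
        {"type": "email", "field": "email"},
        {"type": "range", "field": "age", "min_value": 18, "max_value": 120},
    ]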
How to Run the Tests
Prerequisites
- Python 3.7 or higher
- Required packages: pytest, hypothesis
Installation
pip install pytest hypothesis
Running All Tests
pytest -v test_data_validator.py
Running Specific Test Categories
# Run just the unit tests
pytest -v test_data_validator.py::TestValidateRequired test_data_validator.py::TestValidateType
# Run just the integration tests
pytest -v test_data_validator.py::TestValidateRecord
# Run just the property-based tests
pytest -v test_data_validator.py::TestPropertyBased
Checking Test Coverage
pip install pytest-cov
pytest --cov=data_validator test_data_validator.py
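For a line-by-line view of which statements the tests miss, pytest-cov also supports a term-missing report:
pytest --cov=data_validator --cov-report=term-missing test_data_validator.py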
Using the Validation Framework
To use the validation framework in your own applications:
- Import the validation functions:
from data_validator import validate_record, validate_dataset
- Define your validation rules:
validations = [
{"type": "required", "field": "name"},
{"type": "email", "field": "email"},
{"type": "range", "field": "age", "min_value": 18, "max_value": 120},
{"type": "date", "field": "registration_date"},
{"type": "one_of", "field": "status", "valid_values": ["active", "inactive", "pending"]}
]
- Validate your data:
# Validate a single record
errors = validate_record(record, validations)
if errors:
print("Validation failed:", errors)
# Validate a dataset
results = validate_dataset(data, validations)
print(f"Valid records: {results['valid_records']}")
print(f"Invalid records: {results['invalid_records']}")
Project Structure
data-validation/
├── data_validator.py # Main validation module
├── test_data_validator.py # Test suite
└── README.md # This documentation
Key Testing Concepts Demonstrated
- Test Independence: Each test runs in isolation
- Test Coverage: Tests cover both valid and invalid cases
- Edge Cases: Handling of None values, empty values, etc.
- Property-Based Testing: Testing with generated inputs
- Test Fixtures: Reusable test data and setup
- Integration Testing: Testing the system as a whole
- Parametrized Tests: Testing multiple scenarios efficiently (the suite above uses explicit loops; see the sketch below)
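pytest's parametrize marker expresses the same repeated-case idea more idiomatically than the loops used above, reporting each case as a separate test. A minimal sketch against the email validator (the example values are illustrative):
# Parametrized test sketch: one function body, one reported test per case
import pytest
import data_validator

@pytest.mark.parametrize("email,should_pass", [
    ("alice@example.com", True),
    ("bob@site.org", True),
    ("not-an-email", False),
])
def test_email_parametrized(email: str, should_pass: bool) -> None:
    """Each (email, should_pass) tuple runs as its own test case."""
    if should_pass:
        data_validator.validate_email({"email": email}, "email")  # must not raise
    else:
        with pytest.raises(data_validator.ValidationError):
            data_validator.validate_email({"email": email}, "email")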
9.7 Practice Exercises
To reinforce your understanding of testing concepts, here are exercises of increasing complexity:
Exercise 1: Basic Unit Testing
Create a function calculate_discount that calculates a discounted price given an original price and a discount percentage. Then write a set of pytest unit tests that verify:
- Basic discount calculation is correct
- Handling of zero discount
- Handling of 100% discount
- Invalid inputs (negative price, negative/excessive discount)
Exercise 2: Test Fixtures
Create a Product class with attributes for name, price, and category. Then create pytest fixtures for:
- A single product instance
- A collection of products with different prices and categories
- A fixture that yields a temporary file containing serialized product data
Write tests that use these fixtures to verify operations like filtering products by category and price range.
Exercise 3: Testing with Mocks
Create a function fetch_and_process_data that retrieves data from an API and processes it. Then write tests that:
- Mock the API response to test successful data processing
- Mock API failures to test error handling
- Mock different response formats to test the processing logic
Exercise 4: Property-Based Testing
Use Hypothesis to create property-based tests for a function that calculates the median of a list of numbers. Your tests should verify:
- The median of a list with odd length is the middle value when sorted
- The median of a list with even length is the average of the two middle values
- The function correctly handles various numeric types (int, float)
Exercise 5: Integration Testing
Create a simple data pipeline with three components:
- A data extractor that reads from a source
- A transformer that processes the data
- A loader that saves the data
Write unit tests for each component and integration tests that verify the entire pipeline works correctly.
9.8 Exercise Solutions
Here are the solutions to the practice exercises:
Solution to Exercise 1: Basic Unit Testing
# Save as discount_calculator.py
from typing import Union, Optional, Dict, List
def calculate_discount(original_price: float, discount_percent: float) -> float:
"""
Calculate the discounted price.
Args:
original_price: The original price
discount_percent: Discount percentage (0-100)
Returns:
float: The discounted price
Raises:
ValueError: If inputs are invalid
"""
if original_price < 0:
raise ValueError("Original price cannot be negative")
if discount_percent < 0 or discount_percent > 100:
raise ValueError("Discount percentage must be between 0 and 100")
discount_amount = original_price * (discount_percent / 100)
return original_price - discount_amount
# Example usage
if __name__ == "__main__":
price = 100
discount = 20
discounted_price = calculate_discount(price, discount)
print(f"Original price: ${price}")
print(f"Discount: {discount}%")
print(f"Discounted price: ${discounted_price}")
# Original price: $100
# Discount: 20%
# Discounted price: $80.0
# Save as test_discount_calculator.py
from typing import Any
import pytest
from discount_calculator import calculate_discount
def test_basic_discount() -> None:
"""Test basic discount calculation."""
original_price = 100
discount_percent = 20
expected_price = 80
result = calculate_discount(original_price, discount_percent)
print(f"Testing {discount_percent}% discount on ${original_price}")
print(f"Expected: ${expected_price}, Result: ${result}")
assert result == expected_price
def test_zero_discount() -> None:
"""Test with zero discount."""
original_price = 50
discount_percent = 0
expected_price = 50
result = calculate_discount(original_price, discount_percent)
print(f"Testing {discount_percent}% discount on ${original_price}")
print(f"Expected: ${expected_price}, Result: ${result}")
assert result == expected_price
def test_full_discount() -> None:
"""Test with 100% discount."""
original_price = 75
discount_percent = 100
expected_price = 0
result = calculate_discount(original_price, discount_percent)
print(f"Testing {discount_percent}% discount on ${original_price}")
print(f"Expected: ${expected_price}, Result: ${result}")
assert result == expected_price
def test_invalid_price() -> None:
"""Test with negative price."""
with pytest.raises(ValueError) as excinfo:
calculate_discount(-10, 20)
print(f"Testing negative price raises ValueError")
print(f"Got expected error: {excinfo.value}")
assert "negative" in str(excinfo.value)
def test_invalid_discount() -> None:
"""Test with invalid discount percentages."""
# Test negative discount
with pytest.raises(ValueError) as excinfo:
calculate_discount(100, -10)
print(f"Testing negative discount raises ValueError")
print(f"Got expected error: {excinfo.value}")
assert "percentage" in str(excinfo.value)
# Test excessive discount
with pytest.raises(ValueError) as excinfo:
calculate_discount(100, 110)
print(f"Testing discount > 100% raises ValueError")
print(f"Got expected error: {excinfo.value}")
assert "percentage" in str(excinfo.value)
if __name__ == "__main__":
test_basic_discount()
test_zero_discount()
test_full_discount()
test_invalid_price()
test_invalid_discount()
print("All discount tests passed!")
# Testing 20% discount on $100
# Expected: $80, Result: $80.0
# Testing 0% discount on $50
# Expected: $50, Result: $50.0
# Testing 100% discount on $75
# Expected: $0, Result: $0.0
# Testing negative price raises ValueError
# Got expected error: Original price cannot be negative
# Testing negative discount raises ValueError
# Got expected error: Discount percentage must be between 0 and 100
# Testing discount > 100% raises ValueError
# Got expected error: Discount percentage must be between 0 and 100
# All discount tests passed!
Solution to Exercise 2: Test Fixtures
# Save as product.py
from typing import Dict, List, Any, Optional
import json
class Product:
"""Class representing a product."""
def __init__(self, name: str, price: float, category: str) -> None:
"""Initialize a product."""
self.name = name
self.price = price
self.category = category
def to_dict(self) -> Dict[str, Any]:
"""Convert to dictionary representation."""
return {
"name": self.name,
"price": self.price,
"category": self.category
}
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> 'Product':
"""Create a Product from a dictionary."""
return cls(
name=data["name"],
price=data["price"],
category=data["category"]
)
def __repr__(self) -> str:
return f"Product(name='{self.name}', price={self.price}, category='{self.category}')"
def filter_by_category(products: List[Product], category: str) -> List[Product]:
"""Filter products by category."""
return [p for p in products if p.category == category]
def filter_by_price_range(products: List[Product], min_price: float, max_price: float) -> List[Product]:
"""Filter products by price range."""
return [p for p in products if min_price <= p.price <= max_price]
def save_products_to_file(products: List[Product], filename: str) -> None:
"""Save products to a JSON file."""
data = [p.to_dict() for p in products]
with open(filename, 'w') as f:
json.dump(data, f)
def load_products_from_file(filename: str) -> List[Product]:
"""Load products from a JSON file."""
with open(filename, 'r') as f:
data = json.load(f)
return [Product.from_dict(item) for item in data]
# Example usage
if __name__ == "__main__":
# Create some products
products = [
Product("Laptop", 1200, "Electronics"),
Product("Headphones", 85, "Electronics"),
Product("Coffee Mug", 12, "Kitchen"),
Product("Book", 20, "Books"),
Product("Phone", 800, "Electronics")
]
# Filter examples
electronics = filter_by_category(products, "Electronics")
affordable = filter_by_price_range(products, 10, 100)
print("Electronic products:")
for p in electronics:
print(f" - {p.name}: ${p.price}")
print("\nAffordable products ($10-$100):")
for p in affordable:
print(f" - {p.name}: ${p.price}")
# Save and load example
save_products_to_file(products, "products.json")
loaded_products = load_products_from_file("products.json")
print("\nLoaded products:")
for p in loaded_products:
print(f" - {p.name}: ${p.price} ({p.category})")
# Electronic products:
# - Laptop: $1200
# - Headphones: $85
# - Phone: $800
#
# Affordable products ($10-$100):
# - Headphones: $85
# - Coffee Mug: $12
# - Book: $20
#
# Loaded products:
# - Laptop: $1200 (Electronics)
# - Headphones: $85 (Electronics)
# - Coffee Mug: $12 (Kitchen)
# - Book: $20 (Books)
# - Phone: $800 (Electronics)
# Save as test_product.py
from typing import Dict, List, Any, Optional, Iterator
import pytest
import os
import tempfile
from product import Product, filter_by_category, filter_by_price_range, save_products_to_file, load_products_from_file
@pytest.fixture
def sample_product() -> Product:
"""Fixture providing a single product for testing."""
return Product("Test Product", 50, "Test Category")
@pytest.fixture
def product_collection() -> List[Product]:
"""Fixture providing a collection of products for testing."""
return [
Product("Laptop", 1200, "Electronics"),
Product("Headphones", 85, "Electronics"),
Product("Coffee Mug", 12, "Kitchen"),
Product("Book", 20, "Books"),
Product("Phone", 800, "Electronics")
]
@pytest.fixture
def temp_product_file() -> Iterator[str]:
"""Fixture providing a temporary file with serialized product data."""
# Create products
products = [
Product("TempLaptop", 1000, "Electronics"),
Product("TempMug", 15, "Kitchen")
]
# Create temporary file
fd, temp_path = tempfile.mkstemp(suffix='.json')
os.close(fd)
# Save products to the file
save_products_to_file(products, temp_path)
print(f"Created temporary product file: {temp_path}")
# Yield the file path to the test
yield temp_path
# Clean up after the test
if os.path.exists(temp_path):
os.remove(temp_path)
print(f"Removed temporary file: {temp_path}")
def test_product_properties(sample_product: Product) -> None:
"""Test basic product properties."""
print(f"Testing product: {sample_product}")
assert sample_product.name == "Test Product"
assert sample_product.price == 50
assert sample_product.category == "Test Category"
def test_filter_by_category(product_collection: List[Product]) -> None:
"""Test filtering products by category."""
electronics = filter_by_category(product_collection, "Electronics")
kitchen_items = filter_by_category(product_collection, "Kitchen")
print(f"Found {len(electronics)} electronics items:")
for product in electronics:
print(f" - {product.name}")
print(f"Found {len(kitchen_items)} kitchen items:")
for product in kitchen_items:
print(f" - {product.name}")
assert len(electronics) == 3
assert len(kitchen_items) == 1
assert all(p.category == "Electronics" for p in electronics)
assert all(p.category == "Kitchen" for p in kitchen_items)
def test_filter_by_price_range(product_collection: List[Product]) -> None:
"""Test filtering products by price range."""
affordable = filter_by_price_range(product_collection, 10, 100)
mid_range = filter_by_price_range(product_collection, 50, 500)
expensive = filter_by_price_range(product_collection, 500, 2000)
print(f"Affordable products ($10-$100): {len(affordable)}")
for product in affordable:
print(f" - {product.name}: ${product.price}")
assert len(affordable) == 3
    assert len(mid_range) == 1  # only Headphones ($85) falls in the $50-$500 range
assert len(expensive) == 2
assert all(10 <= p.price <= 100 for p in affordable)
assert all(50 <= p.price <= 500 for p in mid_range)
assert all(500 <= p.price <= 2000 for p in expensive)
def test_save_and_load(temp_product_file: str) -> None:
"""Test saving and loading products from a file."""
# Load the products from the temporary file
loaded_products = load_products_from_file(temp_product_file)
print(f"Loaded {len(loaded_products)} products from file:")
for product in loaded_products:
print(f" - {product.name}: ${product.price} ({product.category})")
# Verify the loaded data
assert len(loaded_products) == 2
assert loaded_products[0].name == "TempLaptop"
assert loaded_products[0].price == 1000
assert loaded_products[0].category == "Electronics"
assert loaded_products[1].name == "TempMug"
assert loaded_products[1].price == 15
assert loaded_products[1].category == "Kitchen"
# Test saving additional products
new_product = Product("New Product", 75, "New Category")
loaded_products.append(new_product)
save_products_to_file(loaded_products, temp_product_file)
# Reload and verify
reloaded_products = load_products_from_file(temp_product_file)
assert len(reloaded_products) == 3
assert reloaded_products[2].name == "New Product"
if __name__ == "__main__":
    # pytest fixtures cannot be called directly in modern pytest; use the
    # undecorated functions via __wrapped__ for this demonstration run
    product = sample_product.__wrapped__()
    products = product_collection.__wrapped__()
# Run tests
test_product_properties(product)
test_filter_by_category(products)
test_filter_by_price_range(products)
# For the file test, we'll use a temp file path directly
temp_path = "temp_products.json"
temp_products = [
Product("TempLaptop", 1000, "Electronics"),
Product("TempMug", 15, "Kitchen")
]
save_products_to_file(temp_products, temp_path)
try:
test_save_and_load(temp_path)
finally:
if os.path.exists(temp_path):
os.remove(temp_path)
print("All product tests passed!")
# Testing product: Product(name='Test Product', price=50, category='Test Category')
# Found 3 electronics items:
# - Laptop
# - Headphones
# - Phone
# Found 1 kitchen items:
# - Coffee Mug
# Affordable products ($10-$100): 3
# - Headphones: $85
# - Coffee Mug: $12
# - Book: $20
# Loaded 2 products from file:
# - TempLaptop: $1000 (Electronics)
# - TempMug: $15 (Kitchen)
# All product tests passed!
Simplified Solutions for Exercises 3-5
For brevity, here are the essential approaches for the remaining exercises:
Exercise 3 (Testing with Mocks):
- Create a data_service.py with a function that fetches and processes API data
- Use unittest.mock or pytest-mock to mock the API responses
- Test different response types and error conditions (see the sketch below)
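As a starting point, here is a minimal sketch of that approach, with the function and its tests combined in one block for brevity. The URL, the "name" field, and the use of requests are illustrative assumptions, not requirements of the exercise:
# Sketch for Exercise 3 (module layout and field names are assumptions)
from typing import List
from unittest.mock import Mock, patch

import pytest
import requests

def fetch_and_process_data(url: str) -> List[str]:
    """Fetch a JSON list of records and return their 'name' fields."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return [record["name"] for record in response.json()]

def test_successful_fetch() -> None:
    """Mock a successful API response and check the processing logic."""
    fake_response = Mock()
    fake_response.json.return_value = [{"name": "alpha"}, {"name": "beta"}]
    fake_response.raise_for_status.return_value = None
    with patch("requests.get", return_value=fake_response) as mock_get:
        assert fetch_and_process_data("https://example.com/api") == ["alpha", "beta"]
        mock_get.assert_called_once()

def test_api_failure() -> None:
    """Mock a network failure and verify it surfaces to the caller."""
    with patch("requests.get", side_effect=requests.ConnectionError("connection down")):
        with pytest.raises(requests.ConnectionError):
            fetch_and_process_data("https://example.com/api")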
Exercise 4 (Property-Based Testing):
- Create a median calculation function in statistics_utils.py
- Use Hypothesis to generate random lists of numbers
- Verify that the median function obeys mathematical properties (see the sketch below)
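A minimal sketch of the property-based part, with a simple median implementation included so the block is self-contained (in the exercise this would live in statistics_utils.py):
# Sketch for Exercise 4: properties the median must satisfy
from typing import List, Union

from hypothesis import given, strategies as st

def median(numbers: List[Union[int, float]]) -> float:
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(numbers)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2

@given(st.lists(st.integers(), min_size=1))
def test_median_is_bounded(numbers: List[int]) -> None:
    """Property: the median always lies between the minimum and maximum."""
    assert min(numbers) <= median(numbers) <= max(numbers)

@given(st.lists(st.integers(), min_size=1).filter(lambda xs: len(xs) % 2 == 1))
def test_odd_length_median(numbers: List[int]) -> None:
    """Property: for odd-length lists, the median is the middle sorted value."""
    assert median(numbers) == sorted(numbers)[len(numbers) // 2]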
Exercise 5 (Integration Testing):
- Create separate modules for extraction, transformation, and loading
- Write unit tests for each component
- Write integration tests that verify the full pipeline (see the sketch below)
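And a compact sketch of the integration-test idea, using in-memory stand-ins for the source and sink so the full extract-transform-load path can be exercised without external systems (the record shape is an assumption):
# Sketch for Exercise 5: an end-to-end test over a tiny in-memory pipeline
from typing import Dict, List

def extract(source: List[str]) -> List[Dict[str, str]]:
    """Extract: parse 'name,category' lines into record dictionaries."""
    return [dict(zip(("name", "category"), line.split(","))) for line in source]

def transform(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Transform: strip whitespace and title-case each name."""
    return [{**record, "name": record["name"].strip().title()} for record in records]

def load(records: List[Dict[str, str]], sink: List[Dict[str, str]]) -> int:
    """Load: append records to an in-memory sink and return the count."""
    sink.extend(records)
    return len(records)

def test_pipeline_end_to_end() -> None:
    """Integration test: run all three stages and check the final state."""
    sink: List[Dict[str, str]] = []
    count = load(transform(extract(["alice ,books", "bob,kitchen"])), sink)
    assert count == 2
    assert sink[0] == {"name": "Alice", "category": "books"}
    assert sink[1] == {"name": "Bob", "category": "kitchen"}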
9.9 Chapter Summary and Connection to Next Chapter
In this chapter, we’ve learned how to create comprehensive tests for data engineering code. We’ve covered:
- Basic Testing with pytest - Setting up tests, running them, and understanding results
- Unit Testing - Testing individual functions in isolation
- Test Fixtures - Creating reusable test data
- Mock Objects - Isolating code from external dependencies
- Testing Numerical Accuracy - Ensuring correct calculations
- Property-Based Testing - Testing with generated inputs
- Integration Testing - Testing components together
These testing skills are essential for ensuring the reliability of your data engineering code. By thoroughly testing your code, you catch bugs early, document your expectations, and build confidence in your data pipelines.
In the next chapter, we’ll explore Docker basics for development. Docker containerization complements the testing practices we’ve learned by providing consistent, isolated environments. You’ll learn how to package your Python applications and their dependencies into containers that can run anywhere, ensuring that your tested code works the same way across different environments.
The testing skills from this chapter will be particularly relevant when we start working with containerized applications, as we’ll need to test the behavior of code running inside Docker containers.