07 - Static Typing with Python
Complexity: Moderate (M)
7.0 Introduction: Why Static Typing Matters for Data Engineering
As data engineers, we build systems that process critical business information. Errors in these systems can lead to incorrect analyses, failed pipelines, and data integrity issues. Static typing helps us catch many of these errors before our code ever runs, making our systems more robust and easier to maintain.
Let’s visualize how static typing fits into the data engineering workflow:
flowchart TD
    A[Write Typed Code] --> B[Type Check with Pyright]
    B -->|Errors| C[Fix Type Issues]
    C --> B
    B -->|No Errors| D[Run Unit Tests]
    D --> E[Deploy to Production]

    classDef dev fill:#d0e0ff,stroke:#336,stroke-width:1px
    classDef test fill:#ddffdd,stroke:#363,stroke-width:1px
    classDef prod fill:#f9f9f9,stroke:#333,stroke-width:2px

    class A,B,C dev
    class D test
    class E prod
Python is traditionally a dynamically-typed language, where variable types are determined at runtime. With the introduction of type hints in Python 3.5+, we can now add optional static typing to gain several benefits:
- Catch Errors Earlier: Find type-related bugs before running your code
- Improve Readability: Make it clear what types functions expect and return
- Better IDE Support: Get more accurate code completion and documentation
- Safer Refactoring: Change code with confidence, knowing the type checker will catch issues
- Self-Documenting Code: Make your intentions explicit through type annotations
Python Version Requirements:
- Function type annotations: Python 3.5+
- Variable type annotations: Python 3.6+
- Advanced typing features: Various versions (we’ll specify as we go)
- Type checking with pyright: Compatible with Python 3.5+
In this chapter, we’ll explore how to add type annotations to your Python code and use the pyright
tool to check for type correctness.
7.1 Type Annotation Syntax
Python’s type annotations use a simple syntax to declare the expected types of variables, function parameters, and return values.
7.1.1 Variable Annotations
Variable annotations were introduced in Python 3.6. Let’s start with basic variable annotations:
# Basic variable annotations (Python 3.6+)
name: str = "Alice"
age: int = 30
height: float = 5.9
is_active: bool = True
print(f"name is {name}, type: {type(name)}")
print(f"age is {age}, type: {type(age)}")
print(f"height is {height}, type: {type(height)}")
print(f"is_active is {is_active}, type: {type(is_active)}")
# Output:
# name is Alice, type: <class 'str'>
# age is 30, type: <class 'int'>
# height is 5.9, type: <class 'float'>
# is_active is True, type: <class 'bool'>
Notice that adding type annotations doesn’t change how Python runs the code. The annotations are hints for tools and developers, not runtime checks.
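To see this for yourself, run a deliberately mismatched annotation; Python executes it happily, and only a checker such as pyright objects (a minimal sketch; the error text in the comment is paraphrased):
# Annotations are hints, not runtime checks: this contradiction still runs.
count: int = "not a number"  # a type checker flags this line; Python does not
print(count)  # Output: not a number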
7.1.2 Function Parameter and Return Types
Function annotations were introduced in Python 3.5 and specify the types of parameters and return values:
# Function with parameter and return type annotations (Python 3.5+)
def calculate_total_price(quantity: int, price: float) -> float:
"""Calculate the total price for a purchase."""
return quantity * price
total = calculate_total_price(5, 19.99)
print(f"Total price: ${total:.2f}")
# Output:
# Total price: $99.95
# We can see the type annotations using the __annotations__ attribute
print(f"Function annotations: {calculate_total_price.__annotations__}")
# Output:
# Function annotations: {'quantity': <class 'int'>, 'price': <class 'float'>, 'return': <class 'float'>}
7.1.3 Type Comments for Older Python Versions
If you’re working with Python 3.4 or older, or need to maintain compatibility with those versions, you can use type comments instead:
# Type comments (for Python 3.4 or older)
name = "Bob" # type: str
age = 25 # type: int
def calculate_discount(price, percentage):
# type: (float, float) -> float
"""Calculate the discount amount."""
return price * (percentage / 100)
discount = calculate_discount(99.99, 20)
print(f"Discount amount: ${discount:.2f}")
# Output:
# Discount amount: $20.00
However, for most modern Python development, we recommend using the inline annotation syntax instead of comments when possible. Type checkers like pyright support both styles, but the inline syntax is more readable and better supported by IDEs.
7.2 Common Type Annotations
Now let’s explore the most common type annotations you’ll use in data engineering.
7.2.1 Basic Types
Python’s typing module (introduced in Python 3.5) provides access to many common types:
# Import the typing module to access more types (Python 3.5+)
from typing import Any, List, Dict, Tuple, Set
# Basic type examples
user_id: int = 12345
username: str = "data_engineer"
temperature: float = 72.8
is_available: bool = False
generic_data: Any = "This could be any type" # Any type was available since Python 3.5
print(f"User ID: {user_id}")
print(f"Username: {username}")
print(f"Temperature: {temperature}")
print(f"Available: {is_available}")
print(f"Generic data: {generic_data}")
# Output:
# User ID: 12345
# Username: data_engineer
# Temperature: 72.8
# Available: False
# Generic data: This could be any type
7.2.2 Collection Types
For collections like lists, dictionaries, and tuples, we specify the types of the elements (available since Python 3.5+):
from typing import List, Dict, Tuple, Set # Python 3.5+
# List of integers
numbers: List[int] = [1, 2, 3, 4, 5]
# Dictionary mapping strings to floats
prices: Dict[str, float] = {
"apple": 0.99,
"banana": 0.59,
"orange": 1.29
}
# Tuple with specific types for each position
person: Tuple[str, int, float] = ("Alice", 30, 5.8)
# Set of strings
unique_tags: Set[str] = {"python", "data", "engineering"}
print(f"Numbers: {numbers}")
print(f"Prices: {prices}")
print(f"Person: {person}")
print(f"Unique tags: {unique_tags}")
# Output:
# Numbers: [1, 2, 3, 4, 5]
# Prices: {'apple': 0.99, 'banana': 0.59, 'orange': 1.29}
# Person: ('Alice', 30, 5.8)
# Unique tags: {'python', 'data', 'engineering'}
Note: In Python 3.9+, you can use the built-in collection types directly for annotations:
# Python 3.9+ simplified syntax
numbers: list[int] = [1, 2, 3, 4, 5]
prices: dict[str, float] = {"apple": 0.99}
person: tuple[str, int, float] = ("Alice", 30, 5.8)
7.2.3 Nested Collection Types
Collection types can be nested to represent more complex data structures:
from typing import List, Dict, Tuple, Any # Python 3.5+
# List of dictionaries (common for data processing)
users: List[Dict[str, Any]] = [
{"id": 1, "name": "Alice", "active": True},
{"id": 2, "name": "Bob", "active": False}
]
# Dictionary with tuple keys and list values
coordinates_map: Dict[Tuple[int, int], List[str]] = {
(0, 0): ["origin", "center"],
(10, 10): ["top-right"],
(-10, 10): ["top-left"]
}
print(f"Users: {users}")
print(f"Coordinates map: {coordinates_map}")
# Output:
# Users: [{'id': 1, 'name': 'Alice', 'active': True}, {'id': 2, 'name': 'Bob', 'active': False}]
# Coordinates map: {(0, 0): ['origin', 'center'], (10, 10): ['top-right'], (-10, 10): ['top-left']}
7.2.4 Function Types
We can also define types for functions themselves (available since Python 3.5+):
from typing import Callable # Python 3.5+
# Define a function type: takes two floats and returns a float
MathOperation = Callable[[float, float], float]
def apply_operation(x: float, y: float, operation: MathOperation) -> float:
"""Apply a mathematical operation to two numbers."""
return operation(x, y)
def add(a: float, b: float) -> float:
return a + b
def multiply(a: float, b: float) -> float:
return a * b
result1 = apply_operation(5.0, 3.0, add)
result2 = apply_operation(5.0, 3.0, multiply)
print(f"5 + 3 = {result1}")
print(f"5 * 3 = {result2}")
# Output:
# 5 + 3 = 8.0
# 5 * 3 = 15.0
7.3 Optional and Union Types
In real-world data engineering, we often deal with data that might be missing or could be of different types.
7.3.1 Optional Types
The Optional type (available since Python 3.5) indicates that a value could be of a specific type or None:
from typing import Optional # Python 3.5+
# A function that returns a string or None
def get_user_name(user_id: int) -> Optional[str]:
"""Get a user's name from their ID, or None if user doesn't exist."""
user_database = {
1: "Alice",
2: "Bob",
3: "Charlie"
}
return user_database.get(user_id) # Returns None if key doesn't exist
# Test with existing and non-existing users
for user_id in [1, 4]:
name = get_user_name(user_id)
if name is None:
print(f"User {user_id} not found")
else:
print(f"User {user_id}: {name}")
# Output:
# User 1: Alice
# User 4 not found
Optional[str] is shorthand for Union[str, None]. It’s a common pattern when a value might be missing.
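In practice, a checker will not let you use an Optional value until you have ruled out None. A minimal sketch of this, using the same dictionary-lookup pattern as above (the unchecked call is left commented so the snippet still runs):
from typing import Optional

def lookup_name(user_id: int) -> Optional[str]:
    return {1: "Alice", 2: "Bob"}.get(user_id)

name = lookup_name(3)
# print(name.upper())  # pyright: "upper" is not a known attribute of "None"
if name is not None:
    print(name.upper())  # safe: name is narrowed to str inside this branch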
7.3.2 Union Types
The Union type (available since Python 3.5) indicates that a value could be one of several types:
from typing import Union, List, Dict, Any # Python 3.5+
# A value that can be either an int or a string
user_identifier: Union[int, str] = "user_abc"
print(f"User identifier: {user_identifier}")
# Later in the code, we might assign a different type to the same variable
user_identifier = 12345
print(f"User identifier (changed): {user_identifier}")
# Function that accepts different types of data
def process_data(data: Union[List[Dict[str, Any]], Dict[str, Any]]) -> int:
"""Process either a single record or a list of records."""
if isinstance(data, list):
# Process a list of records
record_count = len(data)
print(f"Processing {record_count} records")
return record_count
else:
# Process a single record
print("Processing a single record")
return 1
# Try both types of input
single_record = {"id": 1, "name": "Alice"}
multiple_records = [
{"id": 1, "name": "Alice"},
{"id": 2, "name": "Bob"}
]
count1 = process_data(single_record)
count2 = process_data(multiple_records)
print(f"Processed records: {count1}")
print(f"Processed records: {count2}")
# Output:
# User identifier: user_abc
# User identifier (changed): 12345
# Processing a single record
# Processing 2 records
# Processed records: 1
# Processed records: 2
Note: In Python 3.10+, you can use the pipe operator for unions:
# Python 3.10+ simplified union syntax
user_identifier: int | str = "user_abc" # Instead of Union[int, str]
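If you must support Python 3.7-3.9 but prefer the pipe style, PEP 563’s deferred annotation evaluation lets you use it inside annotations on those versions (only in annotation position; it won’t work in runtime expressions such as isinstance checks there):
from __future__ import annotations  # Python 3.7+: annotations are kept as strings

def find_user(identifier: int | str) -> str | None:  # accepted by pyright even before 3.10
    return str(identifier)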
7.3.3 Type Narrowing
Type checkers like pyright can narrow a value’s type based on conditional checks:
from typing import Union # Python 3.5+
def display_length(value: Union[str, list]) -> None:
"""Display the length of a string or list."""
# Type narrowing through isinstance check
if isinstance(value, str):
# Inside this block, the type checker knows value is a string
print(f"String length: {len(value)}")
else:
# Inside this block, the type checker knows value is a list
print(f"List length: {len(value)}")
# Test with different types
display_length("Hello, world!")
display_length([1, 2, 3, 4])
# Output:
# String length: 13
# List length: 4
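isinstance is not the only way to narrow: comparing against None (or returning early) narrows Optional values too. A small sketch:
from typing import Optional

def normalize(value: Optional[str]) -> str:
    if value is None:
        return ""  # after this early return, value is narrowed to str below
    return value.strip().lower()

print(normalize("  Hello World  "))  # hello world
print(repr(normalize(None)))         # ''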
7.4 TypedDict and NamedTuple
For data engineering, we often need to define structured types for dictionaries and tuples. Python’s typing module provides TypedDict and NamedTuple to help with this.
7.4.1 TypedDict
TypedDict (introduced in Python 3.8, and available earlier through the typing_extensions package) lets us define the expected structure of a dictionary:
from typing import TypedDict, List # TypedDict requires Python 3.8+
# Define a structured dictionary type
class UserDict(TypedDict):
id: int
name: str
email: str
active: bool
tags: List[str]
# Create a user with the expected structure
user: UserDict = {
"id": 1,
"name": "Alice Smith",
"email": "alice@example.com",
"active": True,
"tags": ["admin", "developer"]
}
def format_user(user: UserDict) -> str:
"""Format a user for display."""
status = "Active" if user["active"] else "Inactive"
tags = ", ".join(user["tags"]) if user["tags"] else "No tags"
return f"User {user['id']}: {user['name']} ({user['email']}) - {status} - Tags: {tags}"
print(format_user(user))
# Output:
# User 1: Alice Smith (alice@example.com) - Active - Tags: admin, developer
Note for Python 3.5-3.7: If you’re using an older Python version, install the typing_extensions package and import TypedDict from there:
from typing_extensions import TypedDict  # For Python 3.5-3.7
TypedDict is especially useful for working with JSON data and API responses in data engineering.
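For example, you can describe the shape of a parsed JSON payload up front. json.loads returns Any, so annotating the result restores checking from that point on (a sketch; the payload shape here is made up for illustration):
import json
from typing import List, TypedDict

class ApiUser(TypedDict):
    id: int
    name: str

payload = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'
users: List[ApiUser] = json.loads(payload)  # json.loads returns Any; we declare the shape

for user in users:
    print(user["name"])  # pyright treats this as str
    # user["emal"]       # a typo in the key would be flagged against ApiUser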
7.4.2 NamedTuple
NamedTuple with type annotations (available since Python 3.6) creates a tuple with named fields, making it more readable than a regular tuple:
from typing import NamedTuple, List # Typed NamedTuple requires Python 3.6+
# Define a named tuple type
class DataPoint(NamedTuple):
timestamp: str
value: float
tags: List[str]
# Create a data point
temperature = DataPoint(
timestamp="2023-04-15T12:30:00",
value=72.5,
tags=["temperature", "indoor"]
)
# Access fields by name instead of index
print(f"Time: {temperature.timestamp}")
print(f"Value: {temperature.value}°F")
print(f"Tags: {', '.join(temperature.tags)}")
# Named tuples are still tuples and support unpacking
time, value, tags = temperature
print(f"Unpacked - Time: {time}, Value: {value}, Tags: {tags}")
# Output:
# Time: 2023-04-15T12:30:00
# Value: 72.5°F
# Tags: temperature, indoor
# Unpacked - Time: 2023-04-15T12:30:00, Value: 72.5, Tags: ['temperature', 'indoor']
Note: The regular collections.namedtuple has been available since Python 2.6, but the typed version shown here was introduced in Python 3.6 with the typing module.
NamedTuple is similar to a simple dataclass and works well for immutable records; named fields make your code more readable than numeric tuple indices.
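For comparison, here is the same record as a frozen dataclass (Python 3.7+); you lose tuple unpacking but gain defaults and finer control over mutability (a sketch, not a drop-in replacement for the example above):
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)  # frozen=True makes instances immutable, like a NamedTuple
class DataPointDC:
    timestamp: str
    value: float
    tags: List[str] = field(default_factory=list)

point = DataPointDC("2023-04-15T12:30:00", 72.5, ["temperature", "indoor"])
print(point.value)  # 72.5 (attribute access works; tuple unpacking does not)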
7.5 Type Checking with Pyright
Pyright is a static type checker for Python that helps find type-related errors before you run your code. Let’s explore how to use it with our type-annotated code.
7.5.1 Installing and Running Pyright
First, we need to install pyright:
# Install pyright using pip
pip install pyright
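Pyright itself is written in TypeScript; the pip package wraps the Node.js distribution, so you can equally install it through npm if you already have Node.js available:
# Alternative: install globally via npm (requires Node.js)
npm install -g pyright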
Now, let’s create a simple file with some type issues to see how pyright identifies them:
# Save this in a file named type_errors.py
from typing import List, Dict
def add_numbers(a: int, b: int) -> int:
return a + b
def get_user_names(users: List[Dict[str, str]]) -> List[str]:
return [user["name"] for user in users]
# Type error: passing a string instead of an int
result = add_numbers(5, "10")
print(f"Result: {result}")
# Runtime bug: one dictionary is missing the 'name' key (not visible to pyright with Dict[str, str])
users = [
{"name": "Alice", "email": "alice@example.com"},
{"email": "bob@example.com"} # Missing 'name' key
]
names = get_user_names(users)
print(f"Names: {names}")
Now let’s run pyright to check for type errors:
# Run pyright on our file
pyright type_errors.py
Pyright flags the argument type mismatch:
type_errors.py:9:26 - error: Argument of type "str" cannot be assigned to parameter "b" of type "int"
Note that the second problem, the dictionary missing its "name" key, slips through: Dict[str, str] says nothing about which keys a dictionary must contain, so user["name"] fails only at runtime with a KeyError. This is precisely the gap that TypedDict (section 7.4) closes.
7.5.2 Understanding and Fixing Type Errors
Let’s fix the errors in our code:
# Save this in a file named type_errors_fixed.py
from typing import List, Dict, Any
def add_numbers(a: int, b: int) -> int:
return a + b
def get_user_names(users: List[Dict[str, Any]]) -> List[str]:
return [user["name"] if "name" in user else "Unknown" for user in users]
# Fixed: convert string to int before passing
result = add_numbers(5, int("10"))
print(f"Result: {result}")
# Fixed: handle missing 'name' key
users = [
{"name": "Alice", "email": "alice@example.com"},
{"email": "bob@example.com"} # Missing 'name' key
]
names = get_user_names(users)
print(f"Names: {names}")
# Output:
# Result: 15
# Names: ['Alice', 'Unknown']
Now when we run pyright again, it should pass without errors:
pyright type_errors_fixed.py
7.5.3 Configuring Pyright
For larger projects, you can configure pyright using a pyrightconfig.json file in your project root:
{
"include": ["src"],
"exclude": ["**/node_modules", "**/__pycache__"],
"ignore": ["src/legacy"],
"reportMissingImports": true,
"reportMissingTypeStubs": false,
"pythonVersion": "3.9",
"typeCheckingMode": "basic"
}
Some important configuration options:
- include: Directories to include in type checking
- exclude: Directories to exclude
- ignore: Files or directories to ignore
- pythonVersion: Target Python version
- typeCheckingMode: "off", "basic", or "strict"
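Two details worth knowing beyond the config file: a single module can opt into stricter checking with a file-level comment, and the CLI accepts an explicit project path (the path below is illustrative):
# Inside one .py file, a top-of-file comment "# pyright: strict"
# enables strict mode for that file only.

# Point the CLI at a specific project directory or config file:
pyright --project path/to/pyrightconfig.json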
7.5.4 Type Checking in Your Editor
Most modern Python editors support type checking through pyright or other tools:
- VS Code: Integrates with pyright through the Pylance extension
- PyCharm: Has built-in type checking
- vim/Neovim: Can use pyright through the Language Server Protocol
This gives you immediate feedback as you type, making type errors even easier to catch and fix.
7.6 Integrating with Existing Code
When working with data engineering systems, you’ll often need to add type annotations to existing code incrementally.
7.6.1 Gradual Typing
Python’s type system is designed for gradual typing, meaning you can add type annotations to your codebase incrementally:
# Start by adding types to function signatures
def process_data(data): # No types yet
result = []
for item in data:
result.append(transform_item(item))
return result
# Then gradually add more specific types
from typing import List, Dict, Any
def transform_item(item: Dict[str, Any]) -> Dict[str, Any]:
"""Transform a single data item."""
return {
"id": item.get("id", 0),
"name": item.get("name", "").upper(),
"processed": True
}
def process_data(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Process a list of data items."""
result = []
for item in data:
result.append(transform_item(item))
return result
# Test with sample data
sample_data = [
{"id": 1, "name": "Alice"},
{"id": 2, "name": "Bob"}
]
processed = process_data(sample_data)
print(f"Processed data: {processed}")
# Output:
# Processed data: [{'id': 1, 'name': 'ALICE', 'processed': True}, {'id': 2, 'name': 'BOB', 'processed': True}]
7.6.2 Using Any for Flexibility
When you’re not sure about types or need flexibility, you can use Any:
from typing import Any, Dict, List
def parse_data(raw_data: Any) -> List[Dict[str, Any]]:
"""Parse data that could be in various formats."""
if isinstance(raw_data, list):
return raw_data # Already a list
elif isinstance(raw_data, dict):
return [raw_data] # Single dictionary
elif isinstance(raw_data, str):
try:
import json
parsed = json.loads(raw_data)
return parse_data(parsed) # Recursively handle the parsed data
except json.JSONDecodeError:
return [{"value": raw_data}] # Treat as a single value
else:
return [{"value": str(raw_data)}] # Convert to string
# Test with different input types
test_inputs = [
[{"id": 1}, {"id": 2}], # List of dictionaries
{"id": 3}, # Single dictionary
'{"id": 4}', # JSON string
"plain text", # Plain string
42 # Number
]
for input_data in test_inputs:
result = parse_data(input_data)
print(f"Input: {input_data} -> Result: {result}")
# Output:
# Input: [{'id': 1}, {'id': 2}] -> Result: [{'id': 1}, {'id': 2}]
# Input: {'id': 3} -> Result: [{'id': 3}]
# Input: {"id": 4} -> Result: [{'id': 4}]
# Input: plain text -> Result: [{'value': 'plain text'}]
# Input: 42 -> Result: [{'value': '42'}]
Remember that while Any is useful, it effectively opts out of type checking for that value. Use it sparingly and replace it with more specific types when possible.
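To see what Any gives up, compare these two functions: the typo in the first passes the checker because any attribute access on Any is allowed, while the concrete annotation in the second lets pyright catch the same mistake (a minimal sketch; neither function is called here):
from typing import Any

def shout_loose(value: Any) -> str:
    return value.uppre()  # typo, but unchecked: Any permits any attribute

def shout_strict(value: str) -> str:
    return value.upper()  # with value.uppre() here, pyright would report an error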
7.7 Type Safety for Data Engineering
Now let’s look at some data-engineering-specific examples of how type safety helps build more robust data pipelines.
7.7.1 Data Schema Definitions
In data engineering, we often work with schema definitions:
from typing import Dict, List, Union, Optional, TypedDict # Python 3.8+ for TypedDict
# For Python 3.5-3.7, use:
# from typing import Dict, List, Union, Optional
# from typing_extensions import TypedDict
# Define types for database schema
class ColumnSchema(TypedDict):
name: str
data_type: str
nullable: bool
description: Optional[str]
class TableSchema(TypedDict):
name: str
columns: List[ColumnSchema]
primary_key: List[str]
# Create a sample table schema
user_table: TableSchema = {
"name": "users",
"columns": [
{"name": "id", "data_type": "integer", "nullable": False, "description": "Unique user ID"},
{"name": "username", "data_type": "varchar(50)", "nullable": False, "description": "User's login name"},
{"name": "email", "data_type": "varchar(100)", "nullable": False, "description": "User's email address"},
{"name": "active", "data_type": "boolean", "nullable": False, "description": "Whether the user account is active"}
],
"primary_key": ["id"]
}
def generate_create_table_sql(schema: TableSchema) -> str:
"""Generate SQL CREATE TABLE statement from schema."""
columns_sql = []
for column in schema["columns"]:
parts = [f"{column['name']} {column['data_type']}"]
if not column["nullable"]:
parts.append("NOT NULL")
if "description" in column and column["description"]:
# Escape single quotes in the description
escaped_desc = column["description"].replace("'", "''")
parts.append(f"COMMENT '{escaped_desc}'")
columns_sql.append(" ".join(parts))
primary_key = f"PRIMARY KEY ({', '.join(schema['primary_key'])})"
columns_sql.append(primary_key)
sql = f"CREATE TABLE {schema['name']} (\n " + ",\n ".join(columns_sql) + "\n);"
return sql
# Generate and print the SQL
sql = generate_create_table_sql(user_table)
print(sql)
# Output:
# CREATE TABLE users (
# id integer NOT NULL COMMENT 'Unique user ID',
# username varchar(50) NOT NULL COMMENT 'User''s login name',
# email varchar(100) NOT NULL COMMENT 'User''s email address',
# active boolean NOT NULL COMMENT 'Whether the user account is active',
# PRIMARY KEY (id)
# );
7.7.2 ETL Pipeline with Type Annotations
Here’s a simple ETL (Extract, Transform, Load) example with type annotations (all features used are available since Python 3.5+):
# Standard library imports (typing was added to stdlib in Python 3.5)
from typing import List, Dict, Any, Callable, Optional
# Define types for our ETL pipeline
RawData = List[Dict[str, Any]]
TransformedData = List[Dict[str, Any]]
Extractor = Callable[[], RawData]
Transformer = Callable[[RawData], TransformedData]
Loader = Callable[[TransformedData], None]
def create_etl_pipeline(
extract: Extractor,
transform: Transformer,
load: Loader
) -> Callable[[], None]:
"""Create an ETL pipeline from extract, transform, and load functions."""
def pipeline() -> None:
"""Execute the ETL pipeline."""
print("Starting ETL pipeline")
# Extract
print("Extracting data...")
raw_data = extract()
print(f"Extracted {len(raw_data)} records")
# Transform
print("Transforming data...")
transformed_data = transform(raw_data)
print(f"Transformed data into {len(transformed_data)} records")
# Load
print("Loading data...")
load(transformed_data)
print("Loading complete")
print("ETL pipeline complete")
return pipeline
# Example implementation of extract, transform, and load functions
def extract_sample_data() -> RawData:
"""Extract sample data (in real life, this would come from a database or API)."""
return [
{"id": 1, "name": "alice", "age": "30"},
{"id": 2, "name": "bob", "age": "25"},
{"id": 3, "name": "charlie", "age": None}
]
def transform_sample_data(data: RawData) -> TransformedData:
"""Transform the data by standardizing types and formats."""
result = []
for record in data:
transformed = {
"user_id": record["id"],
"full_name": record["name"].title(),
"age": int(record["age"]) if record["age"] is not None else None,
"is_adult": True if record["age"] is not None and int(record["age"]) >= 18 else False
}
result.append(transformed)
return result
def load_to_console(data: TransformedData) -> None:
"""Load the data (in this case, just print it)."""
for record in data:
print(f"Loaded: {record}")
# Create and run our pipeline
etl_pipeline = create_etl_pipeline(
extract_sample_data,
transform_sample_data,
load_to_console
)
etl_pipeline()
# Output:
# Starting ETL pipeline
# Extracting data...
# Extracted 3 records
# Transforming data...
# Transformed data into 3 records
# Loading data...
# Loaded: {'user_id': 1, 'full_name': 'Alice', 'age': 30, 'is_adult': True}
# Loaded: {'user_id': 2, 'full_name': 'Bob', 'age': 25, 'is_adult': True}
# Loaded: {'user_id': 3, 'full_name': 'Charlie', 'age': None, 'is_adult': False}
# Loading complete
# ETL pipeline complete
This pipeline structure makes it clear what types of data flow between each stage, making it easier to understand and maintain.
7.8 Micro-Project: Type-Safe Data Processor
Project Requirements
For this micro-project, you’ll add comprehensive static typing to a data processing script using Python’s type annotation system and verify correctness with pyright.
Objectives:
- Add complete type annotations to the provided data processor script
- Include proper typing for variables, function parameters, and return values
- Use complex types where appropriate (List, Dict, Optional, Union, etc.)
- Create custom type definitions for domain-specific structures
- Configure pyright for type checking
- Fix any type issues identified during checking
Acceptance Criteria
- All functions have properly annotated parameters and return types
- Variables are appropriately typed, especially in complex data processing functions
- You use appropriate collection types (List, Dict, Tuple, etc.) with their content types specified
- Code includes at least one custom type definition (TypedDict, NamedTuple, or dataclass)
- You use Optional or Union types where values might be None or of different types
- pyright runs with strict settings and reports no errors
- Type annotations don’t change the runtime behavior of the application
- Documentation explains any cases where type annotations required code restructuring
Common Pitfalls
Type annotations becoming overly complex and hard to read
- Solution: Start with simpler types and gradually refine them as needed
Using Any type too liberally, defeating the purpose of typing
- Solution: Be as specific as possible with types, even if it requires more effort
Type checking errors that seem impossible to resolve
- Solution: Use Union and Optional types to handle edge cases, or use cast() when appropriate (see the sketch below)
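When you genuinely know more than the checker, typing.cast tells it to trust you; it has no runtime effect. A minimal sketch using parsed JSON:
import json
from typing import Any, Dict, cast

raw = json.loads('{"id": 1, "name": "Alice"}')  # inferred as Any
record = cast(Dict[str, Any], raw)  # checker hint only; performs no runtime check
print(record["name"])  # Output: Alice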
Production vs. Educational Differences
In a production environment, this project would differ in several ways:
- CI/CD Integration: Type checking would be part of automated tests in CI/CD pipelines
- Typing Coverage: Production code might have type coverage metrics and require minimum thresholds
- Performance Considerations: Production systems might use type comments in performance-critical code
- Legacy Code Handling: Production would have strategies for adding types to legacy code without breaking changes
- Team Standards: There would be team-wide typing standards and conventions
Our educational version focuses on learning the core concepts without these production complexities.
Implementation
Here’s a simple data processor script without type annotations that we’ll enhance:
# data_processor.py - Before adding type annotations
def parse_csv_line(line):
"""Parse a CSV line into a list of values."""
return line.strip().split(',')
def process_record(record):
"""Process a single data record."""
if len(record) < 3:
return None
# Extract and transform fields
try:
id_value = int(record[0])
name = record[1].strip()
value = float(record[2])
tags = record[3].split(';') if len(record) > 3 else []
return {
"id": id_value,
"name": name,
"value": value,
"tags": tags,
"is_valid": value > 0
}
except (ValueError, IndexError):
return None
def process_data(data_lines):
"""Process multiple lines of data."""
records = []
error_count = 0
for i, line in enumerate(data_lines):
record_data = parse_csv_line(line)
processed_record = process_record(record_data)
if processed_record:
records.append(processed_record)
else:
error_count += 1
print(f"Error processing line {i+1}: {line.strip()}")
return {
"records": records,
"record_count": len(records),
"error_count": error_count
}
def calculate_statistics(processed_data):
"""Calculate statistics from processed data."""
records = processed_data["records"]
if not records:
return {
"count": 0,
"total": 0,
"average": 0,
"min": None,
"max": None
}
values = [r["value"] for r in records]
return {
"count": len(values),
"total": sum(values),
"average": sum(values) / len(values),
"min": min(values),
"max": max(values)
}
def generate_report(processed_data, statistics):
"""Generate a report from processed data and statistics."""
report_lines = []
report_lines.append("DATA PROCESSING REPORT")
report_lines.append("=====================")
report_lines.append(f"Processed {processed_data['record_count']} records with {processed_data['error_count']} errors")
report_lines.append("")
report_lines.append("STATISTICS:")
report_lines.append(f"- Total: {statistics['total']:.2f}")
report_lines.append(f"- Average: {statistics['average']:.2f}")
report_lines.append(f"- Min: {statistics['min']}")
report_lines.append(f"- Max: {statistics['max']}")
report_lines.append("")
report_lines.append("RECORDS:")
for record in processed_data["records"]:
tags = f" [Tags: {', '.join(record['tags'])}]" if record["tags"] else ""
status = "VALID" if record["is_valid"] else "INVALID"
report_lines.append(f"- {record['id']}: {record['name']} = {record['value']} ({status}){tags}")
return "\n".join(report_lines)
def main():
# Sample data for testing
sample_data = [
"1,Temperature,23.5,sensor;outdoor",
"2,Humidity,45.2,sensor;indoor",
"3,Pressure,-1.0,sensor",
"4,Invalid,abc,test",
"5,Partial",
"6,Wind Speed,15.7"
]
# Process the data
processed_data = process_data(sample_data)
statistics = calculate_statistics(processed_data)
report = generate_report(processed_data, statistics)
# Print the report
print(report)
if __name__ == "__main__":
main()
Now, let’s add type annotations to this script:
# data_processor_typed.py - With type annotations
from typing import List, Dict, Optional, TypedDict
# Custom type definitions
class Record(TypedDict):
"""Structure for a processed data record."""
id: int
name: str
value: float
tags: List[str]
is_valid: bool
class Statistics(TypedDict):
"""Structure for statistical calculations."""
count: int
total: float
average: float
min: Optional[float]
max: Optional[float]
class ProcessedData(TypedDict):
"""Structure for the results of data processing."""
records: List[Record]
record_count: int
error_count: int
def parse_csv_line(line: str) -> List[str]:
"""Parse a CSV line into a list of values."""
return line.strip().split(',')
def process_record(record: List[str]) -> Optional[Record]:
"""Process a single data record."""
if len(record) < 3:
return None
# Extract and transform fields
try:
id_value = int(record[0])
name = record[1].strip()
value = float(record[2])
tags = record[3].split(';') if len(record) > 3 else []
return {
"id": id_value,
"name": name,
"value": value,
"tags": tags,
"is_valid": value > 0
}
except (ValueError, IndexError):
return None
def process_data(data_lines: List[str]) -> ProcessedData:
"""Process multiple lines of data."""
records: List[Record] = []
error_count: int = 0
for i, line in enumerate(data_lines):
record_data = parse_csv_line(line)
processed_record = process_record(record_data)
if processed_record:
records.append(processed_record)
else:
error_count += 1
print(f"Error processing line {i+1}: {line.strip()}")
return {
"records": records,
"record_count": len(records),
"error_count": error_count
}
def calculate_statistics(processed_data: ProcessedData) -> Statistics:
"""Calculate statistics from processed data."""
records = processed_data["records"]
if not records:
return {
"count": 0,
"total": 0.0,
"average": 0.0,
"min": None,
"max": None
}
values: List[float] = [r["value"] for r in records]
return {
"count": len(values),
"total": sum(values),
"average": sum(values) / len(values),
"min": min(values),
"max": max(values)
}
def generate_report(processed_data: ProcessedData, statistics: Statistics) -> str:
"""Generate a report from processed data and statistics."""
report_lines: List[str] = []
report_lines.append("DATA PROCESSING REPORT")
report_lines.append("=====================")
report_lines.append(f"Processed {processed_data['record_count']} records with {processed_data['error_count']} errors")
report_lines.append("")
report_lines.append("STATISTICS:")
report_lines.append(f"- Total: {statistics['total']:.2f}")
report_lines.append(f"- Average: {statistics['average']:.2f}")
report_lines.append(f"- Min: {statistics['min']}")
report_lines.append(f"- Max: {statistics['max']}")
report_lines.append("")
report_lines.append("RECORDS:")
for record in processed_data["records"]:
tags = f" [Tags: {', '.join(record['tags'])}]" if record["tags"] else ""
status = "VALID" if record["is_valid"] else "INVALID"
report_lines.append(f"- {record['id']}: {record['name']} = {record['value']} ({status}){tags}")
return "\n".join(report_lines)
def main() -> None:
# Sample data for testing
sample_data: List[str] = [
"1,Temperature,23.5,sensor;outdoor",
"2,Humidity,45.2,sensor;indoor",
"3,Pressure,-1.0,sensor",
"4,Invalid,abc,test",
"5,Partial",
"6,Wind Speed,15.7"
]
# Process the data
processed_data: ProcessedData = process_data(sample_data)
statistics: Statistics = calculate_statistics(processed_data)
report: str = generate_report(processed_data, statistics)
# Print the report
print(report)
if __name__ == "__main__":
main()
How to Run and Test the Solution
To run and test our type-safe data processor:
Save the code to a file named data_processor_typed.py
Install pyright if you haven’t already:
pip install pyright
Run the script to verify it works:
python data_processor_typed.py
You should see output like:
Error processing line 4: 4,Invalid,abc,test
Error processing line 5: 5,Partial
DATA PROCESSING REPORT
=====================
Processed 3 records with 2 errors

STATISTICS:
- Total: 84.40
- Average: 28.13
- Min: 15.7
- Max: 45.2

RECORDS:
- 1: Temperature = 23.5 (VALID) [Tags: sensor, outdoor]
- 2: Humidity = 45.2 (VALID) [Tags: sensor, indoor]
- 6: Wind Speed = 15.7 (VALID)
Run pyright to check for type errors:
pyright data_processor_typed.py
If everything is correct, you should see no errors.
Try introducing a type error to see how pyright catches it. For example, in calculate_statistics, change:
values: List[float] = [r["value"] for r in records]
To:
values: List[int] = [r["value"] for r in records]
Then run pyright again to see the error.
Key Additions to the Code
- Custom Type Definitions: We added Record, Statistics, and ProcessedData types using TypedDict
- Function Annotations: All functions now have parameter and return type annotations
- Optional Types: We used Optional[Record] to indicate the function might return None
- Variable Annotations: Key variables have explicit type annotations
- Collection Types: We specified the element types for lists and dictionaries
These annotations make it clear what data structures we expect throughout our code, helping catch errors before runtime.
7.9 Practice Exercises
To reinforce the concepts we’ve learned, here are some exercises to try:
Exercise 1: Basic Type Annotations
Add type annotations to the following function:
def calculate_discounted_price(price, discount_percentage, tax_rate):
"""Calculate the final price after discount and tax."""
discounted_price = price * (1 - discount_percentage / 100)
final_price = discounted_price * (1 + tax_rate / 100)
return round(final_price, 2)
Exercise 2: Collection Types
Add type annotations to the following function:
def group_by_category(products):
"""Group products by their category."""
result = {}
for product in products:
category = product.get("category", "Uncategorized")
if category not in result:
result[category] = []
result[category].append(product)
return result
Exercise 3: Optional and Union Types
Add type annotations to the following function:
def find_user(user_id=None, email=None):
"""Find a user by ID or email."""
if user_id is None and email is None:
return None
# Simulated database
users = [
{"id": 1, "email": "user1@example.com", "name": "User One"},
{"id": 2, "email": "user2@example.com", "name": "User Two"}
]
if user_id is not None:
for user in users:
if user["id"] == user_id:
return user
if email is not None:
for user in users:
if user["email"] == email:
return user
return None
Exercise 4: TypedDict
Create a TypedDict for representing a book and use it to annotate the following function:
def format_book_citation(book):
"""Format a book for citation."""
authors = ", ".join(book["authors"])
return f"{authors}. ({book['year']}). {book['title']}. {book['publisher']}."
Exercise 5: Function Types
Add type annotations to the following higher-order function:
def create_data_validator(validation_rules):
"""Create a validator function based on rules."""
def validator(data):
"""Validate data against rules."""
errors = []
for field, rule in validation_rules.items():
if field not in data:
errors.append(f"Missing field: {field}")
elif not rule(data[field]):
errors.append(f"Invalid value for {field}: {data[field]}")
return errors
return validator
7.10 Exercise Solutions
Solution to Exercise 1: Basic Type Annotations
def calculate_discounted_price(
price: float,
discount_percentage: float,
tax_rate: float
) -> float:
"""Calculate the final price after discount and tax."""
discounted_price = price * (1 - discount_percentage / 100)
final_price = discounted_price * (1 + tax_rate / 100)
return round(final_price, 2)
# Test the function
result = calculate_discounted_price(100.0, 20.0, 10.0)
print(f"Final price: ${result}")
# Output:
# Final price: $88.0
Solution to Exercise 2: Collection Types
from typing import List, Dict, Any
def group_by_category(products: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
"""Group products by their category."""
result: Dict[str, List[Dict[str, Any]]] = {}
for product in products:
category = product.get("category", "Uncategorized")
if category not in result:
result[category] = []
result[category].append(product)
return result
# Test the function
sample_products = [
{"id": 1, "name": "Laptop", "category": "Electronics"},
{"id": 2, "name": "Mouse", "category": "Electronics"},
{"id": 3, "name": "Coffee Mug", "category": "Kitchen"},
{"id": 4, "name": "T-shirt", "category": "Clothing"}
]
grouped = group_by_category(sample_products)
print("Products grouped by category:")
for category, items in grouped.items():
names = [item["name"] for item in items]
print(f"- {category}: {', '.join(names)}")
# Output:
# Products grouped by category:
# - Electronics: Laptop, Mouse
# - Kitchen: Coffee Mug
# - Clothing: T-shirt
Solution to Exercise 3: Optional and Union Types
from typing import Dict, List, Optional, Union
def find_user(
user_id: Optional[int] = None,
email: Optional[str] = None
) -> Optional[Dict[str, Union[int, str]]]:
"""Find a user by ID or email."""
if user_id is None and email is None:
return None
# Simulated database
users: List[Dict[str, Union[int, str]]] = [
{"id": 1, "email": "user1@example.com", "name": "User One"},
{"id": 2, "email": "user2@example.com", "name": "User Two"}
]
if user_id is not None:
for user in users:
if user["id"] == user_id:
return user
if email is not None:
for user in users:
if user["email"] == email:
return user
return None
# Test the function
user1 = find_user(user_id=1)
user2 = find_user(email="user2@example.com")
nonexistent = find_user(user_id=999)
print(f"User 1: {user1}")
print(f"User 2: {user2}")
print(f"Nonexistent user: {nonexistent}")
# Output:
# User 1: {'id': 1, 'email': 'user1@example.com', 'name': 'User One'}
# User 2: {'id': 2, 'email': 'user2@example.com', 'name': 'User Two'}
# Nonexistent user: None
Solution to Exercise 4: TypedDict
from typing import List, TypedDict
class Book(TypedDict):
"""Structure for book information."""
title: str
authors: List[str]
year: int
publisher: str
def format_book_citation(book: Book) -> str:
"""Format a book for citation."""
authors = ", ".join(book["authors"])
return f"{authors}. ({book['year']}). {book['title']}. {book['publisher']}."
# Test the function
python_book: Book = {
"title": "Python Programming: A Modern Approach",
"authors": ["Smith, John", "Doe, Jane"],
"year": 2023,
"publisher": "Tech Books Publishing"
}
citation = format_book_citation(python_book)
print(f"Citation: {citation}")
# Output:
# Citation: Smith, John, Doe, Jane. (2023). Python Programming: A Modern Approach. Tech Books Publishing.
Solution to Exercise 5: Function Types
from typing import Dict, List, Callable, Any
ValidationRule = Callable[[Any], bool]
Validator = Callable[[Dict[str, Any]], List[str]]
def create_data_validator(
validation_rules: Dict[str, ValidationRule]
) -> Validator:
"""Create a validator function based on rules."""
def validator(data: Dict[str, Any]) -> List[str]:
"""Validate data against rules."""
errors: List[str] = []
for field, rule in validation_rules.items():
if field not in data:
errors.append(f"Missing field: {field}")
elif not rule(data[field]):
errors.append(f"Invalid value for {field}: {data[field]}")
return errors
return validator
# Test the function
def is_positive_number(value: Any) -> bool:
"""Check if value is a positive number."""
return isinstance(value, (int, float)) and value > 0
def is_valid_email(value: Any) -> bool:
"""Simple check if value looks like an email."""
return isinstance(value, str) and "@" in value
# Create a validator for a user record
user_validator = create_data_validator({
"id": is_positive_number,
"email": is_valid_email
})
# Test with valid data
valid_user = {"id": 1, "email": "user@example.com"}
print(f"Valid user errors: {user_validator(valid_user)}")
# Test with invalid data
invalid_user = {"id": -1, "email": "not_an_email"}
print(f"Invalid user errors: {user_validator(invalid_user)}")
# Test with missing field
incomplete_user = {"id": 1}
print(f"Incomplete user errors: {user_validator(incomplete_user)}")
# Output:
# Valid user errors: []
# Invalid user errors: ['Invalid value for id: -1', 'Invalid value for email: not_an_email']
# Incomplete user errors: ['Missing field: email']
7.11 Chapter Summary and Connection to the Next Chapter
In this chapter, we’ve learned how to use Python’s type annotations to improve code quality and catch errors early. We covered:
- Basic Type Annotation Syntax - How to add types to variables, functions, and return values
- Collection Types - How to specify types for lists, dictionaries, and other collections
- Optional and Union Types - How to handle variables that might be None or of different types
- TypedDict and NamedTuple - How to create structured types for dictionaries and tuples
- Type Checking with Pyright - How to verify type correctness in our code
Adding static typing to our Python code provides several benefits for data engineering projects:
- It helps catch type-related errors before runtime
- It makes code more self-documenting and easier to understand
- It enables better IDE support with more accurate autocomplete
- It makes refactoring safer and more predictable
In the next chapter, we’ll build on these concepts as we explore Data Engineering Code Quality in Chapter 8. We’ll learn about tools and practices for maintaining high-quality, reliable, and secure data engineering code, including:
- Code formatting with black
- Linting with ruff
- Security considerations
- Pre-commit hooks
- Documentation standards
- Git best practices for data engineering
The type safety practices you’ve learned in this chapter will integrate perfectly with these code quality tools to create a comprehensive approach to writing robust, maintainable data engineering code.