04 - Web Integration and APIs
Complexity: Moderate (M)
4.0 Introduction: Why This Matters for Data Engineering
As a data engineer, your job often involves gathering data from various sources. While we’ve learned how to work with local files in previous chapters, the reality is that much of the world’s data lives on the web behind APIs (Application Programming Interfaces).
Being able to access and integrate data from web sources is a critical skill that enables you to:
- Pull data from cloud services and SaaS platforms
- Connect to public data sources like government databases and weather services
- Integrate with internal microservices within your organization
- Automate data collection that would otherwise require manual downloading
Let’s visualize how web APIs fit into a typical data engineering workflow:
flowchart TD A[External Data Sources] --> B[Web APIs] B --> C[Python API Client] C --> D{Error Handling} D -->|Error| E[Error Logs] D -->|Success| F[Data Transformation] F --> G[Local Storage or Database] G --> H[Analysis and Reporting] classDef external fill:#f9f9f9,stroke:#333,stroke-width:2px classDef processing fill:#d0e0ff,stroke:#336,stroke-width:1px classDef errors fill:#ffdddd,stroke:#633,stroke-width:1px classDef storage fill:#ddffdd,stroke:#363,stroke-width:1px class A external class B,C,F processing class D,E errors class G,H storage
In this chapter, we’ll learn how to:
- Understand the fundamentals of HTTP and RESTful APIs
- Use the requests library to communicate with web APIs
- Process and parse JSON data from API responses
- Handle errors that occur during API communication
- Implement basic security practices when working with APIs
These skills will build directly on the Python fundamentals and error handling techniques from Chapters 1-2, and will prepare you for working with more complex data integration scenarios in later chapters.
4.1 HTTP and REST Fundamentals
Before diving into code, let’s understand the foundational concepts of web communication.
4.1.1 What is HTTP?
HTTP (Hypertext Transfer Protocol) is the protocol that powers the web. It defines how messages are formatted and transmitted between clients (like your Python script) and servers (like an API provider).
The HTTP protocol works through a request-response cycle:
- A client sends a request to a server
- The server processes the request
- The server sends back a response
Each HTTP request includes:
- A method (verb) indicating the desired action
- A URL (endpoint) specifying the resource location
- Headers containing metadata about the request
- Optionally, a body containing data to send to the server
4.1.2 HTTP Methods
HTTP defines several methods (or verbs) that indicate what action you want to perform:
# Common HTTP methods and their purposes
http_methods = {
'GET': 'Retrieve data from a specified resource',
'POST': 'Submit data to be processed to a specified resource',
'PUT': 'Update a specified resource with the provided data',
'DELETE': 'Delete a specified resource',
'PATCH': 'Partially update a resource',
'HEAD': 'Same as GET but retrieves only headers (no body)',
'OPTIONS': 'Returns the HTTP methods that the server supports'
}
# Print each method and its purpose
for method, purpose in http_methods.items():
print(f"{method}: {purpose}")
# Output:
# GET: Retrieve data from a specified resource
# POST: Submit data to be processed to a specified resource
# PUT: Update a specified resource with the provided data
# DELETE: Delete a specified resource
# PATCH: Partially update a resource
# HEAD: Same as GET but retrieves only headers (no body)
# OPTIONS: Returns the HTTP methods that the server supports
As data engineers, we’ll primarily use GET to retrieve data and occasionally POST to send data or authentication information.
4.1.3 HTTP Status Codes
When a server responds to an HTTP request, it includes a status code that indicates the outcome:
# Common HTTP status codes and their meanings
status_codes = {
# 2xx: Success
200: 'OK - The request was successful',
201: 'Created - The request was successful and a resource was created',
204: 'No Content - The request was successful but there is no data to return',
# 3xx: Redirection
301: 'Moved Permanently - The resource has moved to a new URL',
302: 'Found - The resource is temporarily at a different URL',
# 4xx: Client Error
400: 'Bad Request - The server could not understand the request',
401: 'Unauthorized - Authentication is required and has failed',
403: 'Forbidden - The server understood but refuses the request',
404: 'Not Found - The requested resource could not be found',
429: 'Too Many Requests - Rate limit exceeded',
# 5xx: Server Error
500: 'Internal Server Error - The server encountered an unexpected condition',
502: 'Bad Gateway - The server received an invalid response from an upstream server',
503: 'Service Unavailable - The server is temporarily unavailable'
}
# Print each status code and its meaning
print("Important HTTP Status Codes:")
for code in sorted([200, 201, 400, 401, 403, 404, 429, 500, 503]):
print(f"{code}: {status_codes[code]}")
# Output:
# Important HTTP Status Codes:
# 200: OK - The request was successful
# 201: Created - The request was successful and a resource was created
# 400: Bad Request - The server could not understand the request
# 401: Unauthorized - Authentication is required and has failed
# 403: Forbidden - The server understood but refuses the request
# 404: Not Found - The requested resource could not be found
# 429: Too Many Requests - Rate limit exceeded
# 500: Internal Server Error - The server encountered an unexpected condition
# 503: Service Unavailable - The server is temporarily unavailable
Understanding these status codes will help you debug issues with API requests and implement proper error handling.
4.1.4 What is REST?
REST (Representational State Transfer) is an architectural style for designing networked applications. RESTful APIs use HTTP requests to perform CRUD (Create, Read, Update, Delete) operations on resources.
Key principles of RESTful design include:
- Stateless: Each request contains all the information needed to complete it
- Resource-based: APIs are organized around resources (e.g., users, products)
- Standard methods: Uses standard HTTP methods for operations
- Representations: Resources can have multiple representations (JSON, XML, etc.)
# Example of a RESTful API structure
weather_api = {
'base_url': 'https://api.weather.example',
'resources': {
'current': '/v1/current?location={city}',
'forecast': '/v1/forecast?location={city}&days={days}',
'historical': '/v1/historical?location={city}&date={date}'
},
'methods': {
'current': 'GET',
'forecast': 'GET',
'historical': 'GET'
}
}
# Show how you might access these endpoints
city = "New York"
print(f"To get current weather: {weather_api['methods']['current']} {weather_api['base_url']}{weather_api['resources']['current'].format(city=city)}")
print(f"To get 5-day forecast: {weather_api['methods']['forecast']} {weather_api['base_url']}{weather_api['resources']['forecast'].format(city=city, days=5)}")
# Output:
# To get current weather: GET https://api.weather.example/v1/current?location=New York
# To get 5-day forecast: GET https://api.weather.example/v1/forecast?location=New York&days=5
Now that we understand the fundamentals of HTTP and REST, let’s learn how to interact with APIs using Python.
4.2 Working with the requests Library
The requests
library is the standard for making HTTP requests in Python. It provides a simple, elegant interface for sending HTTP requests and handling responses.
4.2.1 Installing the requests Library
First, let’s make sure the requests library is installed:
# This would typically be done at the command line, but for completeness:
# pip install requests
# Let's verify the installation by importing requests
import requests
print(f"Successfully imported requests library (version: {requests.__version__})")
# Output:
# Successfully imported requests library (version: 2.28.1) # Your version may differ
4.2.2 Making Your First GET Request
Let’s start by making a simple GET request to a public API that doesn’t require authentication:
import requests
# Make a GET request to a public API that returns JSON data
response = requests.get('https://jsonplaceholder.typicode.com/todos/1')
# Print the response details
print(f"Response status code: {response.status_code}")
print(f"Response headers: {dict(response.headers)}")
print(f"Response content type: {response.headers.get('Content-Type')}")
print(f"Response body: {response.text}")
# Output:
# Response status code: 200
# Response headers: {'Date': 'Mon, 01 May 2023 12:34:56 GMT', 'Content-Type': 'application/json; charset=utf-8', ...}
# Response content type: application/json; charset=utf-8
# Response body: {
# "userId": 1,
# "id": 1,
# "title": "delectus aut autem",
# "completed": false
# }
The response object contains all the information about the server’s response, including status code, headers, and the response body.
4.2.3 Working with Response Objects
The requests
library provides several convenient properties and methods for working with responses:
import requests
# Make a request to GitHub's API to get information about Python
response = requests.get('https://api.github.com/search/repositories?q=language:python&sort=stars&per_page=5')
# Check if the request was successful
if response.status_code == 200:
print("Request successful!")
# Access the response body in different ways
print(f"Response as text: {response.text[:100]}...") # First 100 chars
# Parse JSON response
data = response.json() # Convert JSON response to Python dictionary
# Access specific data from the response
print(f"Total Python repositories: {data['total_count']}")
print("\nTop 5 Python repositories:")
for i, repo in enumerate(data['items'], 1):
print(f"{i}. {repo['name']} - ⭐ {repo['stargazers_count']} - {repo['html_url']}")
else:
print(f"Request failed with status code: {response.status_code}")
# Output:
# Request successful!
# Response as text: {"total_count":5854786,"incomplete_results":false,"items":[{"id":83222441,"node_id":"MDEwOlJlcG9za...
# Total Python repositories: 5854786
#
# Top 5 Python repositories:
# 1. system-design-primer - ⭐ 209412 - https://github.com/donnemartin/system-design-primer
# 2. public-apis - ⭐ 201283 - https://github.com/public-apis/public-apis
# 3. Python - ⭐ 165280 - https://github.com/TheAlgorithms/Python
# 4. Python-100-Days - ⭐ 127767 - https://github.com/jackfrued/Python-100-Days
# 5. youtube-dl - ⭐ 115801 - https://github.com/ytdl-org/youtube-dl
The .json()
method is especially useful because it automatically parses the JSON response into Python data structures. This builds on our JSON knowledge from Chapter 2.
4.2.4 Passing Query Parameters
Many APIs require query parameters to filter or customize the response. The requests
library makes it easy to include these parameters:
import requests
# Define parameters as a dictionary
params = {
'q': 'data engineering', # Search query
'sort': 'stars', # Sort by stars
'per_page': 3 # Limit to 3 results
}
# Make a request with parameters
response = requests.get('https://api.github.com/search/repositories', params=params)
# Show the actual URL that was requested
print(f"Request URL: {response.url}")
# Parse the response
if response.status_code == 200:
data = response.json()
print(f"Found {data['total_count']} repositories related to 'data engineering'")
print("\nTop 3 results:")
for i, repo in enumerate(data['items'], 1):
print(f"{i}. {repo['name']} by {repo['owner']['login']}")
print(f" Description: {repo['description']}")
print(f" Stars: {repo['stargazers_count']}")
print()
else:
print(f"Request failed with status code: {response.status_code}")
# Output:
# Request URL: https://api.github.com/search/repositories?q=data+engineering&sort=stars&per_page=3
# Found 29145 repositories related to 'data engineering'
#
# Top 3 results:
# 1. data-engineering-zoomcamp by DataTalksClub
# Description: Free Data Engineering course!
# Stars: 12574
#
# 2. analytics-engineering-handbook by getdbt
# Description: A handbook and crash course on Analytics Engineering. Fork and contribute!
# Stars: 3102
#
# 3. the-data-engineering-handbook by dataeducation
# Description: A handbook on Data Engineering covering fundamentals, architecture, modern data stack, data mesh and much more!
# Stars: 2564
Using the params
argument is not only more convenient but also safer than manually constructing URLs, as it properly encodes the parameters.
4.2.5 Working with HTTP Headers
HTTP headers provide additional information about the request or response. Some APIs require specific headers for authentication, content type specification, or other purposes:
import requests
# Define custom headers
headers = {
'User-Agent': 'Python-Requests/Data-Engineering-Tutorial',
'Accept': 'application/json'
}
# Make a request with custom headers
response = requests.get('https://api.github.com/users/python', headers=headers)
# Print the headers we sent
print(f"Request headers: {headers}")
# Print response details
if response.status_code == 200:
user_data = response.json()
print(f"\nGitHub user: {user_data['login']}")
print(f"Name: {user_data['name']}")
print(f"Location: {user_data['location']}")
print(f"Followers: {user_data['followers']}")
else:
print(f"Request failed with status code: {response.status_code}")
# Output:
# Request headers: {'User-Agent': 'Python-Requests/Data-Engineering-Tutorial', 'Accept': 'application/json'}
#
# GitHub user: python
# Name: Python
# Location: Netherlands
# Followers: 13528
4.2.6 Making POST Requests
While GET requests are used to retrieve data, POST requests are commonly used to submit data to an API:
import requests
import json
# URL for a service that echoes back what you POST to it
url = 'https://httpbin.org/post'
# Data to send
data = {
'name': 'Data Engineering Project',
'description': 'Learning to use APIs with Python',
'values': [1, 2, 3, 4, 5]
}
# Make the POST request
# We need to convert our Python dictionary to a JSON string
response = requests.post(
url,
data=json.dumps(data),
headers={'Content-Type': 'application/json'}
)
# Check response
print(f"Response status code: {response.status_code}")
# Parse response
if response.status_code == 200:
response_data = response.json()
print(f"\nServer received our data:")
# httpbin.org returns what we sent in the 'data' field
received_data = json.loads(response_data['data'])
print(f"Name: {received_data['name']}")
print(f"Description: {received_data['description']}")
print(f"Values: {received_data['values']}")
else:
print(f"Request failed with status code: {response.status_code}")
# Output:
# Response status code: 200
#
# Server received our data:
# Name: Data Engineering Project
# Description: Learning to use APIs with Python
# Values: [1, 2, 3, 4, 5]
Instead of using json.dumps()
and setting the Content-Type header manually, we can use the more convenient json
parameter:
import requests
url = 'https://httpbin.org/post'
# Data to send
data = {
'name': 'Data Engineering Project',
'description': 'Learning to use APIs with Python',
'values': [1, 2, 3, 4, 5]
}
# Make the POST request with the json parameter
response = requests.post(url, json=data)
# Check response
if response.status_code == 200:
response_data = response.json()
print(f"Request was successful. Server received:")
print(f"JSON data: {response_data['json']}")
else:
print(f"Request failed with status code: {response.status_code}")
# Output:
# Request was successful. Server received:
# JSON data: {'name': 'Data Engineering Project', 'description': 'Learning to use APIs with Python', 'values': [1, 2, 3, 4, 5]}
Using the json
parameter automatically:
- Serializes the data as JSON
- Sets the Content-Type header to ‘application/json’
- Handles the encoding of the data
4.3 Parsing API Data
In Chapter 2, we learned about working with JSON data. APIs commonly return data in JSON format, which we need to parse and process for our data engineering pipelines.
4.3.1 Working with JSON API Responses
Let’s explore how to work with more complex JSON responses:
import requests
# Make a request to an API that returns complex nested JSON
response = requests.get('https://api.github.com/repos/python/cpython/contributors?per_page=5')
if response.status_code == 200:
contributors = response.json()
print(f"Top {len(contributors)} contributors to CPython:")
# Process the list of contributors
for i, contributor in enumerate(contributors, 1):
# Extract specific fields from each contributor
username = contributor['login']
contributions = contributor['contributions']
profile_url = contributor['html_url']
print(f"{i}. {username} - {contributions} contributions - {profile_url}")
# Calculate total contributions from these top contributors
total_contributions = sum(c['contributions'] for c in contributors)
print(f"\nTotal contributions from top {len(contributors)}: {total_contributions}")
# Find the contributor with the most contributions
most_contributions = max(contributors, key=lambda c: c['contributions'])
print(f"Most prolific contributor: {most_contributions['login']} with {most_contributions['contributions']} contributions")
else:
print(f"Request failed with status code: {response.status_code}")
# Output:
# Top 5 contributors to CPython:
# 1. vstinner - 3987 contributions - https://github.com/vstinner
# 2. erlendaasland - 1926 contributions - https://github.com/erlendaasland
# 3. AlexWaygood - 1593 contributions - https://github.com/AlexWaygood
# 4. pablogsal - 1443 contributions - https://github.com/pablogsal
# 5. iritkatriel - 1347 contributions - https://github.com/iritkatriel
#
# Total contributions from top 5: 10296
# Most prolific contributor: vstinner with 3987 contributions
This example demonstrates how to:
- Parse a JSON array response into a Python list
- Extract fields from nested structures
- Process the data using list comprehension (from Chapter 2)
- Find specific information within the dataset
4.3.2 Handling Pagination
Many APIs limit the amount of data returned in a single response, using pagination to split results across multiple requests. Let’s see how to handle this:
import requests
def get_all_pages(base_url, params=None, max_pages=5):
"""
Fetch all pages of results from a paginated API.
Args:
base_url (str): The API endpoint URL
params (dict): Query parameters
max_pages (int): Maximum number of pages to fetch
Returns:
list: All items from all pages
"""
if params is None:
params = {}
# Ensure we have a per_page parameter
if 'per_page' not in params:
params['per_page'] = 10
page = 1
all_items = []
while page <= max_pages:
# Update params with current page
params['page'] = page
print(f"Fetching page {page}...")
response = requests.get(base_url, params=params)
if response.status_code == 200:
# Parse the current page of results
page_items = response.json()
# If we got an empty list, we've run out of items
if not page_items:
break
print(f" Received {len(page_items)} items")
all_items.extend(page_items)
page += 1
else:
print(f" Error fetching page {page}: Status code {response.status_code}")
break
return all_items
# Use our function to get multiple pages of Python repositories
url = 'https://api.github.com/users/python/repos'
all_repos = get_all_pages(url, params={'per_page': 5}, max_pages=3)
print(f"\nFetched a total of {len(all_repos)} repositories")
print("\nRepository names:")
for repo in all_repos:
print(f"- {repo['name']}")
# Output:
# Fetching page 1...
# Received 5 items
# Fetching page 2...
# Received 5 items
# Fetching page 3...
# Received 5 items
#
# Fetched a total of 15 repositories
#
# Repository names:
# - bedevere
# - core-workflow
# - cpython-emailer-webhook
# - cpython-builder-docker
# - python-docs-theme
# - psf-salt
# - pypi-salt
# - community-starter-kit
# - peps
# - python-docs-tr
# - pythondotorg
# - docsbuild-scripts
# - typing-council
# - schemapi
# - gh-migration
This function handles page-by-page fetching, which is a common pattern when working with APIs that return large result sets.
4.3.3 Working with Different API Response Formats
While JSON is the most common format, some APIs may return other formats like XML. In those cases, you might need additional libraries:
import requests
import xml.etree.ElementTree as ET
# A very simple example with an API that can return XML
url = 'https://httpbin.org/xml'
# Request XML data
response = requests.get(url)
if response.status_code == 200:
# Parse XML content
root = ET.fromstring(response.text)
# Find specific elements (using XPath)
title = root.find('.//title').text
slide_title = root.find('.//slide/title').text
slide_type = root.find('.//slide').attrib.get('type')
print(f"Document title: {title}")
print(f"Slide title: {slide_title}")
print(f"Slide type: {slide_type}")
# Count all items
items = root.findall('.//item')
print(f"\nFound {len(items)} items in the document:")
for i, item in enumerate(items, 1):
print(f"{i}. {item.text}")
else:
print(f"Request failed with status code: {response.status_code}")
# Output:
# Document title: Sample Slide Show
# Slide title: Wake up to WonderWidgets!
# Slide type: all
#
# Found 6 items in the document:
# 1. Why WonderWidgets are great
# 2. Who buys WonderWidgets
# 3. Overview
# 4. Sales Forecast
# 5. Cost of Ownership
# 6. Where to find more information
While XML responses are less common in modern APIs, it’s good to be aware that not all APIs use JSON, and you may need to adapt your parsing strategy accordingly.
4.4 Basic Error Handling for APIs
When working with APIs, many things can go wrong: the server might be down, you might hit rate limits, or your request might be invalid. Proper error handling is essential for building robust data pipelines.
4.4.1 Handling HTTP Error Status Codes
Let’s implement basic error handling for HTTP status codes:
import requests
def get_repository_info(owner, repo):
"""
Get information about a GitHub repository with error handling.
Args:
owner (str): The repository owner
repo (str): The repository name
Returns:
dict: Repository information or error details
"""
url = f"https://api.github.com/repos/{owner}/{repo}"
try:
response = requests.get(url)
# Check for successful response
if response.status_code == 200:
data = response.json()
return {
'success': True,
'data': {
'name': data['name'],
'description': data['description'],
'stars': data['stargazers_count'],
'forks': data['forks_count'],
'open_issues': data['open_issues_count'],
'created_at': data['created_at']
}
}
# Handle common error status codes
elif response.status_code == 404:
return {
'success': False,
'error': f"Repository '{owner}/{repo}' not found"
}
elif response.status_code == 403:
return {
'success': False,
'error': "API rate limit exceeded"
}
else:
return {
'success': False,
'error': f"API request failed with status code: {response.status_code}"
}
except requests.exceptions.RequestException as e:
# Handle network/connection errors
return {
'success': False,
'error': f"Request failed: {str(e)}"
}
# Test with valid and invalid repositories
repos_to_check = [
('python', 'cpython'), # Valid repository
('python', 'nonexistent'), # Non-existent repository
('invaliduser123456', 'repo') # Invalid user
]
for owner, repo in repos_to_check:
print(f"\nChecking {owner}/{repo}:")
result = get_repository_info(owner, repo)
if result['success']:
data = result['data']
print(f"✅ Repository found:")
print(f" Name: {data['name']}")
print(f" Description: {data['description']}")
print(f" Stars: {data['stars']}")
print(f" Created: {data['created_at']}")
else:
print(f"❌ Error: {result['error']}")
# Output:
# Checking python/cpython:
# ✅ Repository found:
# Name: cpython
# Description: The Python programming language
# Stars: 49823
# Created: 2017-02-10T18:51:46Z
#
# Checking python/nonexistent:
# ❌ Error: Repository 'python/nonexistent' not found
#
# Checking invaliduser123456/repo:
# ❌ Error: Repository 'invaliduser123456/repo' not found
This function demonstrates how to:
- Make a request and check the status code
- Handle specific error cases differently
- Use a try/except block to catch network-level errors
- Return structured results that contain success/failure information
4.4.2 Implementing Retry Logic
APIs can sometimes fail temporarily due to network issues or server load. Retry logic helps make your code more resilient:
import requests
import time
def request_with_retries(url, max_retries=3, backoff_factor=1):
"""
Make a request with retry logic for transient failures.
Args:
url (str): The URL to request
max_retries (int): Maximum number of retry attempts
backoff_factor (int): Multiplier for the delay between retries
Returns:
Response object or None if all retries fail
"""
retries = 0
while retries <= max_retries:
try:
print(f"Attempt {retries + 1}/{max_retries + 1}...")
response = requests.get(url, timeout=5)
# If we get a successful response, return it
if response.status_code == 200:
return response
# If we get a 429 (Too Many Requests), retry after waiting
elif response.status_code == 429:
retry_after = int(response.headers.get('Retry-After', 1))
print(f"Rate limited. Waiting {retry_after} seconds...")
time.sleep(retry_after)
retries += 1
continue
# For server errors (5xx), retry with backoff
elif 500 <= response.status_code < 600:
print(f"Server error: {response.status_code}. Retrying...")
retries += 1
if retries <= max_retries:
# Exponential backoff: 1s, 2s, 4s, ...
wait_time = backoff_factor * (2 ** (retries - 1))
print(f"Waiting {wait_time} seconds before next attempt...")
time.sleep(wait_time)
continue
# For other status codes, don't retry
else:
print(f"Request failed with status code: {response.status_code}")
return response
except requests.exceptions.Timeout:
print("Request timed out")
retries += 1
if retries <= max_retries:
wait_time = backoff_factor * (2 ** (retries - 1))
print(f"Waiting {wait_time} seconds before next attempt...")
time.sleep(wait_time)
continue
except requests.exceptions.ConnectionError:
print("Connection error")
retries += 1
if retries <= max_retries:
wait_time = backoff_factor * (2 ** (retries - 1))
print(f"Waiting {wait_time} seconds before next attempt...")
time.sleep(wait_time)
continue
except requests.exceptions.RequestException as e:
print(f"Request failed: {str(e)}")
return None
print("All retry attempts failed")
return None
# Test with a URL that should work
response = request_with_retries('https://api.github.com/zen')
if response:
print(f"\nSuccess! Response: {response.text}")
else:
print("\nFailed to get a response after retries")
# Output:
# Attempt 1/4...
# Success! Response: Favor focus over features.
Retry logic is essential for production environments where network reliability can’t be guaranteed and external services may have occasional outages.
4.4.3 Handling Rate Limits
Many APIs implement rate limiting to prevent abuse. Let’s see how to handle and respect these limits:
import requests
import time
def check_rate_limits():
"""Check GitHub API rate limits and display usage."""
response = requests.get('https://api.github.com/rate_limit')
if response.status_code == 200:
limits = response.json()
core = limits['resources']['core']
search = limits['resources']['search']
# Calculate reset times
core_reset = time.strftime('%H:%M:%S', time.localtime(core['reset']))
search_reset = time.strftime('%H:%M:%S', time.localtime(search['reset']))
print("GitHub API Rate Limits:")
print(f"Core: {core['remaining']}/{core['limit']} requests remaining (resets at {core_reset})")
print(f"Search: {search['remaining']}/{search['limit']} requests remaining (resets at {search_reset})")
return limits
else:
print(f"Failed to check rate limits: {response.status_code}")
return None
# Function that respects rate limits
def make_rate_limited_request(url, headers=None):
"""
Make a request that respects rate limits.
Args:
url (str): The URL to request
headers (dict): Optional headers
Returns:
Response object or None if rate limited
"""
# Check rate limits first
limits = check_rate_limits()
if limits and limits['resources']['core']['remaining'] > 0:
print("\nMaking request...")
response = requests.get(url, headers=headers)
# Check if we got rate limited despite our check
if response.status_code == 403 and 'rate limit exceeded' in response.text.lower():
reset_time = int(response.headers.get('X-RateLimit-Reset', 0))
wait_time = max(reset_time - time.time(), 0)
print(f"Rate limit exceeded. Reset in {wait_time:.0f} seconds.")
return None
return response
else:
print("\nRate limit already exceeded. Try again later.")
return None
# Check current rate limits
check_rate_limits()
# Make a request that respects rate limits
response = make_rate_limited_request('https://api.github.com/users/python')
if response:
print(f"Request successful! Status code: {response.status_code}")
# Check updated rate limits
check_rate_limits()
# Output:
# GitHub API Rate Limits:
# Core: 58/60 requests remaining (resets at 13:45:17)
# Search: 10/10 requests remaining (resets at 13:45:17)
#
# Making request...
# Request successful! Status code: 200
# GitHub API Rate Limits:
# Core: 57/60 requests remaining (resets at 13:45:17)
# Search: 10/10 requests remaining (resets at 13:45:17)
Understanding and respecting rate limits is crucial for sustainable API usage, especially in production data pipelines that might run frequently.
4.5 API Security Considerations
Security is a critical aspect of working with APIs, especially those that require authentication or handle sensitive data.
4.5.1 Authentication Methods
APIs use various authentication methods to verify identity and control access:
import requests
import os
# Common API authentication methods
auth_methods = {
'API Key': 'A simple key string sent in headers, query params, or request body',
'Basic Auth': 'Base64-encoded username and password sent in an Authorization header',
'Bearer Token': 'A token (like JWT) sent in an Authorization header',
'OAuth': 'A protocol for secure authorization, often used for third-party access',
'API Key + Secret': 'A combination of public key and private secret, used for signing requests'
}
print("Common API Authentication Methods:")
for method, description in auth_methods.items():
print(f"{method}: {description}")
# The following examples show how to use each method but don't include actual credentials
# In a real environment, these would be stored securely (not in your code)
# Example: API Key in query parameters
api_key = "YOUR_API_KEY" # This would come from a secure source
weather_url = f"https://api.weatherapi.com/v1/current.json?key={api_key}&q=London"
print(f"\nAPI Key example URL: {weather_url}")
# Example: API Key in header
headers_with_key = {
"X-API-Key": "YOUR_API_KEY" # This would come from a secure source
}
print(f"Headers with API Key: {headers_with_key}")
# Example: Basic Authentication
from requests.auth import HTTPBasicAuth
basic_auth = HTTPBasicAuth('username', 'password') # These would come from a secure source
print(f"Basic Auth object: {basic_auth}")
# Example: Bearer Token
token = "YOUR_ACCESS_TOKEN" # This would come from a secure source
headers_with_token = {
"Authorization": f"Bearer {token}"
}
print(f"Headers with Bearer token: {headers_with_token}")
# Output:
# Common API Authentication Methods:
# API Key: A simple key string sent in headers, query params, or request body
# Basic Auth: Base64-encoded username and password sent in an Authorization header
# Bearer Token: A token (like JWT) sent in an Authorization header
# OAuth: A protocol for secure authorization, often used for third-party access
# API Key + Secret: A combination of public key and private secret, used for signing requests
#
# API Key example URL: https://api.weatherapi.com/v1/current.json?key=YOUR_API_KEY&q=London
# Headers with API Key: {'X-API-Key': 'YOUR_API_KEY'}
# Basic Auth object: <requests.auth.HTTPBasicAuth object at 0x7f8a6b4f2d90>
# Headers with Bearer token: {'Authorization': 'Bearer YOUR_ACCESS_TOKEN'}
4.5.2 Secure Credential Management
Never hardcode credentials in your scripts! Here’s a better approach:
import os
import requests
def get_api_credentials(source='environment'):
"""
Get API credentials from a secure source.
Args:
source (str): Where to get credentials from ('environment', 'file', etc.)
Returns:
dict: Dictionary containing credentials
"""
if source == 'environment':
# Get credentials from environment variables
# In a real scenario, these would be set in your environment
api_key = os.environ.get('API_KEY', 'demo_key_for_educational_purposes')
return {'api_key': api_key}
elif source == 'file':
# Read from a secure file (not in version control)
# In a real scenario, this file would be .gitignore'd and have restricted permissions
try:
with open('.api_credentials.txt', 'r') as f:
api_key = f.read().strip()
return {'api_key': api_key}
except:
print("Couldn't read credentials file")
return {'api_key': 'demo_key_for_educational_purposes'}
else:
# In production, you might use a secrets manager service
# For this example, we'll just use a demo key
return {'api_key': 'demo_key_for_educational_purposes'}
# Get credentials
credentials = get_api_credentials()
print(f"Got credentials: {credentials}")
# Use credentials in a request
def make_secure_request(url, credentials):
"""Make a request using credentials from a secure source."""
params = {'api_key': credentials['api_key']}
return requests.get(url, params=params)
# NASA APOD API - works with demo key (but has strict rate limits)
response = make_secure_request('https://api.nasa.gov/planetary/apod', credentials)
if response.status_code == 200:
data = response.json()
print(f"\nNASA Astronomy Picture of the Day:")
print(f"Title: {data['title']}")
print(f"Date: {data['date']}")
print(f"Explanation: {data['explanation'][:100]}...")
else:
print(f"\nRequest failed: {response.status_code}")
print(response.text)
# Output:
# Got credentials: {'api_key': 'demo_key_for_educational_purposes'}
#
# NASA Astronomy Picture of the Day:
# Title: A Partial Eclipse of the Flower Moon
# Date: 2023-05-01
# Explanation: You couldn't see this flower. The usually full flower moon was partially eclipsed during the early m...
4.5.3 HTTPS and Data Protection
Always use HTTPS URLs for API requests to ensure data is encrypted in transit:
import requests
# Compare HTTP vs HTTPS
http_url = 'http://example.com'
https_url = 'https://example.com'
def check_protocol(url):
"""Check if a URL uses HTTPS and is secure."""
protocol = url.split('://')[0]
is_secure = protocol.lower() == 'https'
print(f"URL: {url}")
print(f"Protocol: {protocol}")
print(f"Secure: {'✅ Yes' if is_secure else '❌ No'}")
if not is_secure:
print("⚠️ Warning: Unencrypted HTTP connection. Data may be intercepted.")
return is_secure
print("Checking protocols:")
check_protocol(http_url)
print("\n")
check_protocol(https_url)
# Demonstrate using sessions with SSL verification
session = requests.Session()
print("\nDefault SSL verification setting:", session.verify)
# Output:
# Checking protocols:
# URL: http://example.com
# Protocol: http
# Secure: ❌ No
# ⚠️ Warning: Unencrypted HTTP connection. Data may be intercepted.
#
#
# URL: https://example.com
# Protocol: https
# Secure: ✅ Yes
#
# Default SSL verification setting: True
The requests
library verifies SSL certificates by default, which helps prevent man-in-the-middle attacks.
4.6 Micro-Project: Weather Data API Integrator
Now let’s put everything together in a complete micro-project that:
- Connects to a public weather API
- Retrieves weather data for multiple cities
- Processes the JSON responses
- Transforms the data into a structured format
- Saves the results to a CSV file
- Implements proper error handling
Project Requirements
Create a Python script that:
- Connects to the OpenWeatherMap API (free tier)
- Retrieves current weather data for 5-10 cities
- Extracts relevant information (temperature, humidity, weather conditions, etc.)
- Formats the data consistently
- Writes the processed data to a CSV file
- Handles API failures, rate limits, and other errors gracefully
Acceptance Criteria
- Successfully authenticates with the chosen API using secure key management
- Correctly handles API response codes and potential errors
- Processes JSON data into a structured format
- Implements at least one meaningful transformation of the raw data
- Creates a well-formatted CSV output file with proper headers
- Includes timeout handling and retry logic for failed requests
- Documentation includes instructions for obtaining and securely using an API key
Common Pitfalls and Solutions
Exposing API Keys in Code
- Problem: Hardcoding API keys in your scripts is a security risk, especially if you share your code.
- Solution: Store API keys in environment variables or a configuration file that isn’t committed to version control.
Hitting API Rate Limits
- Problem: Free tier APIs often have strict rate limits that can cause your requests to fail.
- Solution: Implement delayed requests and proper retry handling with exponential backoff.
API Structure Changes
- Problem: APIs can change their response structure without notice.
- Solution: Add validation checks for expected data structure and fallback options.
No Error Handling
- Problem: Simple scripts often ignore potential API errors.
- Solution: Implement comprehensive error handling for network issues, timeouts, and API errors.
How This Differs from Production-Grade Solutions
In a production environment, this solution would differ in several ways:
Configuration Management:
- Production: Would use a proper secrets manager or environment variables in a controlled environment.
- Micro-project: Uses a simple config file approach for educational purposes.
Retry Mechanisms:
- Production: Would include more sophisticated retry logic, possibly using libraries like tenacity.
- Micro-project: Uses a simple custom retry function.
Logging:
- Production: Would implement structured logging to files or centralized logging systems.
- Micro-project: Uses simple print statements.
Performance:
- Production: Might use async requests to fetch multiple cities in parallel.
- Micro-project: Uses synchronous requests for simplicity.
Monitoring:
- Production: Would include monitoring for API health, performance metrics, and alerting.
- Micro-project: Minimal monitoring via console output.
Data Validation:
- Production: Would have comprehensive schema validation and data quality checks.
- Micro-project: Basic validation of response structure.
Implementation
import requests
import csv
import time
import os
import json
from datetime import datetime
class WeatherDataIntegrator:
"""A class to fetch, process, and store weather data from OpenWeatherMap API."""
def __init__(self, config_file='weather_config.json'):
"""
Initialize the integrator with configuration.
Args:
config_file (str): Path to the configuration file
"""
self.config = self._load_config(config_file)
self.api_key = self._get_api_key()
self.base_url = "https://api.openweathermap.org/data/2.5/weather"
def _load_config(self, config_file):
"""Load configuration from a JSON file."""
# Create a default config if file doesn't exist
if not os.path.exists(config_file):
default_config = {
'api_key_source': 'file', # 'file', 'env', or 'direct'
'api_key_file': '.openweathermap_key.txt',
'api_key_env': 'OPENWEATHERMAP_API_KEY',
'api_key': '', # Only used if api_key_source is 'direct'
'units': 'metric', # 'metric', 'imperial', or 'standard'
'cities': [
'London,UK',
'New York,US',
'Tokyo,JP',
'Sydney,AU',
'Rio de Janeiro,BR'
],
'retry_attempts': 3,
'retry_delay': 2,
'timeout': 10
}
with open(config_file, 'w') as f:
json.dump(default_config, f, indent=4)
print(f"Created default configuration file: {config_file}")
print("Please update it with your API key before running.")
# For educational purposes, we'll create a dummy API key file
with open(default_config['api_key_file'], 'w') as f:
f.write('your_api_key_here')
print(f"Created placeholder API key file: {default_config['api_key_file']}")
print("Please replace the placeholder with your actual API key.")
# Load the configuration
try:
with open(config_file, 'r') as f:
return json.load(f)
except json.JSONDecodeError:
print(f"Error: {config_file} is not valid JSON")
return {}
except Exception as e:
print(f"Error loading configuration: {e}")
return {}
def _get_api_key(self):
"""Get the API key from the configured source."""
source = self.config.get('api_key_source', 'file')
if source == 'file':
key_file = self.config.get('api_key_file', '.openweathermap_key.txt')
try:
with open(key_file, 'r') as f:
return f.read().strip()
except Exception as e:
print(f"Error reading API key file: {e}")
return None
elif source == 'env':
env_var = self.config.get('api_key_env', 'OPENWEATHERMAP_API_KEY')
return os.environ.get(env_var)
elif source == 'direct':
return self.config.get('api_key', '')
return None
def fetch_weather(self, city):
"""
Fetch weather data for a single city with retry logic.
Args:
city (str): City name and optional country code (e.g., 'London,UK')
Returns:
dict: Weather data or None if failed
"""
if not self.api_key:
print("Error: API key not configured")
return None
params = {
'q': city,
'appid': self.api_key,
'units': self.config.get('units', 'metric')
}
retry_attempts = self.config.get('retry_attempts', 3)
retry_delay = self.config.get('retry_delay', 2)
timeout = self.config.get('timeout', 10)
for attempt in range(retry_attempts):
try:
print(f"Fetching weather for {city} (Attempt {attempt + 1}/{retry_attempts})...")
response = requests.get(
self.base_url,
params=params,
timeout=timeout
)
if response.status_code == 200:
print(f"✅ Successfully fetched data for {city}")
return response.json()
elif response.status_code == 404:
print(f"❌ City not found: {city}")
return None
elif response.status_code == 401:
print(f"❌ Invalid API key")
return None
elif response.status_code == 429:
print(f"⚠️ Rate limited. Waiting before retry...")
time.sleep(retry_delay * (attempt + 1)) # Exponential backoff
else:
print(f"⚠️ API returned status code {response.status_code}. Retrying...")
time.sleep(retry_delay)
except requests.exceptions.Timeout:
print(f"⚠️ Request timed out. Retrying...")
time.sleep(retry_delay)
except requests.exceptions.ConnectionError:
print(f"⚠️ Connection error. Retrying...")
time.sleep(retry_delay)
except Exception as e:
print(f"❌ Unexpected error: {e}")
return None
print(f"❌ Failed to fetch weather for {city} after {retry_attempts} attempts")
return None
def process_weather_data(self, weather_data):
"""
Extract and transform relevant information from weather data.
Args:
weather_data (dict): Raw weather data from API
Returns:
dict: Processed weather data with consistent format
"""
if not weather_data:
return None
try:
# Basic validation of expected structure
required_fields = ['main', 'weather', 'wind', 'name']
for field in required_fields:
if field not in weather_data:
print(f"❌ Missing required field: {field}")
return None
# Extract and transform data
weather = weather_data['weather'][0] # First weather condition
main = weather_data['main']
wind = weather_data['wind']
# Calculate additional metrics
# - Convert temperature to a "feels like" description
# - Add a wind chill or heat index based on conditions
feels_like = main.get('feels_like')
temp = main.get('temp')
if feels_like is not None and temp is not None:
temp_difference = feels_like - temp
if temp_difference < -2:
feels_like_description = "colder than it looks"
elif temp_difference > 2:
feels_like_description = "warmer than it looks"
else:
feels_like_description = "similar to actual temperature"
else:
feels_like_description = "unknown"
# Process wind data for cardinal direction
if 'deg' in wind:
directions = ["N", "NE", "E", "SE", "S", "SW", "W", "NW", "N"]
index = round(wind['deg'] / 45)
wind_direction = directions[index]
else:
wind_direction = "unknown"
# Create processed data structure
processed_data = {
'city': weather_data['name'],
'country': weather_data.get('sys', {}).get('country', 'Unknown'),
'latitude': weather_data.get('coord', {}).get('lat'),
'longitude': weather_data.get('coord', {}).get('lon'),
'description': weather['description'],
'temperature_c': main.get('temp'),
'feels_like_c': main.get('feels_like'),
'feels_like_description': feels_like_description,
'humidity_percent': main.get('humidity'),
'pressure_hPa': main.get('pressure'),
'wind_speed_mps': wind.get('speed'),
'wind_direction': wind_direction,
'cloudiness_percent': weather_data.get('clouds', {}).get('all'),
'timestamp': datetime.fromtimestamp(weather_data['dt']).strftime('%Y-%m-%d %H:%M:%S')
}
return processed_data
except KeyError as e:
print(f"❌ KeyError while processing data: {e}")
return None
except Exception as e:
print(f"❌ Error processing weather data: {e}")
return None
def fetch_all_cities(self):
"""
Fetch weather data for all configured cities.
Returns:
list: Processed weather data for each city
"""
cities = self.config.get('cities', [])
results = []
for city in cities:
raw_data = self.fetch_weather(city)
if raw_data:
processed_data = self.process_weather_data(raw_data)
if processed_data:
results.append(processed_data)
return results
def save_to_csv(self, data, filename='weather_data.csv'):
"""
Save processed weather data to a CSV file.
Args:
data (list): List of weather data dictionaries
filename (str): Output filename
Returns:
bool: True if successful, False otherwise
"""
if not data:
print("No data to save")
return False
try:
# Get field names from the first data item
fieldnames = data[0].keys()
with open(filename, 'w', newline='') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(data)
print(f"✅ Successfully saved data to {filename}")
return True
except Exception as e:
print(f"❌ Error saving to CSV: {e}")
return False
def run(self, output_file='weather_data.csv'):
"""
Run the complete process: fetch, process, and save weather data.
Args:
output_file (str): Path to the output CSV file
Returns:
bool: True if successful, False otherwise
"""
print(f"🌤️ Weather Data Integrator")
print(f"==========================")
print(f"Starting at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Units: {self.config.get('units', 'metric')}")
print(f"Cities: {', '.join(self.config.get('cities', []))}")
print(f"Output file: {output_file}")
print(f"==========================")
# Fetch and process data
weather_data = self.fetch_all_cities()
if not weather_data:
print("❌ No weather data was collected")
return False
# Save to CSV
success = self.save_to_csv(weather_data, output_file)
if success:
print(f"✅ Successfully collected weather data for {len(weather_data)} cities")
print(f"Data saved to: {output_file}")
# Print a sample of the data
if weather_data:
print("\nSample data from first city:")
sample = weather_data[0]
print(f"City: {sample['city']}, {sample['country']}")
print(f"Weather: {sample['description']}")
print(f"Temperature: {sample['temperature_c']}°C (Feels like: {sample['feels_like_c']}°C - {sample['feels_like_description']})")
print(f"Wind: {sample['wind_speed_mps']} m/s from {sample['wind_direction']}")
print(f"Humidity: {sample['humidity_percent']}%")
return True
else:
print("❌ Failed to save weather data")
return False
# Run the integrator if this script is executed directly
if __name__ == "__main__":
integrator = WeatherDataIntegrator()
integrator.run()
How to Run and Test the Solution
Save the above code to a file (e.g.,
weather_integrator.py
)Get an API key:
- Go to OpenWeatherMap and create a free account
- Navigate to the API Keys section of your account
- Copy your API key
Configure the API key:
- The first time you run the script, it will create a default configuration file (
weather_config.json
) and a placeholder API key file (.openweathermap_key.txt
) - Edit the
.openweathermap_key.txt
file and replace the placeholder text with your actual API key
- The first time you run the script, it will create a default configuration file (
Run the script:
python weather_integrator.py
Examine the output:
- The script will print detailed information about each step of the process
- After completion, check the generated CSV file (
weather_data.csv
)
To test different scenarios:
- Edit
weather_config.json
to change cities, units, or other settings - Try with an invalid API key to test error handling
- Try with a non-existent city to test error handling
- Add a city with special characters to test encoding handling
- Edit
4.7 Practice Exercises
To reinforce your understanding of APIs and web integration, try these exercises:
Exercise 1: Simple API Request
Write a script that:
- Makes a GET request to https://jsonplaceholder.typicode.com/posts
- Prints the total number of posts
- Finds the post with the most words in its body
- Prints the title and body of that post
Exercise 2: API with Query Parameters
Write a script that:
- Takes a search term as input (e.g., “python”)
- Makes a request to GitHub’s search API (https://api.github.com/search/repositories) with the search term
- Sorts the results by number of stars
- Prints the top 5 repositories with their names, descriptions, and star counts
Exercise 3: Error Handling
Modify your script from Exercise 2 to:
- Handle rate limiting by checking the X-RateLimit-Remaining header
- Implement retry logic if a request fails (up to 3 attempts)
- Print a helpful error message if the API returns an error status code
Exercise 4: Working with Multiple APIs
Create a script that:
- Fetches cryptocurrency price data from a public API (e.g., CoinGecko)
- Fetches exchange rate data from another API (e.g., ExchangeRate-API)
- Calculates the value of a specified cryptocurrency in different currencies
- Prints the results in a formatted table
Exercise 5: Data Transformation and Export
Write a script that:
- Fetches user data from https://jsonplaceholder.typicode.com/users
- Fetches all posts by each user
- Creates a report showing how many posts each user has made
- Exports the report to both CSV and JSON formats
4.8 Exercise Solutions
Here are the solutions to the practice exercises:
Solution to Exercise 1: Simple API Request
import requests
def analyze_posts():
"""
Fetch posts from JSONPlaceholder API and find the post with the longest body.
"""
print("Fetching posts from JSONPlaceholder API...")
response = requests.get('https://jsonplaceholder.typicode.com/posts')
if response.status_code != 200:
print(f"Error: API returned status code {response.status_code}")
return
# Parse the JSON response
posts = response.json()
# Print the total number of posts
print(f"Total number of posts: {len(posts)}")
# Find the post with the most words in its body
post_with_most_words = max(posts, key=lambda post: len(post['body'].split()))
# Count the words in the longest post
word_count = len(post_with_most_words['body'].split())
print(f"\nPost with the most words (ID: {post_with_most_words['id']}, Word count: {word_count}):")
print(f"Title: {post_with_most_words['title']}")
print(f"Body: {post_with_most_words['body']}")
# Run the function
analyze_posts()
# Output:
# Fetching posts from JSONPlaceholder API...
# Total number of posts: 100
#
# Post with the most words (ID: 100, Word count: 94):
# Title: at nam consequatur ea labore ea harum
# Body: cupiditate quo est a modi nesciunt soluta
# ipsa voluptas error itaque dicta in
# autem qui minus magnam et distinctio eum
# accusamus ratione error aut
Solution to Exercise 2: API with Query Parameters
import requests
def search_github_repositories(search_term):
"""
Search for repositories on GitHub and display top results.
Args:
search_term (str): Term to search for
"""
print(f"Searching GitHub for repositories matching '{search_term}'...")
# Define parameters
params = {
'q': search_term,
'sort': 'stars',
'order': 'desc',
'per_page': 5
}
# Make the request
response = requests.get('https://api.github.com/search/repositories', params=params)
# Check for successful response
if response.status_code != 200:
print(f"Error: API returned status code {response.status_code}")
print(response.text)
return
# Parse the response
data = response.json()
# Print the results
print(f"Found {data['total_count']} repositories matching '{search_term}'")
print("\nTop 5 repositories by star count:")
for i, repo in enumerate(data['items'], 1):
print(f"{i}. {repo['name']} by {repo['owner']['login']}")
print(f" Description: {repo['description'] or 'No description'}")
print(f" Stars: {repo['stargazers_count']:,}")
print(f" URL: {repo['html_url']}")
print()
# Get search term from user
search_term = input("Enter a search term for GitHub repositories: ")
search_github_repositories(search_term)
# Sample Output (for search term "data engineering"):
# Searching GitHub for repositories matching 'data engineering'...
# Found 29145 repositories matching 'data engineering'
#
# Top 5 repositories by star count:
# 1. data-engineering-zoomcamp by DataTalksClub
# Description: Free Data Engineering course!
# Stars: 12,574
# URL: https://github.com/DataTalksClub/data-engineering-zoomcamp
#
# 2. analytics-engineering-handbook by getdbt
# Description: A handbook and crash course on Analytics Engineering. Fork and contribute!
# Stars: 3,102
# URL: https://github.com/getdbt/analytics-engineering-handbook
#
# ...and so on for the remaining repositories
Solution to Exercise 3: Error Handling
import requests
import time
def search_github_repositories_with_retry(search_term):
"""
Search for repositories on GitHub with retry logic and rate limit handling.
Args:
search_term (str): Term to search for
"""
print(f"Searching GitHub for repositories matching '{search_term}'...")
# Define parameters
params = {
'q': search_term,
'sort': 'stars',
'order': 'desc',
'per_page': 5
}
# Retry configuration
max_retries = 3
retry_delay = 2 # seconds
# Make the request with retry logic
for attempt in range(max_retries):
try:
print(f"Attempt {attempt + 1}/{max_retries}...")
# Make the request
response = requests.get(
'https://api.github.com/search/repositories',
params=params,
timeout=10
)
# Check rate limit
remaining = int(response.headers.get('X-RateLimit-Remaining', 0))
print(f"Rate limit remaining: {remaining}")
if remaining == 0:
reset_time = int(response.headers.get('X-RateLimit-Reset', 0))
wait_time = max(reset_time - time.time(), 0)
print(f"Rate limit exceeded. Resets in {wait_time:.0f} seconds.")
if attempt < max_retries - 1 and wait_time < 60: # Only wait if it's a short time
print(f"Waiting for rate limit reset...")
time.sleep(wait_time + 1) # Add 1 second buffer
continue
else:
print("Rate limit wait time too long or out of retries.")
return
# Check for successful response
if response.status_code == 200:
# Parse the response
data = response.json()
# Print the results
print(f"Found {data['total_count']} repositories matching '{search_term}'")
print("\nTop 5 repositories by star count:")
for i, repo in enumerate(data['items'], 1):
print(f"{i}. {repo['name']} by {repo['owner']['login']}")
print(f" Description: {repo['description'] or 'No description'}")
print(f" Stars: {repo['stargazers_count']:,}")
print(f" URL: {repo['html_url']}")
print()
return
elif response.status_code == 403 and 'rate limit exceeded' in response.text.lower():
print("Rate limit exceeded. Trying again later...")
if attempt < max_retries - 1:
time.sleep(retry_delay * (2 ** attempt)) # Exponential backoff
continue
else:
print("Out of retries. Please try again later.")
return
elif response.status_code == 404:
print(f"Error: Resource not found (404)")
return
elif response.status_code >= 500:
print(f"Server error: {response.status_code}. Retrying...")
if attempt < max_retries - 1:
time.sleep(retry_delay * (2 ** attempt)) # Exponential backoff
continue
else:
print("Out of retries. Please try again later.")
return
else:
print(f"Error: API returned status code {response.status_code}")
print(f"Response: {response.text}")
return
except requests.exceptions.Timeout:
print("Request timed out. Retrying...")
if attempt < max_retries - 1:
time.sleep(retry_delay)
continue
else:
print("Out of retries after timeouts.")
return
except requests.exceptions.ConnectionError:
print("Connection error. Retrying...")
if attempt < max_retries - 1:
time.sleep(retry_delay)
continue
else:
print("Out of retries after connection errors.")
return
except Exception as e:
print(f"Unexpected error: {str(e)}")
return
# Get search term from user
search_term = input("Enter a search term for GitHub repositories: ")
search_github_repositories_with_retry(search_term)
# Sample Output (for search term "python"):
# Searching GitHub for repositories matching 'python'...
# Attempt 1/3...
# Rate limit remaining: 9
# Found 5854786 repositories matching 'python'
#
# Top 5 repositories by star count:
# 1. system-design-primer by donnemartin
# Description: Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.
# Stars: 209,412
# URL: https://github.com/donnemartin/system-design-primer
#
# ... and so on for the remaining repositories
Solution to Exercise 4: Working with Multiple APIs
import requests
from tabulate import tabulate # You might need to pip install tabulate
def get_crypto_exchange_rates(crypto_id='bitcoin', currencies=None):
"""
Get cryptocurrency prices and convert to multiple currencies.
Args:
crypto_id (str): ID of the cryptocurrency to fetch
currencies (list): List of currency codes to convert to
Returns:
dict: Exchange rates for the cryptocurrency
"""
if currencies is None:
currencies = ['USD', 'EUR', 'GBP', 'JPY', 'CAD']
# Step 1: Fetch cryptocurrency price in USD from CoinGecko API
print(f"Fetching {crypto_id} price data from CoinGecko...")
try:
crypto_url = f"https://api.coingecko.com/api/v3/simple/price"
crypto_params = {
'ids': crypto_id,
'vs_currencies': 'usd'
}
crypto_response = requests.get(crypto_url, params=crypto_params)
if crypto_response.status_code != 200:
print(f"Error fetching crypto data: {crypto_response.status_code}")
return None
crypto_data = crypto_response.json()
if crypto_id not in crypto_data:
print(f"Cryptocurrency '{crypto_id}' not found")
return None
crypto_usd_price = crypto_data[crypto_id]['usd']
print(f"Current {crypto_id} price: ${crypto_usd_price:,.2f} USD")
# Step 2: Fetch exchange rates from ExchangeRate API
print("Fetching exchange rates from ExchangeRate API...")
# Using a free exchange rate API that doesn't require keys for this example
# In production, you'd want to use a more reliable API with authentication
exchange_url = "https://open.er-api.com/v6/latest/USD"
exchange_response = requests.get(exchange_url)
if exchange_response.status_code != 200:
print(f"Error fetching exchange rate data: {exchange_response.status_code}")
return None
exchange_data = exchange_response.json()
if 'rates' not in exchange_data:
print("Exchange rate data is not in expected format")
return None
rates = exchange_data['rates']
# Step 3: Calculate cryptocurrency value in each currency
results = {}
for currency in currencies:
if currency in rates:
crypto_value = crypto_usd_price * rates[currency]
results[currency] = crypto_value
else:
print(f"Exchange rate for {currency} not found")
return {
'crypto_id': crypto_id,
'usd_price': crypto_usd_price,
'last_updated': exchange_data.get('time_last_update_utc', 'Unknown'),
'values': results
}
except requests.exceptions.RequestException as e:
print(f"Request error: {e}")
return None
except Exception as e:
print(f"Unexpected error: {e}")
return None
def print_crypto_values(crypto_data):
"""Print cryptocurrency values in a formatted table."""
if not crypto_data:
print("No data to display")
return
print(f"\n{crypto_data['crypto_id'].upper()} Exchange Rates")
print(f"Last updated: {crypto_data['last_updated']}")
# Prepare data for table
table_data = []
for currency, value in crypto_data['values'].items():
# Format the value based on currency
if currency == 'JPY':
formatted_value = f"¥{value:,.0f}"
elif currency == 'EUR':
formatted_value = f"€{value:,.2f}"
elif currency == 'GBP':
formatted_value = f"£{value:,.2f}"
elif currency == 'CAD':
formatted_value = f"C${value:,.2f}"
else:
formatted_value = f"${value:,.2f}"
table_data.append([currency, formatted_value])
# Print the table
print(tabulate(table_data, headers=["Currency", "Value"], tablefmt="grid"))
# Get user input for cryptocurrency
crypto_id = input("Enter a cryptocurrency ID (e.g., bitcoin, ethereum): ").lower() or "bitcoin"
# Get exchange rates and print results
crypto_data = get_crypto_exchange_rates(crypto_id)
if crypto_data:
print_crypto_values(crypto_data)
# Sample Output (for bitcoin):
# Fetching bitcoin price data from CoinGecko...
# Current bitcoin price: $63,209.00 USD
# Fetching exchange rates from ExchangeRate API...
#
# BITCOIN Exchange Rates
# Last updated: Fri, 19 Apr 2024 00:00:02 +0000
# +----------+-------------+
# | Currency | Value |
# +==========+=============+
# | USD | $63,209.00 |
# +----------+-------------+
# | EUR | €59,107.22 |
# +----------+-------------+
# | GBP | £50,567.20 |
# +----------+-------------+
# | JPY | ¥9,665,978 |
# +----------+-------------+
# | CAD | C$86,195.08 |
# +----------+-------------+
Solution to Exercise 5: Data Transformation and Export
import requests
import json
import csv
from collections import defaultdict
def fetch_users_and_posts():
"""
Fetch users and their posts from JSONPlaceholder,
and create a report on post counts per user.
"""
print("Fetching users from JSONPlaceholder API...")
try:
# Step 1: Fetch all users
users_response = requests.get('https://jsonplaceholder.typicode.com/users')
if users_response.status_code != 200:
print(f"Error fetching users: {users_response.status_code}")
return None
users = users_response.json()
print(f"Found {len(users)} users")
# Step 2: Fetch all posts
posts_response = requests.get('https://jsonplaceholder.typicode.com/posts')
if posts_response.status_code != 200:
print(f"Error fetching posts: {posts_response.status_code}")
return None
posts = posts_response.json()
print(f"Found {len(posts)} posts")
# Step 3: Count posts by user
user_posts = defaultdict(list)
for post in posts:
user_id = post['userId']
user_posts[user_id].append(post)
# Step 4: Create the report
report = []
for user in users:
user_id = user['id']
posts_by_user = user_posts.get(user_id, [])
# Find the longest post by this user
longest_post = None
if posts_by_user:
longest_post = max(posts_by_user, key=lambda p: len(p['body']))
report.append({
'user_id': user_id,
'name': user['name'],
'username': user['username'],
'email': user['email'],
'company': user['company']['name'],
'post_count': len(posts_by_user),
'longest_post_title': longest_post['title'] if longest_post else None,
'longest_post_word_count': len(longest_post['body'].split()) if longest_post else 0
})
return report
except requests.exceptions.RequestException as e:
print(f"Request error: {e}")
return None
except Exception as e:
print(f"Unexpected error: {e}")
return None
def export_to_csv(data, filename):
"""Export data to a CSV file."""
if not data:
print("No data to export")
return False
try:
with open(filename, 'w', newline='') as csvfile:
fieldnames = data[0].keys()
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(data)
print(f"Data exported to CSV: {filename}")
return True
except Exception as e:
print(f"Error exporting to CSV: {e}")
return False
def export_to_json(data, filename):
"""Export data to a JSON file."""
if not data:
print("No data to export")
return False
try:
with open(filename, 'w') as jsonfile:
json.dump(data, jsonfile, indent=2)
print(f"Data exported to JSON: {filename}")
return True
except Exception as e:
print(f"Error exporting to JSON: {e}")
return False
def print_report(data):
"""Print the report to the console."""
if not data:
print("No data to display")
return
print("\nUser Post Report:")
print("-" * 80)
print(f"{'User ID':<8} {'Name':<20} {'Username':<15} {'Post Count':<10} {'Longest Post Words':<20}")
print("-" * 80)
for user in data:
print(f"{user['user_id']:<8} {user['name'][:20]:<20} {user['username'][:15]:<15} {user['post_count']:<10} {user['longest_post_word_count']:<20}")
# Run the complete process
report_data = fetch_users_and_posts()
if report_data:
# Print the report
print_report(report_data)
# Export to CSV and JSON
export_to_csv(report_data, 'user_post_report.csv')
export_to_json(report_data, 'user_post_report.json')
print("\nExport complete!")
# Sample Output:
# Fetching users from JSONPlaceholder API...
# Found 10 users
# Found 100 posts
#
# User Post Report:
# --------------------------------------------------------------------------------
# User ID Name Username Post Count Longest Post Words
# --------------------------------------------------------------------------------
# 1 Leanne Graham Bret 10 101
# 2 Ervin Howell Antonette 10 94
# 3 Clementine Bauch Samantha 10 85
# 4 Patricia Lebsack Karianne 10 76
# 5 Chelsey Dietrich Kamren 10 82
# 6 Mrs. Dennis Schulist Leopoldo_Corkery 10 84
# 7 Kurtis Weissnat Elwyn.Skiles 10 89
# 8 Nicholas Runolfsdottir V Maxime_Nienow 10 89
# 9 Glenna Reichert Delphine 10 94
# 10 Clementina DuBuque Moriah.Stanton 10 88
#
# Data exported to CSV: user_post_report.csv
# Data exported to JSON: user_post_report.json
#
# Export complete!
4.9 Chapter Summary and Connection to Chapter 5
In this chapter, we’ve explored how to integrate external data sources into our Python applications through web APIs. We’ve learned:
- HTTP and REST Fundamentals: Understanding the request-response cycle, HTTP methods, and status codes that form the foundation of web communication
- Working with the requests Library: Making GET and POST requests, passing parameters, and handling responses
- Parsing API Data: Extracting and transforming data from JSON responses
- Error Handling for APIs: Implementing retry logic, handling rate limits, and managing common API errors
- API Security: Authenticating with APIs securely and protecting sensitive credentials
These skills enable you to access a vast array of external data sources, which is essential for modern data engineering workflows.
Connection to Chapter 5: Object-Oriented Programming for Data Engineering
In the next chapter, we’ll explore Object-Oriented Programming (OOP) and how it can help us structure our data engineering code more effectively. The concepts you’ve learned in this chapter will serve as building blocks:
- We’ll create classes that encapsulate API interactions, making our code more maintainable and reusable
- We’ll implement inheritance to create specialized API client classes for different data sources
- We’ll use encapsulation to hide implementation details and protect sensitive data like API keys
- We’ll design more sophisticated retry and error handling mechanisms using OOP principles
The weather data integrator from our micro-project already demonstrated some OOP concepts through the use of a class. In Chapter 5, we’ll expand on these ideas and create a more robust framework for data fetching that can be extended to work with multiple data sources.
By combining the API integration skills from this chapter with the OOP principles in the next, you’ll be able to build scalable, maintainable data pipelines that can reliably pull data from diverse sources.