๐งน Data Cleaning¶
Prerequisites: Modules 01-03, 19_data_processing/01-02
Estimated Time: 1.5-2 hours
Learn techniques for handling messy, incomplete, and inconsistent data.
Why Data Cleaning Matters¶
Real-world data is messy: - Missing values - Duplicate records - Inconsistent formatting - Invalid data types - Typos and errors
Garbage in, garbage out - Clean data is essential for accurate analysis!
Common Data Quality Issues¶
| Issue | Example | Solution |
|---|---|---|
| Missing values | age: None |
Imputation, deletion |
| Duplicates | Same record twice | Deduplication |
| Inconsistent formats | "2024-01-01" vs "01/01/2024" |
Standardization |
| Typos | "Englsh" vs "English" |
Validation, correction |
| Outliers | Age: 150 | Detection, handling |
| Type mismatches | "123" instead of 123 |
Type conversion |
Handling Missing Values¶
Detect Missing Values¶
import sqlite3
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
# Count NULL values per column
cursor.execute('''
SELECT
COUNT(*) - COUNT(age) as missing_age,
COUNT(*) - COUNT(email) as missing_email
FROM users
''')
Strategies for Missing Data¶
# 1. Delete rows with missing values
# Use when: Missing data is rare and random
cursor.execute('DELETE FROM users WHERE email IS NULL')
# 2. Fill with default value
# Use when: There's a sensible default
cursor.execute('UPDATE users SET country = ? WHERE country IS NULL', ('Unknown',))
# 3. Fill with statistical value (mean, median)
# Use when: Data is numeric and missing at random
cursor.execute('SELECT AVG(age) FROM users WHERE age IS NOT NULL')
avg_age = cursor.fetchone()[0]
cursor.execute('UPDATE users SET age = ? WHERE age IS NULL', (avg_age,))
# 4. Forward fill (use previous value)
# Use when: Data is ordered (time series)
Removing Duplicates¶
Find Duplicates¶
# Find duplicate emails
cursor.execute('''
SELECT email, COUNT(*) as count
FROM users
GROUP BY email
HAVING COUNT(*) > 1
''')
Remove Duplicates¶
# Keep only the first occurrence
cursor.execute('''
DELETE FROM users
WHERE id NOT IN (
SELECT MIN(id)
FROM users
GROUP BY email
)
''')
Standardizing Data¶
String Standardization¶
# Trim whitespace
cursor.execute("UPDATE users SET name = TRIM(name)")
# Convert to consistent case
cursor.execute("UPDATE users SET email = LOWER(email)")
cursor.execute("UPDATE products SET category = UPPER(category)")
# Standardize formats
cursor.execute('''
UPDATE users
SET phone = REPLACE(REPLACE(phone, '(', ''), ')', '')
''') # Remove parentheses from phone numbers
Date Standardization¶
from datetime import datetime
# Convert various date formats to ISO format
def standardize_date(date_str):
formats = ['%Y-%m-%d', '%m/%d/%Y', '%d-%m-%Y']
for fmt in formats:
try:
return datetime.strptime(date_str, fmt).strftime('%Y-%m-%d')
except ValueError:
continue
return None # Invalid date
Data Validation¶
Check Constraints¶
# Validate email format (basic check)
cursor.execute('''
SELECT * FROM users
WHERE email NOT LIKE '%@%.%'
''')
invalid_emails = cursor.fetchall()
# Validate age is reasonable
cursor.execute('SELECT * FROM users WHERE age < 0 OR age > 120')
invalid_ages = cursor.fetchall()
# Check for negative prices
cursor.execute('SELECT * FROM products WHERE price < 0')
invalid_prices = cursor.fetchall()
Data Type Conversion¶
# Convert string numbers to integers
def safe_int(value, default=0):
try:
return int(value)
except (ValueError, TypeError):
return default
# Convert and validate
cursor.execute('SELECT id, age FROM users')
for row_id, age_str in cursor.fetchall():
age = safe_int(age_str)
if age < 0 or age > 120:
age = None # Mark as invalid
cursor.execute('UPDATE users SET age = ? WHERE id = ?', (age, row_id))
Outlier Detection¶
Statistical Methods¶
# Using standard deviation
cursor.execute('SELECT AVG(price), AVG(price*price) FROM products')
avg, avg_sq = cursor.fetchone()
std_dev = (avg_sq - avg**2) ** 0.5
# Find outliers (beyond 3 standard deviations)
lower = avg - 3 * std_dev
upper = avg + 3 * std_dev
cursor.execute('''
SELECT * FROM products
WHERE price < ? OR price > ?
''', (lower, upper))
outliers = cursor.fetchall()
IQR Method (Interquartile Range)¶
# More robust to extreme outliers
# Q1 = 25th percentile, Q3 = 75th percentile
# IQR = Q3 - Q1
# Outliers are < Q1 - 1.5*IQR or > Q3 + 1.5*IQR
Data Cleaning Workflow¶
โโโโโโโโโโโโโโโโโโโ
โ 1. Profile โ โ Understand data structure, identify issues
โ Data โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโ
โ 2. Handle โ โ Remove or fill missing values
โ Missing Data โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโ
โ 3. Remove โ โ Delete duplicate records
โ Duplicates โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโ
โ 4. Standardize โ โ Consistent formats, units, casing
โ Formats โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโ
โ 5. Validate โ โ Check ranges, formats, constraints
โ & Fix โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโ
โ 6. Document โ โ Record what was changed
โ Changes โ
โโโโโโโโโโโโโโโโโโโ
Best Practices¶
- Always work on a copy - Never clean original data directly
- Document everything - Track what was changed and why
- Be conservative - Don't remove data unless necessary
- Validate results - Check your cleaning didn't introduce errors
- Automate - Create reusable cleaning functions
Common Mistakes¶
- Deleting too aggressively - Losing valuable data
- Filling missing values blindly - Using inappropriate defaults
- Not checking for duplicates - Inflating counts
- Ignoring outliers - They might be errors or interesting cases
- No documentation - Can't reproduce or explain changes
โ Before You Continue¶
- What are 3 different ways to handle missing data?
- When should you delete duplicates vs. keep them?
- Why is it important to standardize text casing?
- What's the difference between an outlier and an error?
Next Steps¶
- Run
examples.pyto see data cleaning in action - Practice with
exercises.py - Learn data transformation techniques next!