🔄 Data Transformation¶
Prerequisites: Modules 01-03, 19_data_processing/01-03
Estimated Time: 1.5-2 hours
Learn to transform raw data into useful formats for analysis and reporting.
What is Data Transformation?¶
Data transformation is the process of converting data from one format or structure to another. It includes:
- Filtering - Selecting relevant rows
- Sorting - Ordering data meaningfully
- Aggregating - Summarizing data
- Pivoting - Rotating data (rows to columns)
- Calculated fields - Creating new columns from existing ones
Think of it like this: Raw data is like crude oil; transformation refines it into useful fuel.
Filtering and Selection¶
Selecting Specific Columns¶
# Only get what you need
cursor.execute('''
SELECT customer_id, order_date, total_amount
FROM orders
''')
Row Filtering¶
# Multiple conditions
cursor.execute('''
SELECT * FROM sales
WHERE date >= '2024-01-01'
AND amount > 100
AND region IN ('North', 'South')
''')
Sorting and Ranking¶
Basic Sorting¶
# Single column
cursor.execute('SELECT * FROM products ORDER BY price DESC')
# Multiple columns
cursor.execute('''
SELECT * FROM employees
ORDER BY department ASC, salary DESC
''')
Ranking with Window Functions (SQLite 3.25+)¶
# Rank employees by salary within each department
cursor.execute('''
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
FROM employees
''')
Aggregation and Grouping¶
Common Aggregations¶
# Summarize by category
cursor.execute('''
SELECT
category,
COUNT(*) as product_count,
AVG(price) as avg_price,
MIN(price) as min_price,
MAX(price) as max_price,
SUM(units_sold) as total_sold
FROM products
GROUP BY category
''')
Grouping Sets (Multiple Grouping Levels)¶
# Total and by category in one query
# (Simulated with UNION in basic SQLite)
cursor.execute('''
SELECT category, COUNT(*) as count FROM products GROUP BY category
UNION ALL
SELECT 'TOTAL' as category, COUNT(*) FROM products
''')
Pivoting Data¶
Pivoting transforms rows into columns (like a spreadsheet pivot table).
Example: Sales by Month¶
# Before pivot:
# month | sales
# ---------|-------
# January | 1000
# February | 1500
# March | 1200
# After pivot:
# jan_sales | feb_sales | mar_sales
# ----------|-----------|----------
# 1000 | 1500 | 1200
# Using CASE statements for pivot
cursor.execute('''
SELECT
product_id,
SUM(CASE WHEN month = 'Jan' THEN sales ELSE 0 END) as jan_sales,
SUM(CASE WHEN month = 'Feb' THEN sales ELSE 0 END) as feb_sales,
SUM(CASE WHEN month = 'Mar' THEN sales ELSE 0 END) as mar_sales
FROM monthly_sales
GROUP BY product_id
''')
Creating Calculated Fields¶
Simple Calculations¶
cursor.execute('''
SELECT
product_name,
price,
quantity,
price * quantity as total_value,
price * 0.9 as discounted_price
FROM inventory
''')
Conditional Logic¶
cursor.execute('''
SELECT
customer_name,
order_total,
CASE
WHEN order_total >= 1000 THEN 'VIP'
WHEN order_total >= 500 THEN 'Premium'
ELSE 'Standard'
END as customer_tier
FROM orders
''')
Date Calculations¶
cursor.execute('''
SELECT
order_date,
julianday('now') - julianday(order_date) as days_ago,
date(order_date, '+30 days') as due_date
FROM orders
''')
String Transformations¶
Concatenation¶
cursor.execute('''
SELECT
first_name || ' ' || last_name as full_name,
city || ', ' || country as location
FROM customers
''')
String Functions¶
cursor.execute('''
SELECT
UPPER(name) as uppercase_name,
LOWER(email) as lowercase_email,
SUBSTR(phone, 1, 3) as area_code,
LENGTH(description) as desc_length,
REPLACE(status, 'active', 'live') as new_status
FROM users
''')
Data Normalization and Denormalization¶
Normalization (Reduce Redundancy)¶
Split one table into many:
# Before: orders(customer, address, product, price, ...)
# After: customers(id, name, address)
# products(id, name, price)
# orders(id, customer_id, product_id, quantity)
Denormalization (Improve Read Performance)¶
Join tables for reporting:
cursor.execute('''
SELECT
c.name as customer,
p.name as product,
o.quantity,
p.price * o.quantity as total
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
''')
Common Transformation Patterns¶
| Pattern | SQL Technique |
|---|---|
| Filter rows | WHERE clause |
| Select columns | Column list in SELECT |
| Sort data | ORDER BY |
| Summarize | GROUP BY with aggregates |
| Add calculated fields | Expressions in SELECT |
| Conditional values | CASE statements |
| Pivot data | CASE with aggregates |
| Join tables | JOIN clauses |
Best Practices¶
- Transform incrementally - Apply one transformation at a time
- Validate results - Check row counts and sample data
- Use views - Save complex transformations as database views
- Document logic - Comment why transformations are needed
- Keep originals - Don't overwrite source data
Common Mistakes¶
- Forgetting GROUP BY - All non-aggregated columns must be in GROUP BY
- Null handling -
NULL + anything = NULL; useCOALESCE - Type mismatches - Can't concatenate numbers and strings directly
- Losing data - Inner joins exclude non-matching rows unintentionally
- Over-pivoting - Creating too many columns makes data unwieldy
✅ Before You Continue¶
- When would you pivot data vs. keep it in row format?
- What's the difference between WHERE and HAVING in aggregation queries?
- How do you handle NULL values in calculations?
- When should you normalize vs. denormalize data?
Next Steps¶
- Run
examples.pyto see transformations in action - Practice with
exercises.py - Learn report generation next!