🔄 Data Transformation¶

Prerequisites: Modules 01-03, 19_data_processing/01-03
Estimated Time: 1.5-2 hours

Learn to transform raw data into useful formats for analysis and reporting.

What is Data Transformation?¶

Data transformation is the process of converting data from one format or structure to another. It includes:

Filtering - Selecting relevant rows
Sorting - Ordering data meaningfully
Aggregating - Summarizing data
Pivoting - Rotating data (rows to columns)
Calculated fields - Creating new columns from existing ones

Think of it like this: Raw data is like crude oil; transformation refines it into useful fuel.

Filtering and Selection¶

Selecting Specific Columns¶

# Only get what you need
cursor.execute('''
    SELECT customer_id, order_date, total_amount 
    FROM orders
''')

Row Filtering¶

# Multiple conditions
cursor.execute('''
    SELECT * FROM sales 
    WHERE date >= '2024-01-01' 
      AND amount > 100
      AND region IN ('North', 'South')
''')

Sorting and Ranking¶

Basic Sorting¶

# Single column
cursor.execute('SELECT * FROM products ORDER BY price DESC')

# Multiple columns
cursor.execute('''
    SELECT * FROM employees 
    ORDER BY department ASC, salary DESC
''')

Ranking with Window Functions (SQLite 3.25+)¶

# Rank employees by salary within each department
cursor.execute('''
    SELECT 
        name,
        department,
        salary,
        RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
    FROM employees
''')

Aggregation and Grouping¶

Common Aggregations¶

# Summarize by category
cursor.execute('''
    SELECT 
        category,
        COUNT(*) as product_count,
        AVG(price) as avg_price,
        MIN(price) as min_price,
        MAX(price) as max_price,
        SUM(units_sold) as total_sold
    FROM products
    GROUP BY category
''')

Grouping Sets (Multiple Grouping Levels)¶

# Total and by category in one query
# (Simulated with UNION in basic SQLite)
cursor.execute('''
    SELECT category, COUNT(*) as count FROM products GROUP BY category
    UNION ALL
    SELECT 'TOTAL' as category, COUNT(*) FROM products
''')

Pivoting Data¶

Pivoting transforms rows into columns (like a spreadsheet pivot table).

Example: Sales by Month¶

# Before pivot:
# month    | sales
# ---------|-------
# January  | 1000
# February | 1500
# March    | 1200

# After pivot:
# jan_sales | feb_sales | mar_sales
# ----------|-----------|----------
# 1000      | 1500      | 1200

# Using CASE statements for pivot
cursor.execute('''
    SELECT 
        product_id,
        SUM(CASE WHEN month = 'Jan' THEN sales ELSE 0 END) as jan_sales,
        SUM(CASE WHEN month = 'Feb' THEN sales ELSE 0 END) as feb_sales,
        SUM(CASE WHEN month = 'Mar' THEN sales ELSE 0 END) as mar_sales
    FROM monthly_sales
    GROUP BY product_id
''')

Creating Calculated Fields¶

Simple Calculations¶

cursor.execute('''
    SELECT 
        product_name,
        price,
        quantity,
        price * quantity as total_value,
        price * 0.9 as discounted_price
    FROM inventory
''')

Conditional Logic¶

cursor.execute('''
    SELECT 
        customer_name,
        order_total,
        CASE 
            WHEN order_total >= 1000 THEN 'VIP'
            WHEN order_total >= 500 THEN 'Premium'
            ELSE 'Standard'
        END as customer_tier
    FROM orders
''')

Date Calculations¶

cursor.execute('''
    SELECT 
        order_date,
        julianday('now') - julianday(order_date) as days_ago,
        date(order_date, '+30 days') as due_date
    FROM orders
''')

String Transformations¶

Concatenation¶

cursor.execute('''
    SELECT 
        first_name || ' ' || last_name as full_name,
        city || ', ' || country as location
    FROM customers
''')

String Functions¶

cursor.execute('''
    SELECT 
        UPPER(name) as uppercase_name,
        LOWER(email) as lowercase_email,
        SUBSTR(phone, 1, 3) as area_code,
        LENGTH(description) as desc_length,
        REPLACE(status, 'active', 'live') as new_status
    FROM users
''')

Data Normalization and Denormalization¶

Normalization (Reduce Redundancy)¶

Split one table into many:

# Before: orders(customer, address, product, price, ...)
# After:  customers(id, name, address)
#         products(id, name, price)
#         orders(id, customer_id, product_id, quantity)

Denormalization (Improve Read Performance)¶

Join tables for reporting:

cursor.execute('''
    SELECT 
        c.name as customer,
        p.name as product,
        o.quantity,
        p.price * o.quantity as total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    JOIN products p ON o.product_id = p.id
''')

Common Transformation Patterns¶

Pattern	SQL Technique
Filter rows	`WHERE` clause
Select columns	Column list in `SELECT`
Sort data	`ORDER BY`
Summarize	`GROUP BY` with aggregates
Add calculated fields	Expressions in `SELECT`
Conditional values	`CASE` statements
Pivot data	`CASE` with aggregates
Join tables	`JOIN` clauses

Best Practices¶

Transform incrementally - Apply one transformation at a time
Validate results - Check row counts and sample data
Use views - Save complex transformations as database views
Document logic - Comment why transformations are needed
Keep originals - Don't overwrite source data

Common Mistakes¶

Forgetting GROUP BY - All non-aggregated columns must be in GROUP BY
Null handling - NULL + anything = NULL; use COALESCE
Type mismatches - Can't concatenate numbers and strings directly
Losing data - Inner joins exclude non-matching rows unintentionally
Over-pivoting - Creating too many columns makes data unwieldy

✅ Before You Continue¶

When would you pivot data vs. keep it in row format?
What's the difference between WHERE and HAVING in aggregation queries?
How do you handle NULL values in calculations?
When should you normalize vs. denormalize data?

Next Steps¶

Run examples.py to see transformations in action
Practice with exercises.py
Learn report generation next!