SQL Patterns — Ergest Xheblati
A short, tactical book (~77 pages) that teaches SQL through named patterns applied to a real project: building a StackOverflow user reputation/engagement feature table in BigQuery. Each pattern follows the format: name, problem, solution, tradeoffs. Xheblati also publishes the Decision Patterns Substack, extending this pattern-oriented thinking to broader data and business decisions.
This is a 06-reference/concepts/analytics-as-craft text — it treats SQL not as syntax to memorize but as a craft with recurring motifs that compound with practice. The patterns here map to 06-reference/2026-04-04-analytics-engineering-levels Level 1-2 skills (query writing, decomposition, maintainability).
Chapter 1: Introducing the Project
Setup: Uses the public StackOverflow dataset in BigQuery (free tier, 1TB/month). The goal is to build a one-row-per-user feature table with reputation and engagement metrics (posts created, answers, edits, upvotes, comments, streaks, per-day ratios). This is the same shape as a customer 360 or feature table for data science.
Data model: 8 post-type tables, users, comments, votes, and post_history (a bridge table connecting users to posts via activity type). The ER diagram shows 7 relationships across these tables. Key insight: understanding the data model before writing queries is the foundation of accurate SQL.
Chapter 2: Core Concepts
Pattern: Granularity
- Problem: You need to know what constitutes a unique row before you can aggregate or join safely.
- Solution: Express granularity as “one row per X per Y per Z.” Verify with `GROUP BY` + `HAVING COUNT(*) > 1` — if rows come back, your assumption is wrong and you have duplicates to handle.
- Tradeoff: Real-world data is messy; trust but verify. The `post_history` table in StackOverflow has duplicates at the expected grain.
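A minimal sketch of the verification query, using the grain assumed for `post_history` in the project (column names follow the StackOverflow dataset):

```sql
-- Check the assumed grain of post_history: one row per post,
-- per user, per change type, per date.
-- Any rows returned are duplicates at that grain.
SELECT
    post_id,
    user_id,
    post_history_type_id,
    CAST(creation_date AS DATE) AS activity_date,
    COUNT(*) AS row_count
FROM post_history
GROUP BY 1, 2, 3, 4
HAVING COUNT(*) > 1;
```

An empty result confirms the grain; otherwise the duplicates themselves are returned for inspection.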
Pattern: Granularity Manipulation (Aggregation)
- Problem: Source tables are too fine-grained for the target output.
- Solution: Collapse granularity using `GROUP BY` with fewer columns. Use `CASE` statements to reclassify categorical values (e.g., collapsing `post_history_type_id` 1-3 into “created” and 4-6 into “edited”). You don’t always need an aggregate function — `GROUP BY` alone removes duplicates.
- Tradeoff: You lose detail. Keep the finest grain in storage and aggregate at query time.
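A sketch of this collapse on `post_history`, assuming the 1-3/4-6 type-id mapping described above:

```sql
-- Collapse post_history to a coarser grain, reclassifying the
-- fine-grained type ids into two buckets with CASE.
SELECT
    user_id,
    post_id,
    CAST(creation_date AS DATE) AS activity_date,
    CASE
        WHEN post_history_type_id IN (1, 2, 3) THEN 'created'
        WHEN post_history_type_id IN (4, 5, 6) THEN 'edited'
    END AS activity_type
FROM post_history
WHERE post_history_type_id BETWEEN 1 AND 6
-- GROUP BY alone removes duplicates; no aggregate function needed
GROUP BY 1, 2, 3, 4;
```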
Pattern: Date Granularity
- Problem: Timestamps contain too much precision (hours, minutes, seconds) for day-level reporting.
- Solution: Use `CAST(timestamp AS DATE)` or `DATE_TRUNC()` to reduce to the needed level (day, week, month, etc.).
- Tradeoff: Collapsing time precision is irreversible in the output — keep the raw timestamp in the source.
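For example (BigQuery syntax):

```sql
-- Reduce timestamp precision to the reporting grain.
SELECT
    CAST(creation_date AS DATE)                    AS activity_day,    -- day grain
    DATE_TRUNC(CAST(creation_date AS DATE), MONTH) AS activity_month   -- month grain
FROM post_history;
```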
Pattern: Pivoting
- Problem: You need row values turned into separate columns (e.g., one column per activity type).
- Solution: `SUM(CASE WHEN condition THEN 1 ELSE 0 END) AS column_name` inside a `GROUP BY`. This is how all the user metrics get calculated.
- Tradeoff: Schema becomes rigid — adding a new category means adding a new column.
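A sketch of the pivot, assuming a `post_activity` input shaped like the Chapter 3 CTE (one row per user/post/activity_type/date):

```sql
-- Pivot activity types into one column per metric,
-- at one row per user per day.
SELECT
    user_id,
    activity_date,
    SUM(CASE WHEN activity_type = 'created' THEN 1 ELSE 0 END) AS posts_created,
    SUM(CASE WHEN activity_type = 'edited'  THEN 1 ELSE 0 END) AS posts_edited
FROM post_activity
GROUP BY 1, 2;
```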
Pattern: Granularity Multiplication
- Problem: Joining tables at different grains multiplies rows unexpectedly (e.g., 1 user row x 10 history rows = 10 output rows).
- Solution: Understand the grain of both sides before joining. Aggregate first, then join.
- Tradeoff: None if you’re aware of it; catastrophic if you’re not.
Pattern: Accidental INNER JOIN
- Problem: Filtering a LEFT JOINed table in the `WHERE` clause silently converts it to an INNER JOIN, dropping rows.
- Solution: Put filters on the LEFT JOINed table in the `ON` clause, not the `WHERE` clause. The only exception: `WHERE right_table.id IS NULL` (anti-join) preserves the LEFT JOIN behavior.
- Rule of thumb: Default to LEFT JOIN when you’re unsure whether one table is a subset of the other. Data warehouses lack foreign key constraints, so referential integrity is never guaranteed.
Chapter 3: Query Decomposition
Pattern: CTE Decomposition
- Problem: Complex queries are hard to write, read, debug, and maintain as monolithic blocks.
- Solution: Use Common Table Expressions (CTEs) via `WITH ... AS (...)` to break the problem into self-contained sub-problems. Each CTE should have a single responsibility and be independently testable. Chain CTEs by referencing earlier ones.
- Tradeoff: Deep CTE nesting can hit database limits (BigQuery enforces a max nesting depth). At that point, materialize intermediate results as views or tables.
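The chained-CTE shape can be sketched as follows (a simplified skeleton, not the project’s full query):

```sql
-- Each CTE solves one sub-problem; later CTEs reference earlier ones.
WITH post_activity AS (
    -- single responsibility: one row per user/post/date
    SELECT
        user_id,
        post_id,
        CAST(creation_date AS DATE) AS activity_date
    FROM post_history
    GROUP BY 1, 2, 3
),
user_post_metrics AS (
    -- builds on post_activity: one row per user per day
    SELECT
        user_id,
        activity_date,
        COUNT(post_id) AS posts_touched
    FROM post_activity
    GROUP BY 1, 2
)
SELECT * FROM user_post_metrics;
```

Each CTE can be tested in isolation by replacing the final SELECT with `SELECT * FROM <cte_name>`.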
Applied decomposition for the project:
- Sub-problem 1 — Post metrics: the `post_activity` CTE collapses `post_history` to user/post/activity_type/date grain, then joins with `post_types` (a UNION of questions + answers) and pivots into `questions_created`, `answers_created`, `questions_edited`, `answers_edited`.
- Sub-problem 2 — Comment metrics: two CTEs — `comments_by_user` (comments a user wrote) and `comments_on_user_post` (comments others left on a user’s posts), joined on `user_id` + `activity_date`.
- Sub-problem 3 — Vote metrics: the `votes_on_user_post` CTE pivots upvotes (`vote_type_id = 2`) and downvotes (`vote_type_id = 3`) per user per day.
- Final assembly: JOIN all three sub-problem CTEs on `user_id` + `activity_date`, then `GROUP BY` user to get the final one-row-per-user table.
Chapter 4: Query Maintainability
Pattern: Reusability Principle
- Problem: CTEs that are too narrowly scoped can’t be reused, leading to duplication.
- Solution: Design CTEs at a general-enough grain that they can serve multiple downstream consumers. The `post_activity` CTE is reused both for user post metrics and for joining with comments/votes via `post_id`.
- Tradeoff: Generality vs. specificity — too general and you carry unnecessary data; too specific and you duplicate logic.
This connects directly to 06-reference/2026-04-04-dedp-data-asset-reusability-pattern — reusability at the CTE level is the micro version of reusability at the data asset level.
Pattern: DRY (Don’t Repeat Yourself)
- Problem: Copy-pasting the same SQL logic across multiple CTEs creates maintenance burden and inconsistency risk.
- Solution: Extract repeated logic into a single CTE and reference it. Use `UNION ALL` to combine tables with the same schema (e.g., `posts_questions` + `posts_answers` into `post_types`). Add a constant column (e.g., `'question' AS post_type`) to preserve source identity.
- Key distinction: `UNION ALL` is fast (no dedup). `UNION DISTINCT` removes duplicates but is slower — only use it when you can’t guarantee uniqueness beforehand.
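A sketch of the pattern, assuming `posts_questions` and `posts_answers` share a schema as in the StackOverflow dataset:

```sql
-- Combine the two same-schema post tables once, tagging each row
-- with its source so downstream CTEs can tell them apart.
WITH post_types AS (
    SELECT id, 'question' AS post_type FROM posts_questions
    UNION ALL                           -- no dedup needed: sources are disjoint
    SELECT id, 'answer'   AS post_type FROM posts_answers
)
SELECT post_type, COUNT(*) AS post_count
FROM post_types
GROUP BY post_type;
```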
Pattern: Creating Views
- Problem: You find yourself pasting the same CTE into multiple queries.
- Solution: Promote it to a `CREATE OR REPLACE VIEW`. Views encapsulate business logic and are stored in the database, but don’t consume storage (unless materialized).
- Tradeoff: In BigQuery, views count toward CTE nesting limits. If nesting gets too deep, materialize the view into a table.
Chapter 5: Query Performance
Pattern: Reduce Rows Early
- Problem: Joining large, unfiltered tables is expensive and slow.
- Solution: Filter each CTE to the minimum rows needed before joining. In the project, every CTE filters to a fixed window (`creation_date >= '2021-06-01' AND creation_date <= '2021-09-30'`).
- Tradeoff: You must keep filters consistent across CTEs or you’ll get mismatched joins.
Pattern: Avoid Functions in WHERE
- Problem: Using functions in `WHERE` clauses (e.g., `INSTR(tags, '|sql|') > 0`) forces full table scans because the optimizer can’t use indexes.
- Solution: Use `LIKE '%|sql|%'` instead — query optimizers handle `LIKE` much better. Compare columns to fixed values or other columns for best performance.
- Tradeoff: `LIKE` with a leading wildcard still prevents index usage, but it’s better than arbitrary functions.
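The two forms side by side, using the book’s `tags` example (`INSTR` and `LIKE` as in BigQuery):

```sql
-- Function call wrapped around the column: the optimizer can't
-- reason about the predicate, forcing a full scan.
SELECT id
FROM posts_questions
WHERE INSTR(tags, '|sql|') > 0;

-- Plain LIKE predicate on the same column: easier for the
-- optimizer (though the leading % still prevents index use
-- on engines that have indexes).
SELECT id
FROM posts_questions
WHERE tags LIKE '%|sql|%';
```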
Pattern: Reduce Columns (Avoid SELECT *)
- Problem: `SELECT *` in columnar warehouses scans all columns, increasing cost and latency regardless of `LIMIT`.
- Solution: Explicitly select only the columns you need. In the `post_types` CTE, only `id` and a constant `post_type` are selected.
- Tradeoff: More verbose queries, but dramatically cheaper under BigQuery-style pricing (pay per bytes scanned).
Pattern: Delay Ordering
- Problem: Sorting is expensive and unnecessary in intermediate results.
- Solution: Only `ORDER BY` in the final output query; never sort inside CTEs. If the data feeds a BI tool (Looker, Tableau), let the tool handle sorting.
- Tradeoff: Debugging intermediate CTEs is slightly harder without ordering, but `ORDER BY` can always be added temporarily.
Pattern: Bounded Time Windows
- Problem: Queries without date bounds get slower over time as data accumulates.
- Solution: Use fixed windows (`>= '2021-06-01'`) or sliding windows (`>= DATE_ADD(CURRENT_DATE(), INTERVAL -90 DAY)`) to cap the data volume. Sliding windows keep performance constant as the table grows.
- Tradeoff: Fixed lower bounds without upper bounds degrade over time — always bound both sides or use a sliding window.
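The two window styles as `WHERE`-clause fragments (BigQuery syntax; `DATE_SUB` is equivalent to `DATE_ADD` with a negative interval):

```sql
-- Fixed window: bounded on both sides, so the scanned
-- volume is stable no matter how the table grows.
WHERE creation_date >= '2021-06-01'
  AND creation_date <= '2021-09-30'

-- Sliding window: always the trailing 90 days, so the
-- scanned volume stays roughly constant over time.
WHERE CAST(creation_date AS DATE)
      >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
```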
Chapter 6: Query Robustness
Pattern: Safe Casting (Ignore Bad Data)
- Problem: `CAST()` fails on malformed data, breaking the entire query.
- Solution: Use `SAFE_CAST()` (BigQuery) or `TRY_CAST()` (SQL Server) — returns NULL instead of failing. Wrap with `COALESCE()` to provide a sensible default.
- Tradeoff: You silently skip bad rows — make sure to investigate why data is malformed separately.
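A minimal sketch (the `raw_score`/`staging_scores` names are hypothetical):

```sql
-- A plain CAST would abort the whole query on a value like 'banana';
-- SAFE_CAST returns NULL instead, and COALESCE supplies a default.
SELECT
    COALESCE(SAFE_CAST(raw_score AS INT64), 0) AS score
FROM staging_scores;
```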
Pattern: Force Formatting
- Problem: Data has inconsistent formats (e.g., dates as `2021-12--01` and `12/04/2021` in the same column).
- Solution: Use `CASE WHEN ... LIKE pattern` to detect format variants, then `SUBSTRING()` + string concatenation (`||`) to reconstruct the correct format before casting.
- Tradeoff: Fragile if new format variants appear. Combine with `SAFE_CAST()` as a safety net.
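A sketch for the `12/04/2021`-style variant (the `raw_date`/`staging_events` names are hypothetical, and the book’s exact variants may differ):

```sql
-- Detect the format variant and rebuild an ISO date string before
-- casting; SAFE_CAST absorbs anything still unrecognized.
SELECT
    SAFE_CAST(
        CASE
            WHEN raw_date LIKE '__/__/____'      -- e.g. 12/04/2021 (MM/DD/YYYY)
                THEN SUBSTRING(raw_date, 7, 4) || '-' ||
                     SUBSTRING(raw_date, 1, 2) || '-' ||
                     SUBSTRING(raw_date, 4, 2)
            ELSE raw_date                         -- assume already ISO
        END
    AS DATE) AS clean_date
FROM staging_events;
```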
Pattern: NULL Defense
- Problem: Any column can become NULL unexpectedly (new columns, deleted data, upstream changes). NULL propagates through calculations — `NULL + 1 = NULL`.
- Solution: Wrap fields in `COALESCE(field, default_value)` in production queries. Remember: `WHERE col = NULL` doesn’t work — use `IS NULL` / `IS NOT NULL`.
- Tradeoff: Choosing the right default matters — `0` for numbers, `'1900-01-01'` for dates, etc. A bad default can be worse than NULL.
Pattern: Safe Division
- Problem: Division by zero crashes the query. It may work initially but fail later as data changes.
- Solution: Use `SAFE_DIVIDE(numerator, denominator)` (BigQuery), which returns NULL on divide-by-zero, or `DIV0()` (Snowflake), which returns 0. Wrap `SAFE_DIVIDE` with `IFNULL()` or `COALESCE()` to default to 0. Alternative: `CASE WHEN denominator > 0 THEN ... ELSE 0 END`.
- Tradeoff: Returning 0 for a ratio can be misleading — consider whether NULL is more honest.
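Applied to one of the project’s ratio metrics (column names follow the final table described in Chapter 7; `active_days` is illustrative):

```sql
-- Plain division crashes when active_days = 0; SAFE_DIVIDE
-- returns NULL there, and IFNULL converts the NULL to 0.
SELECT
    user_id,
    IFNULL(SAFE_DIVIDE(posts_created, active_days), 0) AS posts_per_day
FROM total_metrics_per_user;
```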
Pattern: String Comparison Defense
- Problem: String joins fail silently due to case differences (`'string' != 'String'`) or whitespace padding (`'string' != ' string'`).
- Solution: Apply `TRIM(LOWER(field))` on both sides when joining or comparing strings. Essential for email fields.
- Tradeoff: Slight performance cost, but far cheaper than debugging join mismatches.
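A minimal sketch (the `signups`/`crm_accounts` tables are hypothetical):

```sql
-- Normalize case and padding on BOTH sides of the join key,
-- so 'Jane@Example.com ' matches 'jane@example.com'.
SELECT a.user_id, b.account_id
FROM signups a
JOIN crm_accounts b
  ON TRIM(LOWER(a.email)) = TRIM(LOWER(b.email));
```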
Chapter 7: Finishing the Project
The final query chains 7 CTEs: `post_activity` -> `post_types` -> `user_post_metrics` -> `comments_by_user` -> `comments_on_user_post` -> `votes_on_user_post` -> `total_metrics_per_user`. The main SELECT adds ratio-based metrics using `SAFE_DIVIDE()` + `IFNULL()` + `ROUND()` and orders by `posts_created DESC`.
The output table serves as a customer 360 / feature table pattern — one row per entity with all metrics pre-computed. Useful for customer segmentation, engagement scoring, and as input features for data science models.
Pattern Index (Quick Reference)
| Pattern | Chapter | Problem | Solution |
|---|---|---|---|
| Granularity | 2 | Don’t know what makes a row unique | Express as “one row per X per Y”, verify with GROUP BY + HAVING |
| Aggregation | 2 | Too fine-grained for target | GROUP BY fewer columns, CASE to reclassify |
| Date Granularity | 2 | Timestamp too precise | CAST(AS DATE) or DATE_TRUNC |
| Pivoting | 2 | Need row values as columns | SUM(CASE WHEN … THEN 1 ELSE 0 END) |
| Granularity Multiplication | 2 | JOIN multiplies rows | Aggregate before joining |
| Accidental INNER JOIN | 2 | LEFT JOIN filters in WHERE become INNER | Put filters in ON clause |
| CTE Decomposition | 3 | Complex query is unmanageable | Break into single-purpose CTEs |
| Reusability | 4 | CTEs can’t be reused | Design at general-enough grain |
| DRY | 4 | Duplicated SQL logic | Extract to CTE, use UNION ALL |
| Views | 4 | Same CTE in multiple queries | Promote to CREATE VIEW |
| Reduce Rows | 5 | Joins on large tables are slow | Filter in CTEs before joining |
| Avoid Functions in WHERE | 5 | Full table scans | Use LIKE instead of functions |
| Reduce Columns | 5 | SELECT * is expensive | Select only needed columns |
| Delay Ordering | 5 | Sorting in CTEs wastes compute | ORDER BY only in final query |
| Bounded Time Windows | 5 | Query gets slower over time | Use fixed or sliding date bounds |
| Safe Casting | 6 | CAST fails on bad data | SAFE_CAST + COALESCE |
| Force Formatting | 6 | Inconsistent data formats | CASE + SUBSTRING to reconstruct |
| NULL Defense | 6 | NULLs propagate silently | COALESCE around all fields |
| Safe Division | 6 | Division by zero | SAFE_DIVIDE + IFNULL |
| String Comparison | 6 | Case/whitespace mismatches | TRIM(LOWER()) on both sides |
Connections
- 06-reference/concepts/systems-over-goals — Patterns are systems. Learning 20 named patterns gives you a reusable toolkit that applies across any SQL problem, rather than solving each query from scratch.
- 06-reference/2026-04-03-the-data-warehouse-toolkit — Kimball’s dimensional modeling operates at the schema design level; Xheblati’s patterns operate at the query authoring level. They’re complementary: Kimball tells you what to build, SQL Patterns tells you how to query it.
- 06-reference/2026-04-04-dedp-data-asset-reusability-pattern — The Reusability Principle and DRY at the CTE/view level are the micro-expression of data asset reusability at the platform level.
- 06-reference/2026-04-04-analytics-engineering-levels — This book covers Level 1 (writing correct queries) and Level 2 (writing maintainable, performant queries). It’s the practical skills complement to the levels framework.
- 01-projects/phdata/index — Strong reference material for consulting engagements. These patterns are the kind of thing you teach in a SQL workshop or embed in a client’s style guide.
Open Questions
- Window functions and string manipulation: The book mentions a Chapter 8 covering window functions, regex, and JSON parsing, but it’s not included in this PDF. Worth tracking down — window functions are a Level 2-3 skill gap for many analysts.
- dbt integration: How do these CTE patterns map to dbt’s model/ref paradigm? The Reusability and DRY patterns are essentially what dbt enforces structurally. Could be worth a dedicated note connecting the two.
- Materialization strategy: Xheblati mentions BigQuery’s CTE nesting limits and the option to materialize views. What’s the decision framework for when to materialize? This connects to cost optimization patterns.
- Testing patterns: The granularity verification query (`GROUP BY` + `HAVING COUNT(*) > 1`) is a lightweight data test. How does this relate to dbt tests or Great Expectations?