SQL Patterns — Ergest Xheblati
A short, tactical book (~77 pages) that teaches SQL through named patterns applied to a real project: building a StackOverflow user reputation/engagement feature table in BigQuery. Each pattern follows the format: name, problem, solution, tradeoffs. Xheblati also publishes the Decision Patterns Substack, extending this pattern-oriented thinking to broader data and business decisions.
This is a 06-reference/concepts/analytics-as-craft text — it treats SQL not as syntax to memorize but as a craft with recurring motifs that compound with practice. The patterns here map to 06-reference/2026-04-04-analytics-engineering-levels Level 1-2 skills (query writing, decomposition, maintainability).
Chapter 1: Introducing the Project
Setup: Uses the public StackOverflow dataset in BigQuery (free tier, 1TB/month). The goal is to build a one-row-per-user feature table with reputation and engagement metrics (posts created, answers, edits, upvotes, comments, streaks, per-day ratios). This is the same shape as a customer 360 or feature table for data science.
Data model: 8 post-type tables, users, comments, votes, and post_history (a bridge table connecting users to posts via activity type). The ER diagram shows 7 relationships across these tables. Key insight: understanding the data model before writing queries is the foundation of accurate SQL.
Chapter 2: Core Concepts
Pattern: Granularity
- Problem: You need to know what constitutes a unique row before you can aggregate or join safely.
- Solution: Express granularity as “one row per X per Y per Z.” Verify with `GROUP BY` + `HAVING COUNT(*) > 1` — if rows come back, your assumption is wrong and you have duplicates to handle.
- Tradeoff: Real-world data is messy; trust but verify. The `post_history` table in StackOverflow has duplicates at the expected grain.
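A minimal sketch of the verification query, using the grain assumed for `post_history` in the project (column names follow the StackOverflow dataset):

```sql
-- Check the assumed grain of post_history: one row per post,
-- per user, per change type, per date.
-- Any rows returned are duplicates at that grain.
SELECT
    post_id,
    user_id,
    post_history_type_id,
    CAST(creation_date AS DATE) AS activity_date,
    COUNT(*) AS row_count
FROM post_history
GROUP BY 1, 2, 3, 4
HAVING COUNT(*) > 1;
```

An empty result confirms the grain; otherwise the duplicates themselves are returned for inspection.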
Pattern: Granularity Manipulation (Aggregation)
- Problem: Source tables are too fine-grained for the target output.
- Solution: Collapse granularity using `GROUP BY` with fewer columns. Use `CASE` statements to reclassify categorical values (e.g., collapsing `post_history_type_id` 1-3 into “created” and 4-6 into “edited”). You don’t always need an aggregate function — `GROUP BY` alone removes duplicates.
- Tradeoff: You lose detail. Keep the finest grain in storage and aggregate at query time.
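A sketch of this collapse on `post_history`, assuming the 1-3/4-6 type-id mapping described above:

```sql
-- Collapse post_history to a coarser grain, reclassifying the
-- fine-grained type ids into two buckets with CASE.
SELECT
    user_id,
    post_id,
    CAST(creation_date AS DATE) AS activity_date,
    CASE
        WHEN post_history_type_id IN (1, 2, 3) THEN 'created'
        WHEN post_history_type_id IN (4, 5, 6) THEN 'edited'
    END AS activity_type
FROM post_history
WHERE post_history_type_id BETWEEN 1 AND 6
-- GROUP BY alone removes duplicates; no aggregate function needed
GROUP BY 1, 2, 3, 4;
```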
Pattern: Date Granularity
- Problem: Timestamps contain too much precision (hours, minutes, seconds) for day-level reporting.
- Solution: Use `CAST(timestamp AS DATE)` or `DATE_TRUNC()` to reduce to the needed level (day, week, month, etc.).
- Tradeoff: Collapsing time precision is irreversible in the output — keep the raw timestamp in the source.
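For example (BigQuery syntax):

```sql
-- Reduce timestamp precision to the reporting grain.
SELECT
    CAST(creation_date AS DATE)                    AS activity_day,    -- day grain
    DATE_TRUNC(CAST(creation_date AS DATE), MONTH) AS activity_month   -- month grain
FROM post_history;
```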
Pattern: Pivoting
- Problem: You need row values turned into separate columns (e.g., one column per activity type).
- Solution: `SUM(CASE WHEN condition THEN 1 ELSE 0 END) AS column_name` inside a `GROUP BY`. This is how all the user metrics get calculated.
- Tradeoff: Schema becomes rigid — adding a new category means adding a new column.
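A sketch of the pivot, assuming a `post_activity` input shaped like the Chapter 3 CTE (one row per user/post/activity_type/date):

```sql
-- Pivot activity types into one column per metric,
-- at one row per user per day.
SELECT
    user_id,
    activity_date,
    SUM(CASE WHEN activity_type = 'created' THEN 1 ELSE 0 END) AS posts_created,
    SUM(CASE WHEN activity_type = 'edited'  THEN 1 ELSE 0 END) AS posts_edited
FROM post_activity
GROUP BY 1, 2;
```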
Pattern: Granularity Multiplication
- Problem: Joining tables at different grains multiplies rows unexpectedly (e.g., 1 user row x 10 history rows = 10 output rows).
- Solution: Understand the grain of both sides before joining. Aggregate first, then join.
- Tradeoff: None if you’re aware of it; catastrophic if you’re not.
Pattern: Accidental INNER JOIN
- Problem: Filtering a LEFT JOINed table in the `WHERE` clause silently converts it to an INNER JOIN, dropping rows.
- Solution: Put filters on the LEFT JOINed table in the `ON` clause, not the `WHERE` clause. The only exception: `WHERE right_table.id IS NULL` (anti-join) preserves the LEFT JOIN behavior.
- Rule of thumb: Default to LEFT JOIN when you’re unsure whether one table is a subset of the other. Data warehouses lack foreign key constraints, so referential integrity is never guaranteed.
Chapter 3: Query Decomposition
Pattern: CTE Decomposition
- Problem: Complex queries are hard to write, read, debug, and maintain as monolithic blocks.
- Solution: Use Common Table Expressions (CTEs) via `WITH ... AS (...)` to break the problem into self-contained sub-problems. Each CTE should have a single responsibility and be independently testable. Chain CTEs by referencing earlier ones.
- Tradeoff: Deep CTE nesting can hit database limits (BigQuery enforces a max nesting depth). At that point, materialize intermediate results as views or tables.
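The chained-CTE shape can be sketched as follows (a simplified skeleton, not the project’s full query):

```sql
-- Each CTE solves one sub-problem; later CTEs reference earlier ones.
WITH post_activity AS (
    -- single responsibility: one row per user/post/date
    SELECT
        user_id,
        post_id,
        CAST(creation_date AS DATE) AS activity_date
    FROM post_history
    GROUP BY 1, 2, 3
),
user_post_metrics AS (
    -- builds on post_activity: one row per user per day
    SELECT
        user_id,
        activity_date,
        COUNT(post_id) AS posts_touched
    FROM post_activity
    GROUP BY 1, 2
)
SELECT * FROM user_post_metrics;
```

Each CTE can be tested in isolation by replacing the final SELECT with `SELECT * FROM <cte_name>`.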
Applied decomposition for the project:
- Sub-problem 1 — Post metrics: the `post_activity` CTE collapses `post_history` to user/post/activity_type/date grain, then joins with `post_types` (a UNION of questions + answers) and pivots into `questions_created`, `answers_created`, `questions_edited`, `answers_edited`.
- Sub-problem 2 — Comment metrics: two CTEs — `comments_by_user` (comments a user wrote) and `comments_on_user_post` (comments others left on a user’s posts), joined on `user_id` + `activity_date`.
- Sub-problem 3 — Vote metrics: the `votes_on_user_post` CTE pivots upvotes (`vote_type_id = 2`) and downvotes (`vote_type_id = 3`) per user per day.
- Final assembly: JOIN all three sub-problem CTEs on `user_id` + `activity_date`, then `GROUP BY` user to get the final one-row-per-user table.
Chapter 4: Query Maintainability
Pattern: Reusability Principle
- Problem: CTEs that are too narrowly scoped can’t be reused, leading to duplication.
- Solution: Design CTEs at a general-enough grain that they can serve multiple downstream consumers. The `post_activity` CTE is reused both for user post metrics and for joining with comments/votes via `post_id`.
- Tradeoff: Generality vs. specificity — too general and you carry unnecessary data; too specific and you duplicate logic.
This connects directly to 06-reference/2026-04-04-dedp-data-asset-reusability-pattern — reusability at the CTE level is the micro version of reusability at the data asset level.
Pattern: DRY (Don’t Repeat Yourself)
- Problem: Copy-pasting the same SQL logic across multiple CTEs creates maintenance burden and inconsistency risk.
- Solution: Extract repeated logic into a single CTE and reference it. Use `UNION ALL` to combine tables with the same schema (e.g., `posts_questions` + `posts_answers` into `post_types`). Add a constant column (e.g., `'question' AS post_type`) to preserve source identity.
- Key distinction: `UNION ALL` is fast (no dedup). `UNION DISTINCT` removes duplicates but is slower — only use it when you can’t guarantee uniqueness beforehand.
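A sketch of the pattern, assuming `posts_questions` and `posts_answers` share a schema as in the StackOverflow dataset:

```sql
-- Combine the two same-schema post tables once, tagging each row
-- with its source so downstream CTEs can tell them apart.
WITH post_types AS (
    SELECT id, 'question' AS post_type FROM posts_questions
    UNION ALL                           -- no dedup needed: sources are disjoint
    SELECT id, 'answer'   AS post_type FROM posts_answers
)
SELECT post_type, COUNT(*) AS post_count
FROM post_types
GROUP BY post_type;
```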
Pattern: Creating Views
- Problem: You find yourself pasting the same CTE into multiple queries.
- Solution: Promote it to a `CREATE OR REPLACE VIEW`. Views encapsulate business logic and are stored in the database, but don’t consume storage (unless materialized).
- Tradeoff: In BigQuery, views count toward CTE nesting limits. If nesting gets too deep, materialize the view into a table.
Chapter 5: Query Performance
Pattern: Reduce Rows Early
- Problem: Joining large, unfiltered tables is expensive and slow.
- Solution: Filter each CTE to the minimum rows needed before joining. In the project, every CTE filters to a fixed window (`creation_date >= '2021-06-01' AND creation_date <= '2021-09-30'`).
- Tradeoff: You must keep filters consistent across CTEs or you’ll get mismatched joins.
Pattern: Avoid Functions in WHERE
- Problem: Using functions in `WHERE` clauses (e.g., `INSTR(tags, '|sql|') > 0`) forces full table scans because the optimizer can’t use indexes.
- Solution: Use `LIKE '%|sql|%'` instead — query optimizers handle `LIKE` much better. Compare columns to fixed values or other columns for best performance.
- Tradeoff: `LIKE` with a leading wildcard still prevents index usage, but it’s better than arbitrary functions.
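The two forms side by side, using the book’s `tags` example (`INSTR` and `LIKE` as in BigQuery):

```sql
-- Function call wrapped around the column: the optimizer can't
-- reason about the predicate, forcing a full scan.
SELECT id
FROM posts_questions
WHERE INSTR(tags, '|sql|') > 0;

-- Plain LIKE predicate on the same column: easier for the
-- optimizer (though the leading % still prevents index use
-- on engines that have indexes).
SELECT id
FROM posts_questions
WHERE tags LIKE '%|sql|%';
```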
Pattern: Reduce Columns (Avoid SELECT *)
- Problem: `SELECT *` in columnar warehouses scans all columns, increasing cost and latency regardless of `LIMIT`.
- Solution: Explicitly select only the columns you need. In the `post_types` CTE, only `id` and a constant `post_type` are selected.
- Tradeoff: More verbose queries, but dramatically cheaper under BigQuery-style pricing (pay per bytes scanned).
Pattern: Delay Ordering
- Problem: Sorting is expensive and unnecessary in intermediate results.
- Solution: Only `ORDER BY` in the final output query; never sort inside CTEs. If the data feeds a BI tool (Looker, Tableau), let the tool handle sorting.
- Tradeoff: Debugging intermediate CTEs is slightly harder without ordering, but `ORDER BY` can always be added temporarily.
Pattern: Bounded Time Windows
- Problem: Queries without date bounds get slower over time as data accumulates.
- Solution: Use fixed windows (`>= '2021-06-01'`) or sliding windows (`>= DATE_ADD(CURRENT_DATE(), INTERVAL -90 DAY)`) to cap the data volume. Sliding windows keep performance constant as the table grows.
- Tradeoff: Fixed lower bounds without upper bounds degrade over time — always bound both sides or use a sliding window.
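The two window styles as `WHERE`-clause fragments (BigQuery syntax; `DATE_SUB` is equivalent to `DATE_ADD` with a negative interval):

```sql
-- Fixed window: bounded on both sides, so the scanned
-- volume is stable no matter how the table grows.
WHERE creation_date >= '2021-06-01'
  AND creation_date <= '2021-09-30'

-- Sliding window: always the trailing 90 days, so the
-- scanned volume stays roughly constant over time.
WHERE CAST(creation_date AS DATE)
      >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
```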
Chapter 6: Query Robustness
Pattern: Safe Casting (Ignore Bad Data)
- Problem: `CAST()` fails on malformed data, breaking the entire query.
- Solution: Use `SAFE_CAST()` (BigQuery) or `TRY_CAST()` (SQL Server) — returns NULL instead of failing. Wrap with `COALESCE()` to provide a sensible default.
- Tradeoff: You silently skip bad rows — make sure to investigate why data is malformed separately.
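A minimal sketch (the `raw_score`/`staging_scores` names are hypothetical):

```sql
-- A plain CAST would abort the whole query on a value like 'banana';
-- SAFE_CAST returns NULL instead, and COALESCE supplies a default.
SELECT
    COALESCE(SAFE_CAST(raw_score AS INT64), 0) AS score
FROM staging_scores;
```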
Pattern: Force Formatting
- Problem: Data has inconsistent formats (e.g., dates as `2021-12--01` and `12/04/2021` in the same column).
- Solution: Use `CASE WHEN ... LIKE pattern` to detect format variants, then `SUBSTRING()` + string concatenation (`||`) to reconstruct the correct format before casting.
- Tradeoff: Fragile if new format variants appear. Combine with `SAFE_CAST()` as a safety net.
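A sketch for the `12/04/2021`-style variant (the `raw_date`/`staging_events` names are hypothetical, and the book’s exact variants may differ):

```sql
-- Detect the format variant and rebuild an ISO date string before
-- casting; SAFE_CAST absorbs anything still unrecognized.
SELECT
    SAFE_CAST(
        CASE
            WHEN raw_date LIKE '__/__/____'      -- e.g. 12/04/2021 (MM/DD/YYYY)
                THEN SUBSTRING(raw_date, 7, 4) || '-' ||
                     SUBSTRING(raw_date, 1, 2) || '-' ||
                     SUBSTRING(raw_date, 4, 2)
            ELSE raw_date                         -- assume already ISO
        END
    AS DATE) AS clean_date
FROM staging_events;
```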
Pattern: NULL Defense
- Problem: Any column can become NULL unexpectedly (new columns, deleted data, upstream changes). NULL propagates through calculations — `NULL + 1 = NULL`.
- Solution: Wrap fields in `COALESCE(field, default_value)` in production queries. Remember: `WHERE col = NULL` doesn’t work — use `IS NULL` / `IS NOT NULL`.
- Tradeoff: Choosing the right default matters — `0` for numbers, `'1900-01-01'` for dates, etc. A bad default can be worse than NULL.
Pattern: Safe Division
- Problem: Division by zero crashes the query. It may work initially but fail later as data changes.
- Solution: Use `SAFE_DIVIDE(numerator, denominator)` (BigQuery), which returns NULL on divide-by-zero, or `DIV0()` (Snowflake), which returns 0. Wrap `SAFE_DIVIDE` with `IFNULL()` or `COALESCE()` to default to 0. Alternative: `CASE WHEN denominator > 0 THEN ... ELSE 0 END`.
- Tradeoff: Returning 0 for a ratio can be misleading — consider whether NULL is more honest.
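Applied to one of the project’s ratio metrics (column names follow the final table described in Chapter 7; `active_days` is illustrative):

```sql
-- Plain division crashes when active_days = 0; SAFE_DIVIDE
-- returns NULL there, and IFNULL converts the NULL to 0.
SELECT
    user_id,
    IFNULL(SAFE_DIVIDE(posts_created, active_days), 0) AS posts_per_day
FROM total_metrics_per_user;
```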
Pattern: String Comparison Defense
- Problem: String joins fail silently due to case differences (`'string' != 'String'`) or whitespace padding (`'string' != ' string'`).
- Solution: Apply `TRIM(LOWER(field))` on both sides when joining or comparing strings. Essential for email fields.
- Tradeoff: Slight performance cost, but far cheaper than debugging join mismatches.
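A minimal sketch (the `signups`/`crm_accounts` tables are hypothetical):

```sql
-- Normalize case and padding on BOTH sides of the join key,
-- so 'Jane@Example.com ' matches 'jane@example.com'.
SELECT a.user_id, b.account_id
FROM signups a
JOIN crm_accounts b
  ON TRIM(LOWER(a.email)) = TRIM(LOWER(b.email));
```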
Chapter 7: Finishing the Project
The final query chains 7 CTEs: `post_activity` -> `post_types` -> `user_post_metrics` -> `comments_by_user` -> `comments_on_user_post` -> `votes_on_user_post` -> `total_metrics_per_user`. The main SELECT adds ratio-based metrics using `SAFE_DIVIDE()` + `IFNULL()` + `ROUND()` and orders by `posts_created DESC`.
The output table serves as a customer 360 / feature table pattern — one row per entity with all metrics pre-computed. Useful for customer segmentation, engagement scoring, and as input features for data science models.
Pattern Index (Quick Reference)
| Pattern | Chapter | Problem | Solution |
|---|---|---|---|
| Granularity | 2 | Don’t know what makes a row unique | Express as “one row per X per Y”, verify with GROUP BY + HAVING |
| Aggregation | 2 | Too fine-grained for target | GROUP BY fewer columns, CASE to reclassify |
| Date Granularity | 2 | Timestamp too precise | CAST(AS DATE) or DATE_TRUNC |
| Pivoting | 2 | Need row values as columns | SUM(CASE WHEN … THEN 1 ELSE 0 END) |
| Granularity Multiplication | 2 | JOIN multiplies rows | Aggregate before joining |
| Accidental INNER JOIN | 2 | LEFT JOIN filters in WHERE become INNER | Put filters in ON clause |
| CTE Decomposition | 3 | Complex query is unmanageable | Break into single-purpose CTEs |
| Reusability | 4 | CTEs can’t be reused | Design at general-enough grain |
| DRY | 4 | Duplicated SQL logic | Extract to CTE, use UNION ALL |
| Views | 4 | Same CTE in multiple queries | Promote to CREATE VIEW |
| Reduce Rows | 5 | Joins on large tables are slow | Filter in CTEs before joining |
| Avoid Functions in WHERE | 5 | Full table scans | Use LIKE instead of functions |
| Reduce Columns | 5 | SELECT * is expensive | Select only needed columns |
| Delay Ordering | 5 | Sorting in CTEs wastes compute | ORDER BY only in final query |
| Bounded Time Windows | 5 | Query gets slower over time | Use fixed or sliding date bounds |
| Safe Casting | 6 | CAST fails on bad data | SAFE_CAST + COALESCE |
| Force Formatting | 6 | Inconsistent data formats | CASE + SUBSTRING to reconstruct |
| NULL Defense | 6 | NULLs propagate silently | COALESCE around all fields |
| Safe Division | 6 | Division by zero | SAFE_DIVIDE + IFNULL |
| String Comparison | 6 | Case/whitespace mismatches | TRIM(LOWER()) on both sides |
Connections
- 06-reference/concepts/systems-over-goals — Patterns are systems. Learning 20 named patterns gives you a reusable toolkit that applies across any SQL problem, rather than solving each query from scratch.
- 06-reference/2026-04-03-the-data-warehouse-toolkit — Kimball’s dimensional modeling operates at the schema design level; Xheblati’s patterns operate at the query authoring level. They’re complementary: Kimball tells you what to build, SQL Patterns tells you how to query it.
- 06-reference/2026-04-04-dedp-data-asset-reusability-pattern — The Reusability Principle and DRY at the CTE/view level are the micro-expression of data asset reusability at the platform level.
- 06-reference/2026-04-04-analytics-engineering-levels — This book covers Level 1 (writing correct queries) and Level 2 (writing maintainable, performant queries). It’s the practical skills complement to the levels framework.
- 01-projects/phdata/index — Strong reference material for consulting engagements. These patterns are the kind of thing you teach in a SQL workshop or embed in a client’s style guide.
Open Questions
- Window functions and string manipulation: The book mentions a Chapter 8 covering window functions, regex, and JSON parsing, but it’s not included in this PDF. Worth tracking down — window functions are a Level 2-3 skill gap for many analysts.
- dbt integration: How do these CTE patterns map to dbt’s model/ref paradigm? The Reusability and DRY patterns are essentially what dbt enforces structurally. Could be worth a dedicated note connecting the two.
- Materialization strategy: Xheblati mentions BigQuery’s CTE nesting limits and the option to materialize views. What’s the decision framework for when to materialize? This connects to cost optimization patterns.
- Testing patterns: The granularity verification query (`GROUP BY` + `HAVING COUNT(*) > 1`) is a lightweight data test. How does this relate to dbt tests or Great Expectations?