06-reference

sql patterns xheblati

Fri Apr 03 2026 20:00:00 GMT-0400 (Eastern Daylight Time) ·book ·source: PDF (shared via iMessage) ·by Ergest Xheblati

SQL Patterns — Ergest Xheblati

A short, tactical book (~77 pages) that teaches SQL through named patterns applied to a real project: building a StackOverflow user reputation/engagement feature table in BigQuery. Each pattern follows the format: name, problem, solution, tradeoffs. Xheblati also publishes the Decision Patterns Substack, extending this pattern-oriented thinking to broader data and business decisions.

This is a 06-reference/concepts/analytics-as-craft text — it treats SQL not as syntax to memorize but as a craft with recurring motifs that compound with practice. The patterns here map to 06-reference/2026-04-04-analytics-engineering-levels Level 1-2 skills (query writing, decomposition, maintainability).


Chapter 1: Introducing the Project

Setup: Uses the public StackOverflow dataset in BigQuery (free tier, 1TB/month). The goal is to build a one-row-per-user feature table with reputation and engagement metrics (posts created, answers, edits, upvotes, comments, streaks, per-day ratios). This is the same shape as a customer 360 or feature table for data science.

Data model: 8 post-type tables, users, comments, votes, and post_history (a bridge table connecting users to posts via activity type). The ER diagram shows 7 relationships across these tables. Key insight: understanding the data model before writing queries is the foundation of accurate SQL.


Chapter 2: Core Concepts

Pattern: Granularity

Pattern: Granularity Manipulation (Aggregation)

Pattern: Date Granularity

Pattern: Pivoting

Pattern: Granularity Multiplication

Pattern: Accidental INNER JOIN


Chapter 3: Query Decomposition

Pattern: CTE Decomposition

Applied decomposition for the project:

  1. Sub-problem 1 — Post metrics: post_activity CTE collapses post_history to user/post/activity_type/date grain, then joins with post_types (UNION of questions + answers) and pivots into questions_created, answers_created, questions_edited, answers_edited.
  2. Sub-problem 2 — Comment metrics: Two CTEs — comments_by_user (comments a user wrote) and comments_on_user_post (comments others left on a user’s posts). Joined on user_id + activity_date.
  3. Sub-problem 3 — Vote metrics: votes_on_user_post CTE pivots upvotes (vote_type_id=2) and downvotes (vote_type_id=3) per user per day.
  4. Final assembly: JOIN all three sub-problem CTEs on user_id + activity_date, then GROUP BY user to get the final one-row-per-user table.

Chapter 4: Query Maintainability

Pattern: Reusability Principle

This connects directly to 06-reference/2026-04-04-dedp-data-asset-reusability-pattern — reusability at the CTE level is the micro version of reusability at the data asset level.

Pattern: DRY (Don’t Repeat Yourself)

Pattern: Creating Views


Chapter 5: Query Performance

Pattern: Reduce Rows Early

Pattern: Avoid Functions in WHERE

Pattern: Reduce Columns (Avoid SELECT *)

Pattern: Delay Ordering

Pattern: Bounded Time Windows


Chapter 6: Query Robustness

Pattern: Safe Casting (Ignore Bad Data)

Pattern: Force Formatting

Pattern: NULL Defense

Pattern: Safe Division

Pattern: String Comparison Defense


Chapter 7: Finishing the Project

The final query chains 7 CTEs: post_activity -> post_types -> user_post_metrics -> comments_by_user -> comments_on_user_post -> votes_on_user_post -> total_metrics_per_user. The main SELECT adds ratio-based metrics using SAFE_DIVIDE() + IFNULL() + ROUND() and orders by posts_created DESC.

The output table serves as a customer 360 / feature table pattern — one row per entity with all metrics pre-computed. Useful for customer segmentation, engagement scoring, and as input features for data science models.


Pattern Index (Quick Reference)

PatternChapterProblemSolution
Granularity2Don’t know what makes a row uniqueExpress as “one row per X per Y”, verify with GROUP BY + HAVING
Aggregation2Too fine-grained for targetGROUP BY fewer columns, CASE to reclassify
Date Granularity2Timestamp too preciseCAST(AS DATE) or DATE_TRUNC
Pivoting2Need row values as columnsSUM(CASE WHEN … THEN 1 ELSE 0 END)
Granularity Multiplication2JOIN multiplies rowsAggregate before joining
Accidental INNER JOIN2LEFT JOIN filters in WHERE become INNERPut filters in ON clause
CTE Decomposition3Complex query is unmanageableBreak into single-purpose CTEs
Reusability4CTEs can’t be reusedDesign at general-enough grain
DRY4Duplicated SQL logicExtract to CTE, use UNION ALL
Views4Same CTE in multiple queriesPromote to CREATE VIEW
Reduce Rows5Joins on large tables are slowFilter in CTEs before joining
Avoid Functions in WHERE5Full table scansUse LIKE instead of functions
Reduce Columns5SELECT * is expensiveSelect only needed columns
Delay Ordering5Sorting in CTEs wastes computeORDER BY only in final query
Bounded Time Windows5Query gets slower over timeUse fixed or sliding date bounds
Safe Casting6CAST fails on bad dataSAFE_CAST + COALESCE
Force Formatting6Inconsistent data formatsCASE + SUBSTRING to reconstruct
NULL Defense6NULLs propagate silentlyCOALESCE around all fields
Safe Division6Division by zeroSAFE_DIVIDE + IFNULL
String Comparison6Case/whitespace mismatchesTRIM(LOWER()) on both sides

Connections


Open Questions