Fear not - while I may not have learned much during my Data Science major, a few years working on analytics teams has given me the requisite knowledge to illuminate this thick darkness of harmful jargon. (View Highlight)
Note: Great language. “Illuminate this thick darkness of harmful jargon”
→ Analytics stacks are very different from Data Science stacks
Data Science – or in my personal definition, the practice of creating and utilizing predictive models to drive business value – is distinct from analytics, and that distinction is beyond the scope of this post. Suffice it to say that Machine Learning is its own beast with its own tools, and I won’t be covering those tools here. That said, Machine Learning stacks will usually overlap with analytics stacks and rely on some of the same tools (e.g. data warehouses). (View Highlight)
Note: I do not think analytics and ML stacks have to be different. Notebooks are different but things like the model store can be in the DW. Maybe there is only more overlap for tech ML.
The goal of any analytics stack is to be able to answer questions about the business with data. Those questions can be simple:
How many active users do we have?
What was our recurring revenue last month?
Did we hit our goal for sales leads this quarter?
But they can also be complex and bespoke:
What behavior in a user’s first day indicates they’re likely to sign up for a paid plan?
What’s our net dollar retention for customers with more than 50 employees?
Which paid marketing channel has been most effective over the past 7 days? (View Highlight)
Note: Drive home the point. It’s only as good as the results you are able to produce
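Questions like the ones above all reduce to queries over data that has been gathered into one place. A minimal sketch of the "how many active users" question, using a hypothetical `events` table in an in-memory SQLite database as a stand-in for a real warehouse:

```python
import sqlite3

# Hypothetical schema: one row per user action, as a stand-in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "login", "2024-01-01"), (2, "login", "2024-01-01"),
     (1, "purchase", "2024-01-02"), (3, "login", "2024-01-02")],
)

# "How many active users do we have?" is one COUNT(DISTINCT ...) away --
# once the data is actually in a single queryable place.
active_users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events"
).fetchone()[0]
print(active_users)  # → 3
```

The query is trivial; the hard part, as the post argues, is everything required to get the data into that shape.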
the road to getting there is unpaved and treacherous. The actual data you need is all over the place, siloed in different tools with different interfaces. It’s dirty, and needs reformatting and cleaning. It’s constantly changing, and needs maintenance, care, and thoughtful monitoring. The analytics stack and its associated work are all about getting that data into the format and location you need. (View Highlight)
Note: 1. Gather data - break down silos
2. Clean data - format for easy querying
3. Maintain data - monitor for changes and adjust accordingly
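The three steps in the note above can be sketched as a toy pipeline; every function and field name here is hypothetical, chosen only to illustrate the gather → clean → maintain loop:

```python
# A toy sketch of gather / clean / maintain; all names are hypothetical.

def gather(sources):
    """Step 1: pull rows out of their silos into one list."""
    return [row for source in sources for row in source]

def clean(rows):
    """Step 2: reformat for easy querying (normalize keys, drop bad rows)."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
        if r.get("user") and r.get("amount") is not None
    ]

def check(rows):
    """Step 3: a minimal monitor -- fail loudly when the shape drifts."""
    for r in rows:
        assert set(r) == {"user", "amount"}, f"schema drift: {set(r)}"
    return rows

crm = [{"user": " Alice ", "amount": "10"}]
billing = [{"user": "bob", "amount": "2.5"}, {"user": "", "amount": "1"}]
rows = check(clean(gather([crm, billing])))
print(rows)  # two cleaned rows; the empty-user row is dropped
```

Real stacks do each step with dedicated tools (connectors, transformation layers, monitoring), but the shape of the work is the same.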
Where data comes from: production data stores, instrumentation, SaaS tools, and public data
Where data goes: managed data warehouses and homegrown storage
How data moves around: ETL tools, SaaS connectors, and streaming
How data gets ready: modeling, testing, and documentation
How data gets used: charting, SQL clients, sharing (View Highlight)
Much more in vogue these days is instrumenting your product, or firing little “events” every time a user does something in the product. Those events go into a database (View Highlight)
The move towards getting information out of events instead of production databases is being called Event Driven Analytics, and it’s picking up steam. (View Highlight)
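The "little events" fired by instrumentation are typically small JSON payloads. A hypothetical `track` call, assuming nothing about any particular SDK (real products usually go through one, e.g. Segment or Snowplow), but the payload shape is roughly this:

```python
import json
import time
import uuid

# Hypothetical instrumentation helper; field names are illustrative, not any SDK's API.
def track(user_id, event, properties=None):
    payload = {
        "event_id": str(uuid.uuid4()),   # unique id, used to dedupe downstream
        "user_id": user_id,
        "event": event,                  # e.g. "clicked_signup"
        "properties": properties or {},  # arbitrary context about the action
        "timestamp": time.time(),
    }
    # In production this would be sent to a collector; here we just serialize it.
    return json.dumps(payload)

msg = track(42, "clicked_signup", {"plan": "pro"})
```

Each such event lands in a database, and analytics queries reconstruct user behavior from the event stream rather than from production tables.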
Effective analytics teams have what’s called a data model – it’s how they map the domain of the business (users, dollars, activity) to the actual underlying data. Modeling covers a few important pieces of warehouse design:
Definitions – what does a user mean? What does active mean? What does recurring mean? I.e. how you map concepts to your data
Intermediate tables – taking common queries, scheduling them, and materializing them as tables in your warehouse (e.g. daily_active_users)
Performance and structure – how to organize longer and wider tables in a schema that optimizes query speed and cost (View Highlight)
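The "intermediate tables" idea above can be made concrete with a small sketch: take a common query, pin down a definition (here, active = fired any event that day), and materialize it as a table. The schema is hypothetical, and SQLite stands in for a real warehouse:

```python
import sqlite3

# Hypothetical raw table, as a stand-in for warehouse data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-01"), (1, "2024-01-01"), (1, "2024-01-02")],
)

# The "model": a definition (active = any event that day) turned into a
# materialized intermediate table, so downstream queries stay simple and fast.
conn.execute("""
    CREATE TABLE daily_active_users AS
    SELECT day, COUNT(DISTINCT user_id) AS dau
    FROM events
    GROUP BY day
""")

rows = conn.execute(
    "SELECT day, dau FROM daily_active_users ORDER BY day"
).fetchall()
print(rows)  # → [('2024-01-01', 2), ('2024-01-02', 1)]
```

In a real warehouse the `CREATE TABLE ... AS SELECT` would be scheduled and refreshed, but the payoff is the same: the definition is written once, and everyone queries `daily_active_users` instead of re-deriving it.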
Typically, modeling has been relegated to disparate SQL files and scheduling via ETL tools. But dbt, an open source package for data modeling via templated SQL, has been all the rage over the past few quarters. They’re riding the wave of “the emergence of the Analytics Engineer” and offer a full solution for building, testing, and documenting a warehouse. (View Highlight)
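To make "data modeling via templated SQL" concrete, here is a toy mimic of the idea in plain Python. dbt itself uses Jinja templates and a `ref()` function to resolve model names to physical tables; this sketch only imitates that pattern with the stdlib, and the model and schema names are hypothetical:

```python
from string import Template

# Toy registry mapping logical model names to physical tables
# (dbt's ref() does this resolution for real, via Jinja).
MODELS = {"stg_events": "my_schema.stg_events"}

def ref(name):
    """Resolve a logical model name to the table it materializes as."""
    return MODELS[name]

# A "model": SQL with a placeholder instead of a hard-coded table name.
model_sql = Template("""
SELECT day, COUNT(DISTINCT user_id) AS dau
FROM $events
GROUP BY day
""")

compiled = model_sql.substitute(events=ref("stg_events"))
print(compiled)
```

The point of the indirection is that models reference each other by name, so the tool can infer a dependency graph, build tables in the right order, and retarget schemas per environment.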