“My favorite thing about the work that we’re all doing is that we’re expanding the scope of what you can do if you know SQL,” he said. “We’re allowing people who know SQL to do a lot of the things that previously you could have only done if you were a software engineer. I think that’s a really great mission – bringing that power of software engineering to a larger group of people.” (View Highlight)
Tags: #favorite
“When you start using something like Census, that means that data is going to be live in a lot more systems, so it’s bad if you screw up. There’s a lot more responsibility there, and data teams will be responsible for testing, validating, and making sure it’s of high quality. That’s what software teams do relatively well. We’ve learned so much about how to ship systems on the internet, and we have to learn how to do the same for data.” (View Highlight)
George Fraser:
Yeah, I think it’s also part of a larger story of declining end-to-end latency in the analytic database stack. In the data warehouse world, which is the world we all live in, 24 hours was historically the update frequency. The classic way you would configure your Teradata data warehouse was to ingest all the new data at night, when no one was querying it. And so it was expected that everything would be 24 hours old. And you can do a lot with 24-hour-old data, don’t get me wrong, but there’s a lot of things that you can’t do with 24-hour-old data. (View Highlight)
In the traditional enterprise data warehouse, the data pipeline just copies all the data every night, because anything else is too complicated, too error-prone. So you just refresh the entire thing every night. And we immediately found that was not going to be feasible. These SaaS tools’ APIs are too slow to pull the entire data set even once a day. And so we built our connectors around change data capture from day one, and we did it because we had to. (View Highlight)
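A minimal sketch of the change-data-capture idea described above, assuming a simple `updated_at` cursor column — the row shape, field names, and in-memory "warehouse" here are hypothetical illustrations, not Fivetran's actual implementation:

```python
def incremental_sync(source_rows, destination, cursor):
    """Change data capture, sketched: copy only rows modified since the
    last sync, instead of re-copying the entire table every night."""
    new_rows = [r for r in source_rows if r["updated_at"] > cursor]
    for row in new_rows:
        destination[row["id"]] = row  # upsert keyed on the primary key
    # Advance the cursor so the next run picks up where this one ended.
    return max((r["updated_at"] for r in new_rows), default=cursor)

source = [
    {"id": 1, "updated_at": "2021-03-01", "name": "old row"},
    {"id": 2, "updated_at": "2021-03-03", "name": "changed row"},
]
warehouse = {}
cursor = incremental_sync(source, warehouse, "2021-03-02")
# Only row 2 is copied; the cursor advances to "2021-03-03".
```

Because each sync touches only the rows changed since the last cursor position, a 15-minute or 1-minute update frequency becomes feasible where a full nightly copy is not.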
It had the side effect that if you configured the update frequency down to one hour, or 30 minutes, or 15 minutes, or one minute, you had much more up-to-date data. And suddenly there were all these other things that you could do with it. And every other component of the stack has to change in order to enable that. (View Highlight)
Every incremental 2X reduction in end-to-end latency requires every single component of the stack to make a major improvement. But I do think the unifying story across the entire analytic data stack is that latency going down from 24 hours, to one hour, to one minute, to below that. (View Highlight)
George Fraser:
Yeah. I think when you cross the boundary from micro-batch systems to true streaming systems that actually process data transactionally one row at a time, that is a hard barrier. It’s just much more expensive. It’s almost like physics. You have to do certain things in order to process data a single row at a time. There are all of these efficiencies you gain with micro-batch systems. And that boundary is at about one second. To go below one second, you have to pay a significantly higher cost. And there’s just so much that you can do with latencies in the seconds range. I think that is an important stopping point. (View Highlight)
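The micro-batch economics described above can be sketched with a toy cost model (the overhead numbers here are invented purely for illustration): each batch pays a fixed per-commit overhead that is amortized across the rows it carries, so shrinking the batch size all the way down to one row multiplies the total cost.

```python
import math

def pipeline_cost(n_rows, batch_size, per_row_work=1.0, per_batch_overhead=100.0):
    """Toy model: every batch pays a fixed overhead (commit, scheduling,
    fsync), shared by all the rows inside it."""
    n_batches = math.ceil(n_rows / batch_size)
    return n_rows * per_row_work + n_batches * per_batch_overhead

micro_batch = pipeline_cost(1_000_000, batch_size=10_000)  # 100 batches
true_streaming = pipeline_cost(1_000_000, batch_size=1)    # 1,000,000 batches
# The row-at-a-time system pays the fixed overhead a million times over.
```

Under these made-up constants the per-row pipeline is roughly 100x more expensive, which is the shape of the tradeoff, not a real benchmark.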
George Fraser:
And all systems have latency. It always annoys me when people say real time or zero latency. What does that mean? You’re looking at this, it takes about… I used to be a neuroscientist. It takes a couple hundred milliseconds for this to hit your visual cortex. (View Highlight)
Boris Jabes:
Yeah. I mean, when we started Census in 2018, this really wasn’t a thing. I think there’s this hangover, to George’s point, about Teradata batch warehouses that were a day behind. Not only that, they were also very expensive once upon a time. And the idea that those could be a source for anything was weird. And so the warehouse, I think, was by definition a sink, to use distributed systems terminology. And we just felt like, well, this is where all the interesting data modeling is happening. (View Highlight)
One lesson I learned almost a decade ago is that people live in their pane of glass. In a company, people have their preferred tool. They just do. A salesperson likes to live in their sales tool, and a product manager probably likes to live in some kind of JIRA-like product. And so you have to bring the data to them, you really have to bring the insights to them. (View Highlight)
There’s this whole Kimball, Inmon data warehousing religious war that’s been going on since whatever, the ’90s or something and-
George Fraser:
’80s.
Tristan Handy:
… ’80s, longer than I go back. And I think that the funny thing was that that religious war did not carry over into the modern data warehouse for a long time. (View Highlight)
And so, you have a bunch of people using Redshift and Fivetran, and Mode or Looker, and people are just doing fishing expeditions. There’s all this data inside of Redshift, and then they’re like, “I want to find out some answers.” And then you go all the way down to the raw data and you try to figure something out, and then you come all the way back up to the surface. And you’re like, “Hey, I came back with a report. And what do you know? It took me literally a full week to produce this one report, and I forgot everything I learned in the process, but I still got the report.” (View Highlight)
It reminds me of big spreadsheets. There reaches a point where the form can’t keep up… the content exceeds it. You can have a spreadsheet that has a ton of stuff, but ultimately only one human being can even reason about it. I once worked at Microsoft, and I would see these people who were the custodians of the mega spreadsheet. And they’re like, “Don’t touch it. It computes the perfect number somewhere in there. Good luck ever being able to iterate on it.” (View Highlight)
Boris Jabes:
And your point about… You’d see these people do these deep spelunking sessions, come up with a report, and then that would disappear the next day; that entire knowledge that was built up is just gone. That, to me, is why the side effect of what dbt has caused is so great. Because now there’s an artifact and you can collaborate, and iterate, and actually treat it like something that is shared and everyone can learn from, which I think didn’t exist before. (View Highlight)
Have you ever put a screen protector on an iPhone, and then you realize there’s a little bubble, an air bubble?
Boris Jabes:
Yeah.
Tristan Handy:
It’s maddening. And so you get out your credit card and you slowly try to push the bubble out to the edge. I mean, you can eventually make the bubble disappear. But problems like this… The Excel spreadsheet is the bubble right in the middle. And then you’re like, “Okay, now it’s in SQL, it’s in code.” But it’s still in this 250-line report. That’s slightly closer to the edge, but still not that much progress. And now with dbt, yes, we’ve made a lot of progress here, but we still feel like we’re pushing that bubble out to the edge. (View Highlight)
In software engineering systems, you have tech debt. Whereas before, you didn’t really have tech debt, because it was so dysfunctional you couldn’t even call it technology.
Tristan Handy:
Now we have these systems that people are trying to treat as production systems, and yet there’s not always the governance in place to know where’s the stuff to deprecate, where’s the stuff that doesn’t have good test coverage. There are all these classes of problems that our ecosystem is not yet mature around, and we feel a lot of pain around them. (View Highlight)
Note: The tooling needs to be good enough to even have tech debt.
Boris Jabes:
Yeah, I think we’re at the very… I hate to use sports analogies or whatever, but this is the first inning of data turning into a software artifact. The analogy I’ve used over the last couple of years is… When you plug in something like Census, that means the data is going to become live in a lot more systems than just reports, which means the downside is really bad. So if you screw up, you have a lot more responsibility, and that means you have to start treating this artifact as… You have to test it and validate it and make sure it’s of high quality, all while you also modify it as you go along. (View Highlight)
Note: Exposing data in the source systems increases the criticality of analytics
Arjun Narayan:
But if you aren’t using a data warehouse as the place to join your data together and merge all of these dozens of SaaS applications’ data, where this customer data is spread across, where else can you do it in a way where you can actually stay sane? I don’t think there’s any other system that gives you the correctness guarantees of the data warehouse. (View Highlight)
There’s a lot of behavioral things that we really care a lot about pushing people on, like test coverage. I don’t know what the number is right now. It’s somewhere between a third and a half of dbt projects that run tests at all. Which means that somewhere between half and two thirds of projects are completely untested, and yet have 100-plus models. And that’s scary to me.
Boris Jabes:
It’s super scary. (View Highlight)
Arjun Narayan:
So, the thing that I really want is for these models to feed data into production systems. And then I want to use that and say, are you really going to allow incorrect data in front of your sales reps, kicking off automated emails or whatever? It increases the level of criticality of doing things the right way. (View Highlight)
It’s amazing how far we are still in terms of testing and monitoring data. We’re just at the beginning, really barely at the beginning.
Arjun Narayan:
I think that just goes back to the fact that historically data warehouses were used for reporting, and the expectation was that human beings would check the correctness of these reports and go fix the SQL queries if something was wrong. It’s the same way that software engineering started with manual testing. Testing meant that before you cut a release, a bunch of people would click every button and make sure it works, and that evolved into automated testing. It’s the exact same evolution with data. As you go from releasing a report every quarter or every week, which is checked by a person before it goes out, to powering an operational system every second, you can’t do manual testing anymore. You have to do automated testing. And so you have to adopt all these best practices. The nice thing is that software engineering came first.
Boris Jabes:
Yes, so we have all the cheat codes. We literally have all the cheat codes. (View Highlight)
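The manual-to-automated testing evolution described above can be sketched as a few checks run on every pipeline execution, in the spirit of dbt's `not_null` and `unique` tests — the specific rules and row shape here are illustrative assumptions, not any particular project's test suite:

```python
def run_data_tests(rows):
    """Return a list of failed checks for a batch of rows; an empty
    list means the data is allowed to flow on to downstream systems."""
    failures = []
    ids = [r.get("id") for r in rows]
    if any(i is None for i in ids):
        failures.append("not_null: id")
    if len(ids) != len(set(ids)):
        failures.append("unique: id")
    if any(r.get("amount", 0) < 0 for r in rows):
        failures.append("accepted_range: amount >= 0")
    return failures

good = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
bad = [{"id": 1, "amount": 10}, {"id": 1, "amount": -5}]
# run_data_tests(good) passes; run_data_tests(bad) flags both the
# duplicate id and the negative amount.
```

The point is that no human looks at the report before it ships: the checks gate the pipeline automatically, every run.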
If Fivetran was down for just a couple hours, people would freak out.
Arjun Narayan:
And we were like, “Why is this such a big problem for you? Okay, it’s not great that we’re down for six hours today, but it feels like it shouldn’t be the end of the world. Your dashboards are six hours behind.” And then we found out what people were doing with the data. People were running payroll off of the data. People were billing their customers off of the data. They were like, “We can’t pay our employees until Fivetran is back online.” And I was like, “Oh, my God! I did not realize you were doing that with the data.” And ultimately, it was a challenge that we just had to rise to. We had to say, “Okay, we are part of an operational system now. We need to have the kind of reliability that operational systems have.” But realize that that is a foreign concept to the data warehousing stack. (View Highlight)
George Fraser:
Let’s name and shame, guys. Which warehouse? Which of the three is actually good for production use case? (View Highlight)
Note: “Name and shame” that’s funny
My hope is that there’s this new class of humans, who at one point in time we would have called data analysts, who are now empowered to essentially orchestrate production workloads, not just analytical workloads, but operational workloads too. But they’re not classically what you would think of as software engineers. But maybe as this stuff goes deeper and deeper toward production systems that have more and more criticality, maybe you get software engineers much more back in the mix. (View Highlight)
dbt has really leveled up the practitioner. And then because of the transitions in the infrastructure, now we have to think about the composition of the team: where does it sit in the organization, and what is in it? (View Highlight)
I feel like there are almost three layers of people within a data team. There’s low-level data eng, which is what you and I would call capital-E Engineering, software engineering, people that may even own things like George’s favorite product, Kafka.
Boris Jabes:
And then you’ve got your analytics engineering organization at this point, or practitioners who are maybe building core models. And then you still have a data science team that is doing ad hoc discovery, predictions, analysis, et cetera. I don’t even know exactly where to… I feel like AI and ML fit into that mix somewhere. (View Highlight)
Where does that report up to? I mean, I have seen BI teams to this day that report up to a CFO. That’s pretty fricking common. And if you’re going to become more engineering-minded or you’re going to mix in with engineers, you really can’t report up to a CFO. That doesn’t make sense. And so, is that a dotted-line relationship? Actually, I’m not sure. (View Highlight)
Tristan Handy:
Most analytics engineering teams today don’t have things like on-call rotations. They don’t use PagerDuty.
Boris Jabes:
Right. It’s just a matter of time.
Tristan Handy:
Maybe it is. But also that could be a rude awakening, when you say, “Hey, you added the term engineer to your title and welcome to weekend PagerDuty.” (View Highlight)
Note: What are the weekend work best practices? How do you balance out the work week?
I would say, from what you just said, Matt, you’re at what I would call the extreme of operationalizing data, because you’re operationalizing it in your application, in the product that is Shopify.
Matt:
Absolutely.
Boris Jabes:
That’s the most live ammo you get. I call that the final form of operationalizing. (View Highlight)