For some parts of data science / analytics, such a transference of engineering behaviors is suitable: analytics infra, analytics engineering, deployment of machine learning models, libraries — all of these workflows are inherently code-based and benefit from the rigorous test + PR culture familiar in engineering organizations.
By the nature of our work as analysts and data scientists, 90% of our job is exploratory. And here, unfortunately, engineering practices not only fall short, but can be detrimental to the org. Why?
Blindly enforcing version control and code review as gatekeepers for sharing exploratory work leads to unshared exploratory work.
Note: I think this is the distinction between governed and ungoverned work. You may still do all the exploratory analysis you want, but for your result to be approved as a corporate number, you must complete these other engineering-type steps.
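To make the note concrete, here is a minimal sketch of what such a certification gate could look like: a metric query only qualifies as a governed, corporate number if a matching test exists. The directory layout and naming convention are my own assumptions for illustration, not anything from the article.

```python
# Hypothetical gate for "governed" numbers: a metric query is eligible
# for certification only if an accompanying test file exists.
from pathlib import Path

METRICS_DIR = Path("metrics")  # e.g. metrics/monthly_revenue.sql (assumed layout)
TESTS_DIR = Path("tests")      # e.g. tests/test_monthly_revenue.py

def uncertified_metrics() -> list[str]:
    """Return names of metric queries that lack an accompanying test."""
    missing = []
    for query in METRICS_DIR.glob("*.sql"):
        if not (TESTS_DIR / f"test_{query.stem}.py").exists():
            missing.append(query.stem)
    return missing

if __name__ == "__main__":
    missing = uncertified_metrics()
    if missing:
        raise SystemExit(f"Metrics without tests cannot be certified: {missing}")
```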
In engineering, maintainability, reliability, and scalability are the objectives that underpin practices like version control, code reviews, code coverage, and validation testing.
Note: Scalability is not so much a concern for data products, but maintainability and reliability should certainly be objectives!
In analytics work, the underlying objectives are necessarily different: reliability, maintainability, and scalability are still important, but they manifest differently. Let’s ditch the emperor’s clothes and replace these concepts with what we really want: discoverability and reproducibility. In other words, we need to put the “science” back into data science.
Validation is cumbersome and blind. For SQL, you have to copy, paste, and re-execute queries to validate work in a repo. It’s a small inconvenience, but because results aren’t generally tracked in git, it’s difficult to follow the flow of logic from query to query without re-running the queries yourself.
Note: I disagree with this point. The CI runner does the results validation with tests; it is not git’s responsibility. The same is true in software development: git does not store results, and tests validate results, not git.
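As a minimal sketch of what “tests validate results, not git” can look like, the example below keeps a query in version control and asserts its output in a test that a CI runner could execute on every change. It uses only the standard-library sqlite3 module; the table and column names are hypothetical.

```python
# The query lives in git; the test, run by CI, validates its results.
import sqlite3

def fetch_daily_revenue(conn: sqlite3.Connection) -> list[tuple]:
    """The query under review, kept in version control."""
    return conn.execute(
        "SELECT day, SUM(amount) AS revenue FROM orders GROUP BY day ORDER BY day"
    ).fetchall()

def test_daily_revenue():
    # Build a throwaway in-memory fixture so the test is self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("2024-01-01", 100.0), ("2024-01-01", 50.0), ("2024-01-02", 75.0)],
    )
    results = fetch_daily_revenue(conn)
    # These assertions, not the repo, vouch for the numbers.
    assert results == [("2024-01-01", 150.0), ("2024-01-02", 75.0)]
    assert all(revenue >= 0 for _, revenue in results)
```

Wired into a CI job that runs pytest on every commit, this is the validation the note describes.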
Code reviews are great in an ideal world, but reviews cannot keep up with the pace of analytics. Code should still be checked, but the speed of business decisions often necessitates that this happen after the fact and/or on a case-by-case basis, rather than as mandatory peer reviews.
Note: I can feel this pressure. I still believe we can provide flexibility through development environments while keeping governance over production environments.
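A hypothetical sketch of that split: exploratory writes are routed to a per-analyst sandbox by default, and the production schema is reachable only with explicit approval. The schema names and the approval flag are assumptions for illustration, not any real platform’s API.

```python
# Flexibility in dev, governance in prod: route writes to a sandbox
# unless the work has passed an approved review.
import getpass

PROD_SCHEMA = "analytics"

def target_schema(environment: str, approved: bool = False) -> str:
    """Pick a schema for writing results, enforcing governance on prod only."""
    if environment == "dev":
        # Exploratory work goes to an ungoverned, per-user sandbox.
        return f"sandbox_{getpass.getuser()}"
    if environment == "prod":
        if not approved:
            raise PermissionError(
                "Writing to the production schema requires an approved review."
            )
        return PROD_SCHEMA
    raise ValueError(f"Unknown environment: {environment}")

# Exploratory work needs no approval...
print(target_schema("dev"))                  # e.g. sandbox_jane
# ...but promoting to production does.
print(target_schema("prod", approved=True))  # analytics
```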
Some amount of very polished work gets shared (e.g. through Airbnb’s knowledge-repo), but this captures only ~1% of the work done. The remainder lives in tabs in SQL IDEs, local files, and local Jupyter notebooks, and it thus stands to be duplicated unnecessarily, over and over, by different analysts and data scientists.