On Data Engineering Code Reviews

Metadata

Author: Julien Kervizic
Full Title: On Data Engineering Code Reviews
Category: #articles
URL: https://medium.com/analytics-and-data/on-data-engineering-code-reviews-abb25570f28

Highlights

Code review for Data Engineering code rests on four pillars: Conformance, Engineering, Logic, and Scoping. The focus of each differs from traditional engineering.
The first pillar of code review for data engineers is open to feedback from even the most junior engineers. It focuses on code style, consistency, applying the correct conventions, readability of the code, comments, documentation, and appropriate naming.
Unlike in software engineering, in Data Engineering the specific names used might need to be typed constantly by data consumers. It might also be more challenging to change these names. Unlike traditional web APIs, Data Engineering does not have a universal and well-accepted way to deal with the versioning of data structures and assets. Schema registries exist and support versioning, but they are not ubiquitous across the data landscape.
- Note: I need to learn more about these schema registries
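To make the versioning concern concrete, here is a minimal sketch of a backward-compatibility check between two versions of a table schema. The schema representation (a plain dict of column name to type), the function name `breaking_changes`, and the example columns are all assumptions for illustration; this is not the API of any particular schema registry.

```python
from __future__ import annotations


def breaking_changes(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    """Describe changes that would break consumers of the old schema.

    Hypothetical helper: a column that disappears (removed or renamed) or
    changes type is treated as breaking; a newly added column is not.
    """
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"column removed or renamed: {column}")
        elif new_schema[column] != old_type:
            problems.append(
                f"type changed for {column}: {old_type} -> {new_schema[column]}"
            )
    return problems


# Illustrative schemas (assumed, not from the article).
v1 = {"customer_id": "string", "order_total": "decimal"}
v2 = {"cust_id": "string", "order_total": "float", "currency": "string"}

for problem in breaking_changes(v1, v2):
    print(problem)
```

A check like this could run during review to flag a rename such as `customer_id` to `cust_id` before data consumers discover it.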
There are many aspects to consider within the engineering pillar: performance, testing, code duplication, approaches to dealing with problems, and patterns and methodologies.
- Note: When reviewing from an engineering perspective, look for performance, testing, patterns and methods
For Data Engineering table design, the concept of granularity is of particular importance: at what level of granularity the dataset is, whether the transformations conform to that level of granularity, and whether that grain is adequate for what is being modeled.
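One way a reviewer might test the grain is to check that the columns declared as the grain identify each row uniquely. The sketch below assumes rows are plain dicts; the helper name `grain_violations` and the order-line example data are hypothetical.

```python
from collections import Counter


def grain_violations(rows, grain):
    """Return the grain-key tuples that appear more than once in `rows`.

    `grain` is the list of columns that should uniquely identify a row.
    An empty result means the data conforms to the declared grain.
    """
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Illustrative data: one row per order line (assumed example).
rows = [
    {"order_id": 1, "line_no": 1, "sku": "A"},
    {"order_id": 1, "line_no": 2, "sku": "B"},
    {"order_id": 2, "line_no": 1, "sku": "A"},
]

print(grain_violations(rows, ["order_id", "line_no"]))  # declared grain: order line
print(grain_violations(rows, ["order_id"]))             # too coarse: order 1 repeats
```

If a transformation claims to produce one row per order but the second check returns duplicates, the output does not conform to its stated grain.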
Does the code do what it intends to do overall, i.e., does it match the requirements? Does the code handle the different edge cases possible? Does the code change the logic of what has been implemented, and if so, what is the downstream impact of that change? When looking at the logic aspect of code review for data engineering, there is, however, an added layer of complexity: the logic being implemented also needs to tie up with the data available and be robust to new incoming data, i.e., how does the code deal with the “unknown”?
- Note: Review for logic
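As a small illustration of dealing with the "unknown", the sketch below maps status codes to labels while routing unseen codes to an explicit bucket instead of failing or silently dropping rows. The mapping `KNOWN_STATUSES`, the function `map_status`, and the codes themselves are assumptions made up for this example.

```python
# Hypothetical lookup of status codes the pipeline was designed for.
KNOWN_STATUSES = {"P": "pending", "S": "shipped", "C": "cancelled"}


def map_status(codes):
    """Map raw codes to labels, collecting any codes not seen before.

    Returns (mapped_labels, unseen_codes). Unseen codes are labeled
    "unknown" so the row count is preserved and the gap stays visible.
    """
    mapped = []
    unseen = set()
    for code in codes:
        if code in KNOWN_STATUSES:
            mapped.append(KNOWN_STATUSES[code])
        else:
            mapped.append("unknown")
            unseen.add(code)
    return mapped, unseen


labels, unseen = map_status(["P", "X", "S", None])
print(labels)
print(unseen)
```

During review, the question is whether the code under review makes a deliberate choice like this for values it has never seen, including missing ones.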
A data engineering code review is more involved than just looking at the code; it requires reading the requirements, running the code, looking at the input and output data, and checking for gaps between them. Reading the code provides one layer of value. It allows the reviewer to give feedback on the overall approach taken, check whether the code matches the style and approach agreed within the team, identify some logic mistakes, or propose alternative methods that could end up being more efficient. Reading and analyzing the requirements provides another layer of safety: that the code will be able to handle the necessary use cases and be more robust to future changes. Running the code is another layer. It shows whether the code can run in a different environment, whether it is well documented, whether any setup steps are missing, and whether some files were not correctly committed.
- Note: Layers of a code review
  - reading the code
  - understanding requirements
  - running the code
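Looking at the input and output data after running the code can be as simple as a reconciliation check. The sketch below compares row counts and a summed measure between the input and the output of an aggregation; the function `reconcile`, the `amount` column, and the sample rows are assumptions for illustration only.

```python
def reconcile(input_rows, output_rows, amount_key="amount"):
    """Summarize input vs. output so gaps are easy to spot in review.

    Assumes an aggregation that should preserve the total of `amount_key`.
    Integer amounts are used here to avoid floating-point comparison issues.
    """
    in_total = sum(row[amount_key] for row in input_rows)
    out_total = sum(row[amount_key] for row in output_rows)
    return {
        "input_rows": len(input_rows),
        "output_rows": len(output_rows),
        "totals_match": in_total == out_total,
    }


# Illustrative input and two candidate outputs (assumed example data).
orders = [
    {"customer": "a", "amount": 10},
    {"customer": "a", "amount": 5},
    {"customer": "b", "amount": 7},
]
per_customer = [{"customer": "a", "amount": 15}, {"customer": "b", "amount": 7}]
per_customer_buggy = [{"customer": "a", "amount": 15}]  # customer "b" was dropped

print(reconcile(orders, per_customer))
print(reconcile(orders, per_customer_buggy))
```

A mismatch in totals does not say what went wrong, but it tells the reviewer where to look next.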