Databricks Notebooks Code Review Guidelines

The following is a guideline for code reviewing Spark applications.  The code review doesn't end at the code: there are several configurations to review too.  A clean configuration is as important as clean code.

Before delving into the code review, it might also be a good time to familiarize yourself with the Software Development Life Cycle for Databricks Notebooks.

Configurations

The following are the configurations to review.

Data Structures Endpoint JSON

Each endpoint needs to have a descriptor JSON file, located here.  It is important to check that these are configured correctly.  The descriptor contains several names related to the endpoint, and these must match the endpoint being reviewed.

Another important piece of information here is the descriptor for the endpoint's Swagger documentation.  This includes the description of the endpoint, the data it consumes, and the return object.  The description should be clean and contain no spelling mistakes.  The input/output data should be compared against the code to ensure the two are in sync.  We don't want the Swagger documentation to say the endpoint returns a certain object while the implementation actually returns something different.  Below is an example of what to look out for:


{
  "name": "FCAPI_FL_NB_MYENDPOINT",
  "profile": {
    "eventId": "FCAPI_MYENDPOINT"
    ...
  },
  ...
  "documentation": {
    "swagger": "2.0",
    "info": {
      "version": "1.0.0",
      "title": "FTS My Endpoint",
      "description": "This is a description of my endpoint."
    },
    ...
    "paths": {
      "/": {
        "get": {
          "description": "Get the resource for my endpoint.",
          "produces": [
            "application/json"
          ],
          "responses": {
            "200": {
              "description": "The resource of my endpoint",
              "schema": {
                "type": "object",
                "$ref": "#/definitions/MyObject"
              }
              ...
}
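
To make this comparison concrete, here is a minimal Scala sketch of how the return object might be modelled with spray-json (one of the jars listed in the notebook dependencies).  The MyObject fields below are invented for illustration; the real fields live in the descriptor's "definitions" section and must match what the notebook actually returns.


import spray.json._

// Hypothetical shape of MyObject: the real field list must match the
// Swagger "definitions" entry and the object the notebook returns.
case class MyObject(resourceId: String, status: String)

object MyObjectProtocol extends DefaultJsonProtocol {
  implicit val myObjectFormat: RootJsonFormat[MyObject] = jsonFormat2(MyObject)
}

object SwaggerCheck extends App {
  import MyObjectProtocol._
  // Serialise a sample instance and eyeball it against the Swagger schema.
  println(MyObject("RFX-001", "OK").toJson.prettyPrint)
}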

Delivery Platform Files

The DP expects several files to be included in the notebook projects.  These support various tasks such as Jenkins, Sonar, etc.  The following configuration files should be included at the root of each notebook project:

  • Jenkinsfile *
  • scalastyle-config.xml
  • sonar-project.properties *
  • sonar-update-version.sh
  • version.sbt

The files marked with an asterisk contain information specific to the project.  These need to be checked to ensure they have been updated as expected.

Jenkinsfile example of details to change:


...
notebookCiStage(
  ...
  component: 'FTS_API_RFX_FL_NOTEBOOK_MYNOTEBOOK',
  ...
)

sonar-project.properties example of details to change:


# Project details
sonar.projectKey=my-endpoint-notebook-implementation
sonar.projectName=My endpoint Notebook
sonar.projectVersion=0.0.1
...
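
version.sbt also carries project-specific information.  By convention it holds a single line with the project version, which should stay in step with sonar.projectVersion above and the version recorded in notebook.conf.  A typical file might look like this (the version shown is illustrative):


// version.sbt — conventionally the only line in the file.
version in ThisBuild := "0.0.1-SNAPSHOT"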

Notebook Configuration

Within the resources folder (src/main/resources) there is a notebook.conf file.  Again, there is information here specific to the notebook.  This should be checked to ensure the name and Git information have been correctly inserted.  It also lists the dependencies used by the DP.  It is important that these are in sync with the imports in the notebook: any unused dependencies included here will still be bundled into the build by the DP.

Example extract:


{"notebook":{
  "lane": [
    "fast"
  ],
  "id": {
    "name": "my-endpoint",
    "APIID": "4512586457DSGFA44DSA87GF",
    "creation": 1528378405,
    "last-update": 1528378521,
    "git-repo": "git@gitlab-delivery-platform.fexcofts.com:cp/FCAPI_FL_NB_MYENDPOINT.git",
    "version": "0.0.1-SNAPSHOT"
  },
  ...
  "dependencies":[
    "commons-dbcp2-2.5.0.jar",
    "accord-core_2.11-0.7.2.jar",
    "spray-json_2.11-1.3.3.jar",
    "accord-api_2.11-0.7.2.jar",
    "commons-pool2-2.6.0.jar",
    ...
  ],
  ...
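
As a quick review aid, each jar in the dependencies list should map to an import the notebook actually uses.  A hypothetical mapping for the extract above, assuming these jars are the Wix Accord and Apache Commons artifacts:


// Hypothetical import-to-jar mapping for the dependencies listed above.
import spray.json._                              // spray-json_2.11-1.3.3.jar
import com.wix.accord._                          // accord-core / accord-api _2.11-0.7.2
import org.apache.commons.dbcp2.BasicDataSource  // commons-dbcp2-2.5.0.jar (pulls in commons-pool2)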

Code

The following are the guidelines for reviewing the code.

DBC File vs Testable Spec

Within the resources folder there should be a notebook folder containing the DBC file for Databricks.  The reviewer should import this into their own personal workspace.

Within the main Scala directory there should be a testable Scala class.  The reviewer needs to compare this testable class with the DBC file.

This Scala class should mimic the logic in the DBC file.  This includes ensuring that both expect the same event, headers, and body; both perform the same RDB query or ODB post (if any); the same data manipulation happens in both; and, finally, both return the same object.  Any discrepancy here should fail the code review.  Note that the two will not be identical: the testable spec will have extra logic to help with testing it, so it is the business logic that should be the focus.  The reason for this check is that the DBC file can only be run within Databricks, so the logic needs to be extracted into a local Scala class that can be tested locally.  Without a proper code review, we risk the DBC file going out of sync with the testable spec, rendering any test results redundant.
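
As an illustration, here is a minimal sketch (all names hypothetical) of what such an extracted, testable class might look like:


import spray.json._

// Hypothetical extraction: the same business logic the DBC notebook runs,
// kept in a plain Scala object so it can be exercised by local tests.
object MyEndpointLogic {
  def handle(headers: Map[String, String], body: JsValue): JsValue = {
    // Validate the incoming event body, as the notebook does.
    val resourceId = body.asJsObject.fields("resourceId") match {
      case JsString(id) => id
      case other        => throw new IllegalArgumentException(s"Bad resourceId: $other")
    }
    // The real notebook would perform its RDB query / ODB post here.
    // Both versions must return the same object.
    JsObject("resourceId" -> JsString(resourceId), "status" -> JsString("OK"))
  }
}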

At this point, the reviewer can open a discussion on the logic.  This could be anything from suggesting a more efficient way of performing a task, to highlighting a possible bug or missed requirement.

Testable Spec vs Tests

Once the logic has been verified, the reviewer needs to check that the scenarios have been adequately tested.  All business logic should be tested, along with proper assertions on the responses.
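
As a sketch of the level of assertion expected, a hypothetical spec against the extracted logic above might look like this (assuming ScalaTest is on the test classpath):


import org.scalatest.{FlatSpec, Matchers}
import spray.json._

class MyEndpointLogicSpec extends FlatSpec with Matchers {

  "MyEndpointLogic" should "echo the resourceId with an OK status" in {
    val body = JsObject("resourceId" -> JsString("RFX-001"))
    val result = MyEndpointLogic.handle(Map.empty, body)
    result.asJsObject.fields("resourceId") shouldBe JsString("RFX-001")
    result.asJsObject.fields("status") shouldBe JsString("OK")
  }

  it should "reject a body without a resourceId" in {
    an[Exception] should be thrownBy {
      MyEndpointLogic.handle(Map.empty, JsObject())
    }
  }
}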

Reviewing the tests should also include reviewing the functional test.  A JMeter functional test will be included with each notebook.  The reviewer should ensure that it tests the correct endpoint and validates the endpoint response.

Conclusion

It may seem like there is a lot involved here, but most of it is a quick visual check to ensure everything is in order.

The code review is a positive task.  It is not a criticism of another developer's work.  It is there as a safety net to catch errors and omissions, and to open constructive discussions on how to approach certain tasks.  The code review also acts as a kind of knowledge transfer, since the reviewer will now also be aware of how the notebook operates.

When done properly, the code review will ensure that the quality of the product being released meets the standards we have set.