Handling the SDLC for Databricks Notebooks and Workflows

Worried about how to provide a quality SDLC for Databricks-based Spark applications?

REQUIREMENTS FOR THE READER

  1. Please read this post first to understand the current topic.
  2. Have a look at the Databricks documentation, especially the sections about the Databricks REST API and the Databricks CLI.

 

BACKGROUND

Working with Databricks in Azure is really amazing. As you have seen in previous posts, the Databricks workspace contains the elements we need to perform complex operations through our Spark applications, either as isolated notebooks or as workflows (chained notebooks running related operations and sub-operations on the same data sets).

On the other hand, there is no structured way to handle these notebooks as standard components. Common Software Development Life Cycle (SDLC) operations like testing, quality metrics, etc. are not directly possible, as the notebook was born as a tool for developers rather than as a production component. However, notebooks are totally reliable in production stages. This post describes our methodology to provide an SDLC for notebooks and workflows.

We saw in the mentioned post that the developer keeps the notebook code in his/her personal workspace in sync with the Git repository. However, this is not enough to provide SDLC features to the component.

GOALS

  1. Ensure the full automation of SDLC processes for notebooks.
  2. Guarantee the full testability of all components in notebooks, with maximum coverage.

DESCRIPTION

We provide the SDLC for notebooks and workflows by including all components in the same package. We’ll call this the companion package. For now our scope is restricted to Scala notebooks, so the companion package extension will be jar. For Python and R we’ll define other extensions and structures, better suited to the ways those languages provide SDLC support to notebooks.

The companion package is a file containing all we need for SDLC purposes: Unit Tests, Integration Tests, quality thresholds and even the configuration settings for the different stages/environments.

The idea behind the companion package is to give the pipeline something to use in order to:

  • Run tests (Unit and Integration) at build time.
  • Place the notebooks in the right place. To do this, the pipeline needs the description of each notebook: name, id, parameters, level of access, etc.
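
As a rough illustration of the second point, a pipeline step could push the notebook source to a workspace through the Databricks Workspace API (POST /api/2.0/workspace/import). The sketch below only builds the JSON request body. The object name NotebookImport, the helper names and the target path in the usage comment are our own assumptions for the example, not part of the companion package.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical helper (not part of the original post): builds the JSON body
// for a Databricks Workspace API import call.
object NotebookImport {

  // Escapes backslashes and double quotes so the value is a valid JSON string.
  private def jsonEscape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  // The API expects the notebook source as Base64-encoded content.
  def importBody(targetPath: String, scalaSource: String, overwrite: Boolean = true): String = {
    val content = Base64.getEncoder.encodeToString(scalaSource.getBytes(StandardCharsets.UTF_8))
    s"""{"path":"${jsonEscape(targetPath)}","format":"SOURCE","language":"SCALA","content":"$content","overwrite":$overwrite}"""
  }
}

// Usage (the target path is an assumed example):
// val body = NotebookImport.importBody("/Production/ratesRequest", notebookSource)
```

The pipeline would send this body with its own HTTP client and an access token. Both are left out here on purpose.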

Testing Notebooks

Notebooks are not directly executable or testable in a standard runtime. They need to be executed against a runtime environment provided by specific platforms capable of providing a function/server scenario. Databricks, Jupyter and Zeppelin are the most commonly used ones.

We need a workaround.

So, we need to include some required files in the companion package:

  1. Mock data files for I/O emulation.
  2. Code for Unit Tests.
  3. Code for Integration Tests.
  4. Testable Scala class or classes with the code of the notebook or notebooks. This is a slightly different, more conventional version of the notebook code: it lives in regular objects and classes, so the test framework can run the test components against it and check the assertions on the outputs obtained.
  5. Configuration settings for the notebook/workflow. These settings can contain environment/stage specific information such as database settings, data sinks, queue connections, stream configuration, etc.
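
To make items 1, 2 and 4 more concrete, here is a minimal sketch of what a testable class and its unit test could look like. Everything in it is assumed for illustration: the Rate type, the RatesRequestLogic object and the mock values are invented, and plain assert stands in for whatever test framework the project really uses.

```scala
// Hypothetical testable counterpart of the notebook code: regular case classes
// and objects with pure functions, so no notebook runtime is needed to test it.
final case class Rate(country: String, currency: String, value: BigDecimal)

object RatesRequestLogic {

  // Keeps only the rates for the requested country and currency.
  def select(rates: Seq[Rate], country: String, currency: String): Seq[Rate] =
    rates.filter(r => r.country == country && r.currency == currency)

  // Average value of the given rates, or None when there are none.
  def average(rates: Seq[Rate]): Option[BigDecimal] =
    if (rates.isEmpty) None else Some(rates.map(_.value).sum / rates.size)
}

object RatesRequestLogicTest {

  // Mock data standing in for the real input (item 1 in the list above).
  val mockRates: Seq[Rate] = Seq(
    Rate("UK", "EUR", BigDecimal("1.10")),
    Rate("UK", "EUR", BigDecimal("1.20")),
    Rate("UK", "USD", BigDecimal("1.30"))
  )

  // A unit test written with plain assertions.
  def run(): Unit = {
    val selected = RatesRequestLogic.select(mockRates, "UK", "EUR")
    assert(selected.size == 2)
    assert(RatesRequestLogic.average(selected) == Some(BigDecimal("1.15")))
    assert(RatesRequestLogic.average(Seq.empty).isEmpty)
  }
}
```

The notebook itself would then contain little more than the calls to these functions, which keeps the tested code and the deployed code as close as possible.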

Configuration Settings (Stage specific)

All settings files have the application.conf.<stage> naming convention:

  • application.conf.beta
  • application.conf.candidate
  • application.conf.production

The application.conf files follow the configuration format (HOCON) defined by Lightbend for Scala applications. There are two ways to use the application.conf file in notebooks and workflows:

  • Use the Lightbend library as a dependency:
libraryDependencies += "com.typesafe" % "config" % "1.3.3"
  • ...or use a custom configuration loader. Given that we are not using Lightbend libraries or other framework dependencies, this is probably the best option.
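
A custom loader does not need to be complicated. The sketch below is one possible shape, under strong assumptions: it only understands flat key = value lines (a small subset of what the Lightbend format allows), and the StageConfig object and its method names are invented for this example.

```scala
import scala.io.Source

// Hypothetical minimal loader for stage-specific settings files.
// Assumption: the files contain only flat "key = value" lines.
object StageConfig {

  // Parses the lines, skipping blank lines and lines starting with # or //.
  def parse(lines: Iterator[String]): Map[String, String] =
    lines
      .map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#") && !l.startsWith("//"))
      .map(_.split("=", 2))
      .collect { case Array(k, v) => k.trim -> v.trim.stripPrefix("\"").stripSuffix("\"") }
      .toMap

  // Follows the application.conf.<stage> naming convention.
  def fileNameFor(stage: String): String = s"application.conf.$stage"

  // Reads the settings file for the given stage from the working directory.
  def load(stage: String): Map[String, String] = {
    val src = Source.fromFile(fileNameFor(stage), "UTF-8")
    try parse(src.getLines()) finally src.close()
  }
}

// Usage: val settings = StageConfig.load("beta")
```

Nested sections, includes and substitutions are not handled. If the settings files need them, the Lightbend library is the safer choice.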

Additionally, the Companion prototype provides standard values for the configuration settings (src/test/resources/application.conf) so that the tests can run as expected, for instance the settings for the companion application’s embedded databases (Operational and Read-Only).

 

Describing Notebooks

The developer is in charge of providing the configuration settings that describe the notebook or workflow. For a workflow, we expect one description file for the workflow itself and another one for each individual notebook managed by the workflow.

This is an example of the description of a notebook:

 {"notebook":{
   "lane": [
     "fast"
   ],
   "id": {
     "name": "ratesRequest",
     "APIID": "4512586457DSGFA44DSA87GF",
     "creation": 1528378405,
     "last-update": 1528378521,
     "git-repo": "rates-request",
     "version": "2.0.9"
   },
   "endpoint-reference": {
     "APIID": "0BB6F63DDCA4365C040F0AA7437E0E8709FB9960"
   },
   "workflow": {
     "included": false,
     "name": "",
     "APIID": "",
     "series": "",
     "position": ""
   },
   "execution": {
     "job": {
       "enabled": false,
       "id": ""
     },
     "streaming": {
       "enabled": true,
       "stream-id": "rate-stream"
     },
     "parameters": [
       {"name": "country", "type": "UK"},
       {"name": "Rate", "type": "EUR"}
     ]
   },
   "dependencies":[
     "azure-eventhubs-databricks_2.11-3.3.0.jar",
     "azure-datalake-databricks_2.11-0.2.0.jar"
   ],
   "performance": {
     "once-execution": 45,
     "stress": 68,
     "load": 66
   }
 }
}
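
Since this description is written by hand, the pipeline could run a few sanity checks on it before using it. The following sketch models a handful of the fields above as case classes and reports inconsistencies. The class names and the specific rules are assumptions made for this example, and reading the JSON into these classes is left to whichever JSON library the pipeline already uses.

```scala
// Hypothetical model of a few descriptor fields (names mirror the JSON keys).
final case class Job(enabled: Boolean, id: String)
final case class Streaming(enabled: Boolean, streamId: String)
final case class WorkflowRef(included: Boolean, name: String)
final case class NotebookDescriptor(
  name: String,
  version: String,
  workflow: WorkflowRef,
  job: Job,
  streaming: Streaming
)

object DescriptorChecks {

  // Returns one message per inconsistency found; an empty list means no problem.
  // The rules are assumed for illustration, not taken from the pipeline itself.
  def problems(d: NotebookDescriptor): List[String] =
    List(
      (d.name.trim.isEmpty, "id.name is empty"),
      (d.workflow.included && d.workflow.name.isEmpty, "workflow.included is true but workflow.name is empty"),
      (d.job.enabled && d.job.id.isEmpty, "execution.job is enabled but has no id"),
      (d.streaming.enabled && d.streaming.streamId.isEmpty, "execution.streaming is enabled but has no stream-id")
    ).collect { case (true, message) => message }
}
```

For the descriptor shown above, built as NotebookDescriptor("ratesRequest", "2.0.9", WorkflowRef(false, ""), Job(false, ""), Streaming(true, "rate-stream")), the check returns an empty list.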

This is an example of the description of a workflow, containing the references to the included notebooks:

{"workflow":{
  "lane": [
    "scheduled"
  ],
  "id": {
    "name": "workflowReports",
    "APIID": "4512586457DSGFA44DSA87GF",
    "creation": 1528378405,
    "last-update": 1528378521,
    "git-repo": "workflow-reports",
    "version": "0.0.1"
  },
  "endpoint-reference": {
    "APIID": "0BB6F63DDCA4365C040F0AA7437E0E8709FB9960"
  },
  "notebooks": [
    "reportA",
    "reportB"
  ],
  "execution": {
    "job":{
      "enabled": true,
      "id": "4512"
    },
    "streaming":{
      "enabled": false,
      "stream-id": ""
    },
    "parameters": [
      {"name": "startPeriod", "type": "Date"},
      {"name": "endPeriod", "type": "Date"}
    ]
  },
  "dependencies":[
    "azure-eventhubs-databricks_2.11-3.3.0.jar",
    "azure-datalake-databricks_2.11-0.2.0.jar"
  ],
  "performance": {
    "once-execution": 2036,
    "stress": 2544,
    "load": 2260
  }
}
}
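
As said before, a workflow needs one description file for itself and one for each notebook it manages. A pipeline could verify that with a check along these lines. This is only a sketch: WorkflowDescriptor, WorkflowChecks and missingDescriptors are names invented for the example.

```scala
// Hypothetical model of the workflow fields needed for this check.
final case class WorkflowDescriptor(gitRepo: String, notebooks: List[String])

object WorkflowChecks {

  // Returns the notebooks referenced by the workflow that have no descriptor
  // of their own among the ones found in the companion package.
  def missingDescriptors(workflow: WorkflowDescriptor, describedNotebooks: Set[String]): List[String] =
    workflow.notebooks.filterNot(describedNotebooks.contains)
}

// Usage:
// val workflow = WorkflowDescriptor("workflow-reports", List("reportA", "reportB"))
// WorkflowChecks.missingDescriptors(workflow, Set("reportA"))  // reportB is missing
```

A non-empty result would be a good reason for the pipeline to stop the build before anything is placed in the workspace.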

The resulting SDLC

By using this methodology we effectively implement the required build-time testing phases and establish the communication with the Delivery Platform build pipelines (the real actor here), which makes the full automation of the whole process possible. So, this methodology meets the initial goals.

Finally, the companion packages follow the standard SDLC for that kind of file and are stored in the artifact repository as usual. This allows us to track the version numbers and restore any version at any time. The packages themselves are never deployed or executed. They are only containers for the real deployable component: the notebook.

Companion packages are only containers, not executable applications. So they are never deployed but only stored in the artifact repository.

CONCLUSIONS

The main conclusions after testing this approach are:

  • By using the notebook/workflow packages we are able to provide all the features expected and required in the SDLC. The Unit Tests, Integration Tests and auxiliary components are able to test the code in the notebook.
  • The Delivery Platform plays the starring role here. The intelligent pipelines are capable of interpreting the structure of the notebook/workflow packages and running the tests.
  • We are delegating to the notebook developer the task of describing the notebook and/or the workflow. This is done only once, with the first version, but it is still error-prone and unreliable. It is clearly a point to improve.
  • Notebooks and notebook platforms present a situation quite similar to Functions as a Service. We have learned a lot from the notebook SDLC work and we’ll use this experience to provide the same for Functions as a Service, ensuring the levels of automation and quality we have to meet.

 

NEXT STEPS

We still need to extend this approach to the other languages used in Databricks notebooks:

  • Python
  • R

In an upcoming post we’ll describe how we provide the standard SDLC for notebooks written in these languages.

 

Cheers!

 

Jesus de Diego

Author: Jesus de Diego

Software Architect and Team Lead
