As we move forward with developing Spark applications, it is important to highlight what we have learned so far. For now, we will keep the focus on notebooks, as they are a special type of application with their own set of rules, but most of what we learned carries over to other types of Spark applications.
What have we learned?
The use of a utils library, explained in the previous post, has proven to be a good approach. Because notebooks are single-file applications, their code can grow quickly and become hard to maintain. So during development, always consider what can be reused and move that piece of code into the utils project.
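As a minimal sketch of this extraction (the class, method, and behavior below are hypothetical, not taken from the actual utils project), a snippet repeated across notebooks can become a single shared utility:

```java
import java.util.Locale;

// Hypothetical utils-project class: names and logic are illustrative only.
// A normalization step that used to be copy-pasted into several notebooks
// now lives in one reusable, unit-testable place.
public final class NotebookStringUtils {

    private NotebookStringUtils() {} // utility class, no instances

    // Normalizes a raw column value the same way every notebook needs it:
    // trim whitespace, lower-case, and map empty strings to null.
    public static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? null : trimmed;
    }
}
```

Each notebook then imports the utils library instead of redefining the logic, so a fix lands in one place.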
The use of other external libraries needs careful evaluation: first, because they share the same cluster with other notebooks, they need to be compatible with each other; second, they need to be able to run against a Spark cluster. For instance, library classes used inside the map-reduce process need to implement java.io.Serializable, since their code can be sent to Spark workers.
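To make the Serializable requirement concrete, here is an illustrative sketch (the Record class is hypothetical): Spark serializes the objects referenced by a transformation before shipping them to workers, much like this manual round-trip through Java serialization. A class that does not implement java.io.Serializable fails at exactly this step with a NotSerializableException.

```java
import java.io.*;

public class SerializableDemo {

    // Hypothetical record used inside a map step; implementing Serializable
    // is what allows it to be shipped to Spark workers.
    static class Record implements Serializable {
        private static final long serialVersionUID = 1L;
        final String key;
        final long count;
        Record(String key, long count) { this.key = key; this.count = count; }
    }

    // Writes the object to bytes, as Spark does before sending a task.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    // Reads the object back, as a worker does on the receiving side.
    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Record original = new Record("clicks", 42);
        Record copy = (Record) deserialize(serialize(original));
        System.out.println(copy.key + "=" + copy.count); // prints "clicks=42"
    }
}
```

If Record dropped `implements Serializable`, the `serialize` call would throw, which is the same failure Spark surfaces at job submission time.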
Unit tests are the first fence against any bugs in the notebook code. Make sure they are properly implemented and cover various scenarios, such as an empty body or invalid data. Coverage is also essential, and you can check the percentage using tools like SonarQube.
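As an illustration of covering those scenarios (the function under test, `parsePercentage`, is hypothetical, and plain `assert` statements stand in for whatever test framework the project actually uses), a test should exercise the happy path plus the empty and invalid cases called out above:

```java
public class PercentageParserTest {

    // Hypothetical function under test: parses "85%" into 85, returning -1
    // for null, empty, or malformed input instead of throwing.
    static int parsePercentage(String s) {
        if (s == null || s.isEmpty() || !s.endsWith("%")) {
            return -1;
        }
        try {
            int value = Integer.parseInt(s.substring(0, s.length() - 1).trim());
            return (value >= 0 && value <= 100) ? value : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // Happy path
        assert parsePercentage("85%") == 85;
        // Empty body and invalid data, the scenarios named in the text
        assert parsePercentage("") == -1;
        assert parsePercentage(null) == -1;
        assert parsePercentage("abc") == -1;
        assert parsePercentage("150%") == -1;
        System.out.println("all cases passed");
    }
}
```

Writing one assertion per scenario like this is also what pushes the coverage percentage up, since each branch of the function is executed by at least one case.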
Functional Tests (Databricks Workspace)
Even if you write unit tests correctly and achieve high code coverage (above 90%), you also need functional tests that run your code inside the Databricks workspace. This will prevent a number of surprises (bugs) when deploying to a stable environment (i.e., the Beta stage). We are currently evaluating JMeter for that purpose, as we already use it for load and stress testing (as a benchmark tool).
We have been working with Spark applications for some time now, but this is still a work in progress. These practices are used successfully in the Computing Platform development, and they will keep growing as we learn more about the technologies in our stack.
In summary, this post highlights two key points: libraries and tests.