Spark App Development – Part II – Best Practices

Introduction

This is the second part of a series of posts about how to develop Spark Applications. The first post was about working with notebooks.

Objectives

As we move forward with developing Spark applications, it is important to highlight what we have learned so far. For now, we will keep the focus on notebooks, as they are a special type of application with their own set of rules, but most of what we learned also applies to other types of Spark applications.

What have we learned?

Utility Library

The utils library explained in the previous post proved to be a really good approach. As a notebook is a single-file application, its code can grow very fast, making it harder to maintain. So during development, we should always consider what can be reused and move that piece of code to the utils project.
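To make the idea concrete, here is a minimal sketch of what such a helper could look like. The object and method names (NotebookUtils, normalizeColumn, loadEvents) are illustrative assumptions, not the actual contents of our utils project:

// Hypothetical helpers extracted from notebooks into a shared utils library.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower, trim}

object NotebookUtils {

  // Normalizes a string column so every notebook applies the same cleanup.
  def normalizeColumn(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, lower(trim(col(column))))

  // Centralizes the read of a shared dataset instead of repeating the
  // path and options in every notebook.
  def loadEvents(spark: SparkSession, path: String): DataFrame =
    spark.read.option("header", "true").csv(path)
}

A notebook can then call NotebookUtils.normalizeColumn(df, "country") instead of duplicating that logic inline, and any fix to the helper benefits every notebook at once.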

Dependency Management

Using external libraries requires careful evaluation: first, because notebooks share the same cluster, the libraries need to be compatible with each other; and second, they need to be able to run against a Spark cluster. For instance, classes from a library that are used inside a map-reduce process need to implement java.io.Serializable, as the code can be shipped to the Spark workers.
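The following sketch shows why this matters. The PriceFormatter class is an invented example: because the map closure captures an instance of it, Spark has to serialize that instance and send it to the workers, which fails with a java.io.NotSerializableException if the class is not Serializable:

import org.apache.spark.sql.SparkSession

// Illustrative only: a helper class captured by a map closure and shipped
// to the workers. Without Serializable, Spark would throw a
// java.io.NotSerializableException while serializing the closure.
class PriceFormatter(currency: String) extends Serializable {
  def format(amount: Double): String = f"$currency $amount%.2f"
}

object SerializationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serialization-example")
      .master("local[*]") // local mode, just for the sketch
      .getOrCreate()

    val formatter = new PriceFormatter("EUR")

    // The closure captures `formatter`, so the instance is serialized
    // and sent to each worker that executes the map.
    val formatted = spark.sparkContext
      .parallelize(Seq(10.0, 25.5, 99.99))
      .map(formatter.format)
      .collect()

    formatted.foreach(println)
    spark.stop()
  }
}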

Unit Tests(Local)

Unit tests are the first line of defense against bugs in the notebook code. Make sure they are properly implemented and cover a variety of scenarios, such as an empty body or invalid data. Coverage is also essential, and you can check the percentage using tools like SonarQube.
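As a sketch of what these local tests can look like, here is a small suite exercising the hypothetical NotebookUtils helper from above against a local SparkSession, including an empty-dataset scenario. It assumes ScalaTest; adapt it to whatever test framework you use:

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Hedged sketch: assumes ScalaTest and the illustrative NotebookUtils
// helper shown earlier in this post.
class NormalizeColumnTest extends AnyFunSuite {

  private val spark = SparkSession.builder()
    .appName("unit-tests")
    .master("local[2]") // local mode, no cluster required
    .getOrCreate()

  import spark.implicits._

  test("normalizes mixed-case values with surrounding whitespace") {
    val df = Seq("  Ireland ", "SPAIN").toDF("country")
    val result = NotebookUtils.normalizeColumn(df, "country")
      .as[String].collect()
    assert(result.sameElements(Array("ireland", "spain")))
  }

  test("handles an empty dataset without failing") {
    val df = Seq.empty[String].toDF("country")
    val result = NotebookUtils.normalizeColumn(df, "country")
    assert(result.count() == 0)
  }
}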

Functional Tests(Databricks Workspace)

Even if your unit tests are written correctly and cover a high percentage of the code (above 90%), you also need functional tests that run your code inside the Databricks workspace. This prevents a number of surprises (bugs) when deploying to a stable environment (e.g. the Beta stage). We are currently evaluating the JMeter tool for that purpose, as we already use it for load and stress testing (see our Benchmark Tool post).
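One possible shape for such a test is to trigger a notebook job through the Databricks REST API and assert on the outcome. The sketch below is an assumption about how this could be wired up (the environment variables, job id, and the plain-HTTP approach are illustrative; it is not our final setup, since we are still evaluating JMeter):

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Hedged sketch: triggers a Databricks notebook job via the Jobs REST API.
// Host, token, and job id are read from hypothetical environment variables.
object FunctionalTestTrigger {
  def main(args: Array[String]): Unit = {
    val host = sys.env("DATABRICKS_HOST")   // workspace URL
    val token = sys.env("DATABRICKS_TOKEN") // personal access token
    val jobId = sys.env("DATABRICKS_JOB_ID")

    val conn = new URL(s"$host/api/2.0/jobs/run-now")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Authorization", s"Bearer $token")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)

    val payload = s"""{"job_id": $jobId}"""
    conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))

    // A real functional test would poll the run status and assert on the
    // notebook's output, not just on the HTTP response code.
    assert(conn.getResponseCode == 200, s"run-now failed: ${conn.getResponseCode}")
    conn.disconnect()
  }
}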

Conclusion

We have been working with Spark applications for some time now, but this is still a work in progress. These practices are used successfully in the Computing Platform development, and the list will grow as we learn more about the technologies in our stack.

In summary, two key points need to be highlighted in this post: Libraries and Tests.

References

https://techblog.fexcofts.com/2018/07/09/spark-app-development-part-i-working-with-notebooks/

https://www.sonarqube.org/

https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html

https://jmeter.apache.org/

https://techblog.fexcofts.com/2018/07/09/benchmark-tool/

https://spark.apache.org/
