This blog post explains how to develop Spark applications. Part I covers Databricks Notebook development.
As explained in previous posts, the Computing Platform will support two types of Spark applications: one implemented using Notebooks and the other packaged as Uber Jars. To understand more about this, read our post.
Worry about what matters, right?
Looking at the previous diagram, it seems like a lot of work to develop the complete pipeline for a Fast Lane Notebook application, right? Actually, most of the work is done by the Delivery Platform. Everything except the Notebook code is done automatically, so the developer just needs to write the code for the business logic, and that’s it. Too good to be true? No, it is not. We are working hard to make the developer's life as easy as possible. You can read more about how it is done in this post.
So, ready to start coding? Wait! It is not that easy. Before you even start, you need to make sure you have the theory in place. Trust me, writing code for a Spark application is completely different from writing a normal Java application. The internet is full of Spark development courses that will help you understand the principles. If you have already worked with the Java 8 Streams API, you already understand some of the key concepts, but there is a lot more to learn from there.
Spark SQL makes it easier for developers who are used to SQL-like calls to interact with Spark. In this module you can find the DataFrame API, which holds data in a table structure with a schema and data types. That helps you transform data and join columns from different datasets.
There are many cases where the code is very similar across Spark applications and should be reused. That is where the Utility Library project comes into play. It holds boilerplate code, such as the code to connect to EventHub, to the Reading Data Base (RDB), or to the Data Lake. This speeds up development, because it is tested code ready for reuse.
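To make the DataFrame idea concrete, here is a minimal sketch of a join followed by a column transformation. The dataset and column names (orders, customers, customer_id and so on) are made up for illustration and are not taken from our platform. Inside a Databricks Notebook the spark session already exists, so only the enrich method would be needed there.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

object DataFrameJoinSketch {
  // Joins orders to customers on customer_id and adds an upper-cased name column.
  // All column names here are hypothetical.
  def enrich(orders: DataFrame, customers: DataFrame): DataFrame =
    orders
      .join(customers, Seq("customer_id"), "inner")
      .withColumn("customer_name_upper", upper(col("customer_name")))

  def main(args: Array[String]): Unit = {
    // Local session for experimenting outside Databricks.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val orders = Seq((1, 100, 25.0), (2, 101, 40.0))
      .toDF("order_id", "customer_id", "amount")
    val customers = Seq((100, "Alice"), (101, "Bob"))
      .toDF("customer_id", "customer_name")

    val result = enrich(orders, customers)
    result.printSchema() // shows the schema and data types of the joined table
    result.show()

    spark.stop()
  }
}
```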
For instance, the notebook code calls StreamFactory from the utils project. Internally, the utils project calls EventHub to receive and to send a stream of data. Our notebook code does not need to know about that. It just calls the consumer and the producer. With this encapsulation, we can support more streams without changing the notebook code.
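The encapsulation idea can be sketched roughly as follows. Only the name StreamFactory comes from our utils project. The traits, method names and the in-memory stream are hypothetical stand-ins so the example runs without EventHub, and the real library will look different.

```scala
import scala.collection.mutable

// Hypothetical interfaces: the notebook only ever sees these two traits.
trait StreamConsumer { def receive(): Seq[String] }
trait StreamProducer { def send(messages: Seq[String]): Unit }

// In-memory stand-in for a real stream such as EventHub (assumption for this sketch).
final class InMemoryStream extends StreamConsumer with StreamProducer {
  private val buffer = mutable.ArrayBuffer.empty[String]
  def send(messages: Seq[String]): Unit = { buffer ++= messages; () }
  // Returns everything received so far and empties the buffer.
  def receive(): Seq[String] = { val out = buffer.toList; buffer.clear(); out }
}

// The factory hides which stream technology sits behind a given name.
object StreamFactory {
  private val streams = mutable.Map.empty[String, InMemoryStream]
  private def stream(name: String): InMemoryStream =
    streams.getOrElseUpdate(name, new InMemoryStream)

  def consumer(name: String): StreamConsumer = stream(name)
  def producer(name: String): StreamProducer = stream(name)
}

// Notebook-side code: no knowledge of the underlying stream implementation.
object NotebookLogic {
  def run(): Unit = {
    val in  = StreamFactory.consumer("input")
    val out = StreamFactory.producer("output")
    out.send(in.receive().map(_.toUpperCase))
  }
}
```

Swapping the in-memory stream for another implementation would only touch the factory, which is the point of keeping this code in the utility library.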
What about Tests?
One of the key points during development is to guarantee that all written code is testable and that the tests cover different scenarios. One option is to write the test code inside the Notebook itself, but this has many disadvantages: it mixes production code with test code and can make the Notebook harder to maintain. Another way is to create a Scala project (sbt or Maven) and create a testable version of the Notebook inside it. The structure could be something like this:
The class NotebookTestable has the same code that you created inside the Databricks Workspace. The only difference is that, to have the spark variable available locally, your class needs to extend the SparkSessionWrapper class, which makes that possible:
The tests should be very similar to those of other Spark applications. To make sure that your notebook code is “testable”, it is important to break it into methods that the test class can call and whose results it can analyze. For instance:
This method receives a DataFrame as input and returns another DataFrame with the transformed data. This should be easy to test, because you control the values you pass in and can compare the output with the expected results.
This was just a first deeper look into Spark application development, and it is an ongoing process for us as well. So, if I could summarize what I have learned so far:
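As a rough sketch, assuming an sbt project, the layout might look like the tree below. The project name, package path and the Spec file name are hypothetical. Only NotebookTestable and SparkSessionWrapper are names used in this post.

```text
notebook-project/
├── build.sbt
└── src/
    ├── main/scala/com/example/notebook/
    │   ├── NotebookTestable.scala
    │   └── SparkSessionWrapper.scala
    └── test/scala/com/example/notebook/
        └── NotebookTestableSpec.scala
```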
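A common way to write such a wrapper is a trait with a lazily created local session. The sketch below assumes that shape. The master setting, the application name and the numbers method are illustrative choices, not the exact code from our project.

```scala
import org.apache.spark.sql.SparkSession

// Assumed shape of SparkSessionWrapper: exposes a `spark` value like the one
// Databricks provides, but backed by a local session.
trait SparkSessionWrapper {
  lazy val spark: SparkSession =
    SparkSession.builder()
      .master("local[*]")        // run Spark inside the local JVM
      .appName("notebook-local") // hypothetical application name
      .getOrCreate()
}

// The notebook code moves into a class that mixes in the wrapper,
// so every use of `spark` keeps working unchanged.
class NotebookTestable extends SparkSessionWrapper {
  // Placeholder for real notebook logic: counts the rows of a generated range.
  def numbers(n: Long): Long = spark.range(n).count()
}
```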
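Here is a minimal sketch of such a method together with a check that exercises it. The business rule (dropping non-positive amounts and adding a tax column), the column names and the object names are invented for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object Transformations {
  // Hypothetical rule: keep positive amounts and add an amount_with_tax column.
  def withTax(input: DataFrame, rate: Double): DataFrame =
    input
      .filter(col("amount") > 0)
      .withColumn("amount_with_tax", col("amount") * (1.0 + rate))
}

object TransformationsCheck {
  // Builds a small, fully controlled input and returns the computed column.
  def run(): Seq[Double] = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transformations-check")
      .getOrCreate()
    import spark.implicits._

    val input = Seq(("a", 100.0), ("b", -5.0)).toDF("id", "amount")
    Transformations
      .withTax(input, rate = 0.5)
      .select("amount_with_tax")
      .as[Double]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // Only the positive row survives the filter, and 100.0 * 1.5 is 150.0.
    assert(run() == Seq(150.0))
  }
}
```

In a real project the same assertion would live in a test class run by your test framework of choice. A plain assert is used here only to keep the sketch free of extra dependencies.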
- All configuration should be externalized using config files;
- Any code that is not specific to your current notebook implementation should be moved to the utility library;
- As your knowledge increases, you will learn to write better code with the Spark APIs, so don’t be afraid of refactoring;
- Functional and non-functional tests are a crucial part of the development lifecycle, so make sure you spend proper time on them.
Those are just initial tips, but to fully understand Spark development you have to get your hands dirty. Get some examples from the internet and start playing with them. Good luck! 🙂
In the next part of this post, we will talk about working with Slow Lane Spark applications and Uber Jars.
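To illustrate the first tip, here is one simple way to externalize configuration, assuming a .properties-style file and using only the JDK. The key names and the NotebookConfig class are hypothetical, and a dedicated config library would work just as well.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical settings for a notebook; key names are illustrative only.
final case class NotebookConfig(inputStream: String, outputPath: String, batchSize: Int)

object NotebookConfig {
  // Parses .properties-formatted text; in practice the text would be read from a file.
  def fromProperties(text: String): NotebookConfig = {
    val props = new Properties()
    props.load(new StringReader(text))

    // Fails fast with a clear message when a mandatory key is absent.
    def required(key: String): String =
      Option(props.getProperty(key)).getOrElse(sys.error(s"Missing config key: $key"))

    NotebookConfig(
      inputStream = required("input.stream"),
      outputPath  = required("output.path"),
      batchSize   = props.getProperty("batch.size", "100").toInt // optional, defaults to 100
    )
  }
}
```

With this in place, the notebook code reads values from a NotebookConfig instance instead of hard-coding stream names or paths.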