Computing Platform (1): Slow Lane Parallel Processing

Objectives

The goal is to make it easy to run long-running operations that require a large amount of computing resources, typically data transformations, reporting, data loads, batch processes and calculations for data sectors. The challenge is to manage the requests for these operations, whether they come from users or from scheduled jobs.

The main objectives are:

  1. Manage these operations as jobs, providing new, richer management of the available resources.
  2. Isolate the execution of these processes. If one of them fails, it should not affect the other processes.
  3. Avoid blocking storage. Minimize the impact on the data storage by avoiding high concurrency of long-running operations, and improve data handling by avoiding deadlocks and excessive consumption of CPU and memory by the database cluster.
  4. Avoid blocking actions and users. Improve the user experience by using asynchronous processes that do not block requests, moving the feedback and the output of the requested long-running operations to Managed Jobs consoles where users can monitor the status of their operations.
  5. Follow the Big Data principle of moving the logic to the data, not the data to the logic. The user experience improves dramatically when the result of a long-running operation is pushed to the user application together with a link to download the generated document or file.

To achieve these objectives we need to define a new architecture that minimizes the computing resources needed in the Cloud, handling the software components as code on demand using the serverless approach.

We also need to take advantage of containers and Cloud services, combining them with native parallel processing. In this case we have chosen Apache Spark as the best platform to support this kind of operation.

We want to provide as much flexibility as possible, making it viable to use JVM-based languages (Scala, Java) as well as script-based languages (Python, R) commonly used in the Data Science field. This variety of languages increases our capacity to perform more complex and efficient operations.

We delegate heavyweight operations such as report creation, data analysis, data transformation and heavy calculations to a specific lane based on messages and events.

Applications and functions are submitted to the parallel processing engine (Apache Spark) in an automated, scalable way by using pre-defined jobs.
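As a rough illustration of what triggering a pre-defined job could look like, the sketch below builds (but does not send) a request to the run-now operation of the Databricks Jobs REST API. The workspace URL, the token, the job id, the parameter names and the API version in the path are placeholders chosen for this example, not values from our prototype.

```python
import json
from urllib.request import Request

# Hypothetical values: replace with a real workspace URL and access token.
WORKSPACE_URL = "https://example.cloud.databricks.com"
API_TOKEN = "dapi-placeholder"


def build_run_now_request(job_id, notebook_params=None):
    """Build (without sending) a run-now request for a pre-defined job."""
    payload = {"job_id": job_id}
    if notebook_params:
        # Parameters forwarded to a notebook-based job.
        payload["notebook_params"] = notebook_params
    return Request(
        url=f"{WORKSPACE_URL}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# The job id and the parameter name are made up for the example.
request = build_run_now_request(42, {"report_date": "2020-01-31"})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Because the job is defined in advance, the caller only needs the job id and the parameters of the run; sending the request is a matter of passing it to urllib.request.urlopen with valid credentials.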

 

Architecture

The diagram above shows the chosen flow for heavyweight operations.

  1. The request from a client is authorized in the API Gateway and redirected to an endpoint implemented as a Function, a really small software component (a nano-service).
  2. The Endpoint processes the request, adds correlation information and raises an event using a queue of the Message Bus. This queue archives all the traffic in Storage for traceability. By using messaging we decouple components and make the definition of the IT infrastructure easier (e.g. by making Load Balancers for HTTP-based requests unnecessary).
  3. The Mediator is an ultra-light software component with a very low footprint (CPU and memory). It implements the MaaA (Microservice as an Agent) pattern. Yes, we invented this pattern as an improvement on existing Integration Patterns, better adapted to this kind of architecture. The low footprint allows us to deploy and manage a high number of very small MaaAs. Given the number of these nano/micro-services running at the same time, it is critical to use system languages (Go, C++) so that replacing any of them is as easy and fast as possible. In this case Go is invaluable for developing MaaAs quickly. Each MaaA is in charge of only one specific heavy operation (Managed Job) and listens to the queue. The Mediator validates the request, transforms, enriches, splits or filters it where needed (according to the specific type of operation in the Job Request) and sends the job data to the next queue, which the Job Keeper listens to.
  4. The Job Keeper is the gate to the Databricks API. It handles the Job Requests and sends the queries to the Databricks API for management. Its main purposes are:
    1. Manage the time for running a Job.
    2. Monitor the general activity of the target Spark cluster (number of all concurrent running jobs).
    3. Monitor the number of concurrent running jobs for a specific operation.
    4. Postpone or delay the run of Job Requests according to the level of activity in the clusters.
    5. Create new Databricks Spark clusters when needed and terminate them when they are no longer needed, making the system scalable.
  5. The Job Keeper runs a Databricks job through the API. The job can be based on a Notebook (written in Java, Scala, Python or R) or on a driver application in a JAR. The Databricks job has been defined previously. Once Databricks receives the request from the Job Keeper, it runs the job against the Spark cluster.
  6. Once the task is finished, the result is stored in a Data Lake.
  7. The final signal is sent to the Mediator, which notifies the client (through the Endpoint) about the completion of the task. Given that the heavy-load tasks are asynchronous, the signal is used to notify the user and/or update the task scheduler.
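To make steps 2 and 3 more concrete, here is a minimal sketch of the Endpoint and Mediator roles. It assumes in-memory queues as stand-ins for the Message Bus, a made-up operation name and a made-up message layout; the real components are FaaS endpoints and Go micro-services, so this only illustrates the flow of a Job Request.

```python
import queue
import uuid

# In-memory stand-ins for two Message Bus queues (an assumption of this sketch).
request_queue = queue.Queue()  # Endpoint -> Mediator
job_queue = queue.Queue()      # Mediator -> Job Keeper


def endpoint(operation, params):
    """Add correlation information to the request and raise an event."""
    message = {
        "correlation_id": str(uuid.uuid4()),
        "operation": operation,
        "params": params,
    }
    request_queue.put(message)
    # The id is returned immediately, so the caller is not blocked by the job.
    return message["correlation_id"]


def mediator(required_params):
    """Validate one request and forward it to the Job Keeper queue if valid."""
    message = request_queue.get()
    missing = [name for name in required_params if name not in message["params"]]
    if missing:
        return dict(message, status="rejected", missing=missing)
    job_request = dict(message, status="accepted")
    job_queue.put(job_request)
    return job_request


# "monthly_report" and "month" are hypothetical names used only for illustration.
correlation_id = endpoint("monthly_report", {"month": "2020-01"})
result = mediator(required_params=["month"])
print(result["status"], result["correlation_id"] == correlation_id)
```

The correlation id travels with the message through every queue, which is what later allows the final signal to be matched with the original client request.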

Description

The problems that we faced at the beginning of the PoC were, mainly:

1. The software can be retrieved in several states and from several repositories. Which ones?

  • Source code in Scala and Java (so, the target is the SCM: Git)
  • Source code in Python (same)
  • Pre-built packages as JAR files (artifact repository)
  • How to handle the retrieval of the code?

2. We want to minimize the computing time of the software in the Cloud (and pay less, of course). That means we need to be flexible and cover all cases, both source code and JAR files. In the case of source code, how do we handle the build process needed for Scala and Java?

3. We need to isolate the execution of the push action and be able to spawn many of them at the same time. How do we do this?
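One possible way to reason about the first two questions is to describe, for each kind of driver, the steps needed to obtain something runnable. The sketch below is only an illustration under our own assumptions (the function name, the repository URLs and the choice of curl for downloads are hypothetical); it returns the steps without executing them.

```python
def retrieval_plan(kind, location):
    """Return the ordered steps needed to obtain a runnable artifact.

    Each step is a (working_directory, command) pair; nothing is executed here.
    """
    clone = (".", ["git", "clone", "--depth", "1", location, "src"])
    if kind == "jar":
        # Pre-built package: download it from the artifact repository.
        return [(".", ["curl", "-fLO", location])]
    if kind == "python":
        # Script language: the source code is enough, no build step.
        return [clone]
    if kind == "scala":
        # Compiled language: clone the source, then package it with SBT.
        return [clone, ("src", ["sbt", "package"])]
    raise ValueError(f"unsupported driver kind: {kind}")


# Hypothetical locations used only for illustration.
examples = [
    ("scala", "https://git.example.com/team/driver-scala.git"),
    ("python", "https://git.example.com/team/driver-python.git"),
    ("jar", "https://artifacts.example.com/driver-1.0.jar"),
]
for kind, location in examples:
    for cwd, command in retrieval_plan(kind, location):
        print(kind, cwd, " ".join(command))
```

Keeping the plan as data makes the cost visible: only the compiled languages pay for a build step, while pre-built JAR files and scripts skip it.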

 

The keys of the Proof of Concept were based on:

  1. The search for the best management and control of resources, especially the Databricks cluster and the impact on the data stores of long-running operations that use them massively.
  2. The submission of the function/application to the processing engine (Apache Spark), assuring parallelism by default.
  3. The isolation and management of the operations.
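The resource control described here can be illustrated with a small admission-control sketch: a job runs only if both the cluster-wide limit and the per-operation limit allow it, otherwise it is postponed and retried when another job finishes. The class, its limits and the operation names are hypothetical; the real Job Keeper obtains the activity figures from the Databricks API rather than from local counters.

```python
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class JobKeeper:
    max_cluster_jobs: int        # limit for all concurrent jobs in the cluster
    max_jobs_per_operation: int  # limit for one specific operation
    running: Counter = field(default_factory=Counter)
    postponed: deque = field(default_factory=deque)

    def submit(self, operation):
        """Run the job if both limits allow it, otherwise postpone it."""
        cluster_busy = sum(self.running.values()) >= self.max_cluster_jobs
        operation_busy = self.running[operation] >= self.max_jobs_per_operation
        if cluster_busy or operation_busy:
            self.postponed.append(operation)
            return "postponed"
        self.running[operation] += 1
        return "running"

    def finish(self, operation):
        """Release a slot and retry the postponed jobs in their arrival order."""
        if self.running[operation] <= 0:
            raise ValueError(f"no running job for operation: {operation}")
        self.running[operation] -= 1
        waiting = list(self.postponed)
        self.postponed.clear()
        for pending in waiting:
            self.submit(pending)


keeper = JobKeeper(max_cluster_jobs=2, max_jobs_per_operation=1)
for operation in ["report", "report", "load", "calc"]:
    print(operation, keeper.submit(operation))
keeper.finish("report")
print(dict(keeper.running), list(keeper.postponed))
```

Postponing instead of rejecting is what protects the data stores: the requests are kept, but they reach the cluster only when there is capacity for them.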

Regarding the Driver Application used in the jobs defined for the Apache Spark cluster, we used several modes:

  • Python notebook
  • Scala notebook
  • Java notebook
  • R notebook
  • JAR (pre-built JAR file in an artifact repository, written of course in Java or Scala)
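In the Databricks Jobs API these modes map to two kinds of task settings: notebook tasks and JAR tasks. The sketch below builds both as plain dictionaries; the helper names, the paths, the class name and the parameters are hypothetical, and the dictionaries are only fragments of a complete job definition.

```python
def notebook_task(notebook_path, parameters=None):
    """Task settings for a job whose driver is a notebook."""
    return {
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": dict(parameters or {}),
        }
    }


def jar_task(main_class_name, jar_uri, arguments=None):
    """Task settings for a job whose driver is a pre-built JAR file."""
    return {
        "spark_jar_task": {
            "main_class_name": main_class_name,
            "parameters": list(arguments or []),
        },
        # The JAR itself is attached to the job as a library.
        "libraries": [{"jar": jar_uri}],
    }


# Hypothetical paths, class name and parameters.
print(notebook_task("/Shared/slow-lane/monthly_report", {"month": "2020-01"}))
print(jar_task("com.example.ReportDriver", "dbfs:/jars/report-driver.jar", ["2020-01"]))
```

The language of a notebook is a property of the notebook itself, so the four notebook modes share the same task settings and differ only in the notebook they point to.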

 

Results and Applicability

The Proof of Concept has included:

  • Creation, configuration and test of FaaS as endpoints and action triggers
  • Creation and configuration of Event Hubs and Message Bus Queues
  • Connection of FaaS and Micro-services to Event Hubs (streams) and Message Bus Queues as producers and consumers
  • Configuration of Event Hubs (streams) and Queues to write traces in Storage for traceability and analysis
  • Development of Golang-based Micro-services for ultra-low-footprint components, saving resources in the Cloud
  • Development of Scala-based Micro-services taking advantage of non-blocking techniques and reactive programming
  • Full containerization. Creation of container descriptors for all components
  • Enablement, management and configuration of Spark Cluster (Parallel Processing) through the Databricks REST API
  • Creation and configuration of Data Storage and Data Lakes across the Job Request process pipeline

The PoC has been invaluable, as it has allowed us to investigate and go deeper into new architectural patterns better adapted to Cloud Native systems. All these lessons will be integrated into the final system.

For now the prototype supports Scala (with SBT as the builder), Golang, and Python as a scripting language. Git is the supported SCM.

 

Deliverables

Prototypes: These software components have been designed to speed up the development of Cloud-native micro-service endpoints and mediators, including all the features needed for a production environment (circuit breaking, centralized logging, etc.).

Slow Lane Endpoint Prototype

Slow Lane Mediator Prototype

Slow Lane Job Keeper

Slow Lane Driver Application Prototype

(Contact us for access to the Git repositories)

PROS

  • This approach allows us to run any kind of driver application, making development with Spark jobs much more flexible.
  • Isolation of processes. The failure of one of them never affects the rest.
  • The approach also gives good visibility and control of processes.
  • The result dramatically improves the user experience and mitigates the risk of high resource consumption in the data layer.
  • The isolation of processes is optimal and allows fast failure and recovery, improving the Mean Time to Recovery (MTTR).

CONS

  • Microservice architectures (MSA) make the release and control of components in software projects more complex.

Next Steps

  1. We plan to integrate Java (with Maven and Gradle as builders) and R as a scripting language for the creation of statistics.
  2. Functions, scripts, etc. have to be managed under the Continuous Integration/Continuous Delivery workflows.
  3. In addition, test coverage and reporting are key aspects of any development and they must be applied to Functions and scripts as well.
  4. Traceability (unified logging), alerts, promotions, etc. are tasks that will be covered in a coming post.

 

Coming Soon

Computing Platform (2): Slow Lane Job Keeper

Computing Platform (3): Fast Lane Processing

Author: Jesus de Diego

Software Architect and Team Lead
