Writer : Mr. George Miguel

Apache Spark is an open-source framework for running big data processing tasks in parallel across a cluster of computers. It is one of the most widely used distributed processing frameworks in the world.


What is Apache Spark?

As data grows exponentially, Apache Spark has become one of the most widely used frameworks for distributed scale-out data processing, running on millions of servers both on-premises and in the cloud.

Apache Spark is a fast, general-purpose analytics engine for large-scale data processing that runs on YARN, Apache Mesos, Kubernetes, or in standalone mode. Spark makes it simple to build parallel applications in Scala, Python, R, or SQL, whether from an interactive shell, a notebook, or a packaged application, using high-level operators and libraries for SQL, stream processing, machine learning, and graph processing. Spark supports batch and interactive analytics through a functional programming model and its associated query engine, Catalyst, which converts jobs into query plans and schedules the operations in each plan across the nodes of a cluster.

On top of the Spark core data processing engine sit libraries for SQL and DataFrames, machine learning, graph processing (GraphX), and stream processing. With these libraries, Spark can process massive datasets from a variety of sources, such as HDFS, Alluxio, Cassandra, HBase, and Hive.



Apache Spark continues the effort to analyze big data that Apache Hadoop began more than 15 years ago, and it has become the leading framework for large-scale distributed data processing.

In the early 2010s, as big data analytics grew more popular, the performance limitations of Hadoop MapReduce became a growing burden. MapReduce checkpoints intermediate results to disk, which limits performance. Its low-level programming model also hampered Hadoop's adoption.

Apache Spark was developed at the AMPLab at the University of California, Berkeley, with the goal of keeping the benefits of MapReduce's scalable, distributed, fault-tolerant processing framework while making it more efficient and easier to use. Spark is more efficient than MapReduce for data pipelines and iterative algorithms because it runs lightweight multi-threaded tasks instead of starting and stopping processes, and because it caches data in memory across iterations. For ease of use, Spark offers distributed, fault-tolerant DataFrames, which improve parallel performance and bring the convenience of SQL.

Spark became an Apache Software Foundation top-level project in 2014 and is now used by thousands of data scientists and engineers at more than 16,000 companies. Because Spark processes data in memory, some tasks complete up to 100 times faster than with MapReduce. More than 1,000 contributors from over 250 companies have built these capabilities in an open community. The founders of Databricks originally created Spark, and their platform alone spins up over 1 million virtual machines per day to analyze data.



Each new version of Spark brings features that make code easier to write and faster to run. Apache Spark 3.0 continues this trend with improved Spark SQL performance and support for NVIDIA GPU acceleration.

GPUs are increasingly popular because of their low price per flop (a measure of compute performance) and because their parallel processing speeds up multi-core servers, which addresses the current compute performance bottleneck. A CPU consists of a few cores optimized for sequential, serial processing. A GPU, by contrast, has a massively parallel architecture made up of thousands of smaller, more efficient cores designed to handle many tasks at once, so it can process data much faster than a CPU-only configuration. In recent years GPUs have driven rapid progress in training deep learning (DL) and machine learning (ML) models. By some estimates, however, data scientists spend up to 80 percent of their time preprocessing data.

Spark distributes computation across nodes in the form of partitions, but within a partition, computation has traditionally been performed on CPU cores. Spark introduced in-memory data processing to alleviate Hadoop's I/O problems, but for a growing number of applications the bottleneck has now shifted from I/O to computation. GPU-accelerated computation can relieve this bottleneck.
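The partition model described above can be illustrated with a minimal sketch in plain Python. This is an analogy, not Spark's API: threads stand in for executor cores, and the function names (`split_into_partitions`, `partial_sum_and_count`, `distributed_mean`) are hypothetical helpers introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def partial_sum_and_count(partition):
    """Work done independently inside a single partition."""
    return sum(partition), len(partition)


def distributed_mean(records, num_partitions=4):
    """Compute a mean by combining per-partition partial results."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for the cores that would process each partition.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


mean = distributed_mean(list(range(1, 101)))
```

Each partition is processed without reference to the others, and only the small partial results are combined at the end. That independence is what lets the per-partition work be moved from CPU cores to a GPU.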

To meet and exceed modern data processing requirements, NVIDIA has collaborated closely with the Apache Spark community to bring GPUs into Spark, through the release of Spark 3.0 and the open source RAPIDS Accelerator for Spark. GPU acceleration in Spark brings several benefits:


  • Data processing, queries, and model training all run faster, so results are delivered more quickly.
  • It's not necessary to set up separate clusters to run Spark and ML/DL, because the same GPU-accelerated infrastructure can be used for both frameworks.
  • Because fewer servers are needed, infrastructure costs can be reduced.



Data science and analytics pipelines can now be run end-to-end on GPUs using the RAPIDS suite of open-source libraries and APIs. This significantly speeds up processing of even the largest datasets. No code changes are required to use the RAPIDS Accelerator for Apache Spark, which is built on top of NVIDIA CUDA and UCX.

Accelerated SQL/DataFrame

Spark 3.0 supports SQL optimizer plugins that process data in columnar batches instead of rows. Columnar data is GPU-friendly, and this is the feature the RAPIDS Accelerator plugs into to accelerate SQL and DataFrame operators. With the RAPIDS Accelerator, the Catalyst query optimizer is modified to identify the operators in a query plan that can be accelerated with the RAPIDS API, largely a one-to-one mapping, and to schedule those operators on GPUs in the Spark cluster when the query plan executes.
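Because the accelerator is a plugin, it is typically switched on through configuration rather than code. The sketch below builds a `spark-submit` command line from a dictionary of settings. The keys `spark.plugins` and `spark.rapids.sql.enabled` and the class name `com.nvidia.spark.SQLPlugin` come from the RAPIDS Accelerator documentation as best I recall it. The helper `spark_submit_args` and the application name `etl_job.py` are hypothetical, and the sketch assumes the plugin jar has already been made available to the cluster.

```python
def spark_submit_args(conf, app):
    """Render a dict of Spark settings as spark-submit arguments."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Settings that load the RAPIDS Accelerator plugin and enable SQL acceleration.
rapids_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
}

# "etl_job.py" is a placeholder for an existing, unmodified Spark application.
command = spark_submit_args(rapids_conf, "etl_job.py")
```

The application itself stays unchanged; only the submission settings differ between a CPU run and a GPU-accelerated run.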

Accelerated Shuffle

Spark operations that sort, group, or join data by value must move data between partitions. This process, called a shuffle, involves disk I/O, data serialization, and network I/O. The new RAPIDS Accelerator shuffle implementation uses UCX to transfer data directly between GPU memories, both within a node and across nodes, which keeps as much data as possible on the GPU and finds the fastest available path between nodes.
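To see why grouping by value forces data to move, consider this minimal plain-Python sketch of a hash-partitioned shuffle. It is an illustration of the idea only, not Spark's or RAPIDS' implementation, and the function names are hypothetical. A CRC32 checksum is used as the hash so that the partition assignment is the same on every run.

```python
import zlib
from collections import defaultdict


def target_partition(key, num_partitions):
    """Pick the partition that owns a key (same answer on every node)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(partitions, num_partitions):
    """Move every (key, value) record to the partition that owns its key."""
    shuffled = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            shuffled[target_partition(key, num_partitions)].append((key, value))
    return shuffled


def sum_by_key(partitions, num_partitions=2):
    """Group-by-sum: shuffle first, then aggregate inside each partition."""
    totals = {}
    for partition in shuffle_by_key(partitions, num_partitions):
        local = defaultdict(int)
        for key, value in partition:
            local[key] += value
        # Safe to merge: after the shuffle, a key lives in only one partition.
        totals.update(local)
    return totals


before = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
totals = sum_by_key(before)
```

Before the shuffle, records for key `"a"` sit in two different partitions. Afterwards they sit together, which is what makes the per-partition aggregation correct. In a real cluster that movement is the expensive part, and it is the step the accelerated shuffle targets.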

Accelerator-aware scheduling

As part of a major effort to better integrate deep learning and data processing on Spark, Apache Spark 3.0 makes GPUs a schedulable resource. Spark can now schedule executors with a specified number of GPUs, and users can specify how many GPUs each task requires. Spark passes these resource requests to the underlying cluster manager, such as Kubernetes or YARN, and applications can also discover which GPUs the cluster manager assigned. Previously, users who ran ML applications that required GPUs had to work around the lack of GPU scheduling in Spark.
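The executor-level and task-level GPU requests interact with the usual CPU settings. The sketch below is a simplified model, not Spark's scheduler: it estimates how many tasks can run at once on one executor as the smaller of the CPU limit and the GPU limit. The configuration keys are standard Spark 3.0 property names, the values are illustrative, and the helper `concurrent_tasks_per_executor` is hypothetical.

```python
from fractions import Fraction


def concurrent_tasks_per_executor(conf):
    """Estimate task slots per executor from CPU and GPU requests."""
    cpu_slots = int(conf["spark.executor.cores"]) // int(conf.get("spark.task.cpus", "1"))
    # Fractions avoid float rounding when a task asks for part of a GPU.
    gpu_slots = Fraction(conf["spark.executor.resource.gpu.amount"]) // Fraction(
        conf["spark.task.resource.gpu.amount"]
    )
    # Whichever resource runs out first limits concurrency.
    return min(cpu_slots, gpu_slots)


gpu_conf = {
    "spark.executor.cores": "8",
    "spark.executor.resource.gpu.amount": "1",   # one GPU per executor
    "spark.task.resource.gpu.amount": "0.25",    # four tasks share that GPU
}
slots = concurrent_tasks_per_executor(gpu_conf)
```

With these illustrative values the GPU is the limiting resource, so only four of the eight cores are busy at a time. Lowering the per-task GPU amount or the core count brings the two limits back into balance.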

Accelerated XGBoost

XGBoost is a distributed gradient-boosted decision tree (GBDT) library used to train machine learning models.

It is a leading ML library for regression, classification, and ranking problems, and it provides parallel tree boosting. The RAPIDS team works closely with the Distributed Machine Learning Common (DMLC) XGBoost organization, and XGBoost now includes seamless, drop-in GPU acceleration. Through its integration with the RAPIDS Accelerator, XGBoost on Spark 3.0 can accelerate SQL/DataFrame operations on the GPU, shorten XGBoost training time, and use GPU memory more efficiently.



Fraud detection

Spark's speed is valuable wherever decisions must be made quickly using multiple data sources. Credit card fraud, for example, is often detected by looking at the volume and location of transactions on the same account. If an account shows more transactions than a single person could plausibly make, or several transactions in locations too far apart to travel between in the time available, the account has likely been compromised.
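The two signals described above, volume and location, can be sketched as a simple rule in plain Python. This is a toy illustration under stated assumptions, not a production fraud model and not Spark code. The thresholds (`max_per_hour`, `max_speed_kmh`) and the function names are hypothetical, and each transaction is assumed to be a tuple of hours since some start time, latitude, and longitude, sorted by time.

```python
from math import asin, cos, radians, sin, sqrt


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def is_suspicious(transactions, max_per_hour=10, max_speed_kmh=900.0):
    """Flag an account on excessive volume or implausible travel speed."""
    times = [t for t, _, _ in transactions]
    start = 0
    for end in range(len(times)):
        # Slide the window so it spans at most one hour.
        while times[end] - times[start] > 1.0:
            start += 1
        if end - start + 1 > max_per_hour:
            return True
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(transactions, transactions[1:]):
        hours = max(t2 - t1, 1e-9)  # guard against division by zero
        if distance_km((lat1, lon1), (lat2, lon2)) / hours > max_speed_kmh:
            return True
    return False


new_york, london = (40.71, -74.01), (51.51, -0.13)
flagged = is_suspicious([(0.0, *new_york), (1.0, *london)])
```

In a Spark job, a check like this would run per account after transactions are grouped by account ID, so that each account's history is evaluated in parallel.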

Based on account holder behavior patterns, financial institutions can use Apache Spark to create an integrated picture of each customer. It is possible to use machine learning to identify anomalous patterns based on previously observed patterns. In addition, this can help the institution better tailor its products and services to individual customers.



By some estimates, adverse drug interactions rank among the leading causes of death in the United States, ahead of pulmonary disease, diabetes, and pneumonia. Determining how multiple medications interact to harm a patient is a difficult task, and it grows harder every year as new medications are introduced.

Using Spark, a data scientist can create algorithms that search millions of case records for mentions of specific drug types. A patient's medical history and pre-existing conditions can influence the effectiveness of certain medication combinations. With the results, doctors and pharmacists can be alerted to a possible adverse reaction before a prescription is written or filled.
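A minimal plain-Python sketch of the record search might count how often pairs of drugs are mentioned in the same case record. The drug names, record texts, and the function `co_mention_counts` are all hypothetical placeholders. The sketch uses naive whitespace tokenization, so punctuation attached to a name would cause a miss, and it says nothing about whether a pair actually interacts.

```python
from collections import Counter
from itertools import combinations


def co_mention_counts(case_records, drugs):
    """Count how many records mention each pair of drugs together."""
    counts = Counter()
    for text in case_records:
        words = set(text.lower().split())
        # Sort so each pair is always counted under the same ordering.
        mentioned = sorted(d for d in drugs if d in words)
        counts.update(combinations(mentioned, 2))
    return counts


# Placeholder records and drug names, for illustration only.
records = [
    "patient took drug_a and drug_b",
    "drug_a only",
    "drug_b with drug_a and drug_c",
]
pair_counts = co_mention_counts(records, {"drug_a", "drug_b", "drug_c"})
```

Each record is processed independently, so on a cluster the per-record step maps naturally onto partitions, with the pair counts summed at the end.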


Real-world examples of how Spark can speed up machine learning from end to end


As an early adopter of Databricks' preview release of Spark 3.0, Adobe was able to build on its strategic AI partnership with NVIDIA. At the NVIDIA GTC conference, Adobe Intelligent Services presented the evaluation results of a GPU-based Spark 3.0 and XGBoost intelligent email solution that optimizes the delivery of marketing messages. In its first tests, Adobe saw a 7x performance increase and a 90% cost reduction. Spark 3.0 helps improve model accuracy by allowing scientists to train models on larger datasets and to retrain them more frequently, which matters to data scientists who need to process terabytes of new data every day. Faster processing also means less hardware is required to deliver results, which yields significant cost savings.

Asked about these advancements, William Yan, senior director of machine learning at Adobe, said: "We are seeing significantly faster performance with NVIDIA-accelerated Spark 3.0 compared to running Spark on CPUs. Our AI-powered features in Adobe Experience Cloud can now take advantage of these game-changing GPU performance gains."

Verizon Media

Verizon Media developed a distributed Spark ML pipeline for XGBoost model training and hyperparameter tuning on a GPU-based cluster to predict customer churn among its tens of millions of subscribers. For hyperparameter search, Verizon Media saw a 3x improvement over a CPU-based XGBoost solution. This improved its ability to find the best hyperparameters for optimized models and maximum accuracy.


Uber makes extensive use of deep learning in areas ranging from self-driving research to trip forecasting and fraud prevention. Uber developed Horovod, a distributed deep learning framework, to make it easier to speed up deep learning projects with GPUs and a data-parallel approach to distributed training. Horovod now supports Spark 3.0 with GPU scheduling and provides a KerasEstimator class that uses Spark Estimators with Spark ML Pipelines for easier integration with Spark. Horovod can train TensorFlow and PyTorch models on Spark DataFrames and scale to hundreds of GPUs in parallel, without specialized code for distributed training. With the accelerator-aware scheduling and columnar processing APIs introduced in Apache Spark 3.0, a production ETL job can pass data to Horovod for distributed deep learning training on GPUs.

Why Apache Spark matters to you

Spark 3.0 is a major step forward for data scientists and engineers who collaborate on analytics and AI: ETL operations are now accelerated, and ML and DL applications can use the same GPU infrastructure.

Data Science Teams

The magic of data science is weighed down by the work needed to get data into a usable form. That work involves sorting and ordering seemingly unstructured data such as ZIP codes, dates, and SKU numbers across millions or billions of records. The more data there is, the longer processing takes. According to some estimates, data scientists spend up to 80% of their time preparing data.

Hadoop was a ground-breaking technology for large-scale data analysis that gave data scientists the ability to query extremely large data stores. However, sorting and exploring data often required multiple passes over the same dataset, so processing times could be long.

Spark was designed specifically for iterative queries across large datasets. With speeds up to 100 times faster than Hadoop MapReduce, it was an instant hit with data scientists. Spark also integrates easily with the languages data scientists favor, such as Python, R, and Scala. This matters because most data scientists have a preferred programming language for working with data.

Additionally, Spark SQL introduced a data abstraction known as the DataFrame, which can store and manipulate both structured and semi-structured data. This brings the familiar language of SQL to data that previously could not be queried in this way. Spark ML offers a uniform set of high-level APIs, built on top of DataFrames, for building ML pipelines or workflows. ML pipelines built on DataFrames combine the scalability of partitioned data processing with the ease of SQL data manipulation.


Data Engineering Teams

Data engineers help bridge the gap between data scientists and software developers. Data scientists choose the right types of data and algorithms to solve a problem. Data engineers then collaborate with data scientists and developers to build the data pipelines for extraction, transformation, storage, and analysis that big data analytics applications depend on.

Spark simplifies the storage equation by abstracting away the complexities. In comparison to Hadoop, the framework is more adaptable and can work with both on-premises and cloud-based storage, including the Hadoop Distributed File System (HDFS). Spark is well-suited for the next generation of internet-of-things applications because it can easily incorporate streaming data sources.
