Introduction to Apache Spark by AWS

Writer: Mr. George Miguel

What is Apache Spark?

Apache Spark is an open-source, distributed processing system used for large-scale datasets. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size. It allows code to be reused across multiple workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph processing, all through the same APIs. A wide range of businesses use it, including FINRA, Yelp, Zillow, DataXu, Urban Institute, and CrowdStrike. With 365,000 meetup members in 2017, Apache Spark is one of the most popular big data distributed processing frameworks.

When and where did Apache Spark get its start?

The Apache Spark project began in 2009 as a collaboration between students, researchers, and faculty at UC Berkeley's AMPLab, which focuses on data-intensive applications. It was created for fast, iterative processing such as machine learning and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce. The first paper, "Spark: Cluster Computing with Working Sets," was published in June 2010, and Spark was open sourced under a BSD license. Spark was accepted into the Apache Software Foundation's (ASF) incubation program in June 2013 and became an Apache Top-Level Project in February 2014. Spark can run standalone, on Apache Mesos, or on Apache Hadoop.

Many organizations now use Spark in conjunction with Hadoop to process large amounts of data, making it one of the most active projects in the Hadoop ecosystem. Spark's 365,000 meetup members in 2017 represented a five-fold increase over two years. Since its inception, more than 1,000 developers from more than 200 organizations have contributed to its development.


How does Apache Spark work?

Hadoop MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm. Developers can write massively parallelized operators without having to worry about work distribution or fault tolerance. However, a MapReduce job runs as a sequence of steps. At each step, MapReduce reads data from the cluster, performs an operation on it, and writes the results back to HDFS. Because each step requires a disk read and a disk write, the latency of disk I/O makes MapReduce jobs slow.
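The map/shuffle/reduce sequence described above can be sketched in a few lines of pure Python. This is an illustration of the MapReduce programming model only, not Hadoop code; the disk round-trips between steps are elided.

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["spark makes big data simple", "big data needs big tools"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts["big"] == 3, counts["data"] == 2
```

In Hadoop, the output of each phase would be persisted to HDFS before the next phase reads it back, which is exactly the disk I/O cost the surrounding text describes.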

Spark was created to address the limitations of MapReduce: it processes data in memory, reduces the number of steps in a job, and reuses data across multiple parallel operations. With Spark, only a single step is needed to read data into memory, perform operations, and write back the results, giving much faster execution. Spark also reuses data via an in-memory cache to speed up machine learning algorithms that repeatedly operate on the same dataset. Data reuse is accomplished through DataFrames, an abstraction over the Resilient Distributed Dataset (RDD): collections of objects that are cached in memory and reused across multiple Spark operations. Especially for machine learning and interactive analytics, this reduces latency so drastically that Spark can be several times faster than MapReduce.
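The payoff of caching can be seen with a toy stand-in for Spark's `.cache()` in pure Python (this is a conceptual sketch, not Spark's implementation): the dataset is loaded once and every later operation reuses the in-memory copy instead of going back to storage.

```python
class CachedDataset:
    """Toy stand-in for a cached dataset: load once, reuse in memory."""
    def __init__(self, loader):
        self.loader = loader      # function that simulates reading from disk
        self.data = None          # in-memory copy after first use
        self.loads = 0            # how many times we hit "disk"

    def get(self):
        if self.data is None:     # only the first access pays the load cost
            self.data = self.loader()
            self.loads += 1
        return self.data

ds = CachedDataset(lambda: list(range(1_000)))
total = sum(ds.get())             # first pass loads the data
mean = total / len(ds.get())      # second pass reuses the cached copy
# ds.loads == 1 even though the dataset was used twice
```

An iterative algorithm that touches the same data hundreds of times pays the load cost once instead of once per iteration, which is the core of Spark's speedup over MapReduce.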


Apache Spark vs. Apache Hadoop

Spark and Hadoop MapReduce are complementary big data frameworks, and many organizations use them together to solve broader business problems.

The Hadoop platform has three main components: HDFS for file storage, YARN for managing computing resources, and the MapReduce programming model for processing large amounts of data. Typical Hadoop implementations also deploy different execution engines, such as Spark, Tez, and Presto.

Spark is an open-source framework that focuses on interactive queries, machine learning, and real-time workloads. It has no storage system of its own; instead, it runs analytics on other popular storage systems such as HDFS, Amazon S3, Amazon Redshift, Couchbase, and Cassandra. Spark on Hadoop leverages YARN to share a common cluster and dataset with other Hadoop engines, ensuring consistent levels of service and response.


What are Apache Spark's advantages?

Apache Spark is one of the most active projects in the Hadoop ecosystem because of its many advantages. These are just a few examples:


Fast

Through in-memory caching and optimized query execution, Spark can run fast analytic queries against data of any size.

Developer Friendly

Apache Spark gives you a wide range of programming languages to build with, including Java, Scala, R, and Python. These APIs make things easy for your developers because they hide the complexity of distributed processing behind simple, high-level operators that dramatically reduce the amount of code required.

Multiple Workloads

Apache Spark is capable of handling multiple workloads, including interactive queries, real-time analytics, machine learning, and graph processing. A single application can combine several of these workloads seamlessly.


Using Apache Spark in a production environment


The Spark framework includes:

  • Spark Core as the foundation of the platform
  • Spark SQL for interactive queries
  • Spark Streaming for real-time analytics
  • Spark MLlib for machine learning
  • Spark GraphX for graph processing


Spark Core

Spark Core is the foundation of the platform. It is responsible for memory management, fault recovery, scheduling, job distribution and monitoring, and interaction with storage systems. Spark Core is exposed through application programming interfaces (APIs) built for Java, Scala, Python, and R. These APIs hide the complexity of distributed processing behind simple, high-level operators.
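The "high-level operators over hidden partitions" idea can be sketched with a toy RDD-like class in pure Python (the `ToyRDD` name and its partitioning scheme are hypothetical, for illustration only; this is not Spark's API): the caller chains `map`, `filter`, and `reduce`, while the class decides how work is split across partitions.

```python
from functools import reduce as fold

class ToyRDD:
    """Minimal sketch of an RDD-style API: high-level operators
    hide how the data is split across partitions."""
    def __init__(self, data, num_partitions=4):
        size = max(1, len(data) // num_partitions)
        self.partitions = [data[i:i + size] for i in range(0, len(data), size)]

    def map(self, f):
        out = ToyRDD([])
        out.partitions = [[f(x) for x in p] for p in self.partitions]
        return out

    def filter(self, pred):
        out = ToyRDD([])
        out.partitions = [[x for x in p if pred(x)] for p in self.partitions]
        return out

    def reduce(self, f):
        # reduce each partition independently, then combine partial results,
        # the way a driver would combine results from worker nodes
        partials = [fold(f, p) for p in self.partitions if p]
        return fold(f, partials)

rdd = ToyRDD(list(range(1, 11)))
result = (rdd.map(lambda x: x * x)
             .filter(lambda x: x % 2 == 0)
             .reduce(lambda a, b: a + b))
# squares of 1..10, keep the evens: 4 + 16 + 36 + 64 + 100 = 220
```

The caller's code never mentions partitions; a real Spark job looks much the same while the engine handles distribution, scheduling, and fault recovery.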


Machine Learning

Spark includes MLlib, a library of algorithms for machine learning on data at scale. Machine learning models can be trained with R or Python on any Hadoop data source, saved using MLlib, and imported into a Java or Scala-based pipeline. Because Spark is a fast, interactive, memory-based computation engine, machine learning can run quickly. The algorithms include pattern mining and other data mining techniques.
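Why in-memory computation matters for machine learning can be shown with a tiny gradient-descent loop in pure Python (a conceptual sketch, not MLlib code): every iteration re-reads the same training data, so keeping that data in memory avoids repeated disk reads.

```python
# Fit y ≈ w * x by gradient descent on a toy dataset.
# Each iteration passes over the SAME training data, which is
# exactly the access pattern that in-memory caching accelerates.
data = [(x, 2.0 * x) for x in range(1, 6)]   # true slope is 2.0

w = 0.0           # model parameter, starting from zero
lr = 0.01         # learning rate
for _ in range(200):                          # 200 passes over one dataset
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad
# w converges close to the true slope of 2.0
```

If each of the 200 passes had to re-read the data from disk, as in MapReduce, the I/O cost would dominate; with the dataset cached, each pass is pure computation.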

Spark Streaming


Spark Streaming is a real-time solution that leverages Spark Core's fast scheduling capability for streaming analytics. It ingests data in mini-batches and analyzes it with the same application code written for batch analytics. Using the same code for batch processing and real-time streaming applications improves developer productivity. Spark Streaming supports data from a wide variety of sources, including Twitter, Kafka, Flume, HDFS, and ZeroMQ.
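The mini-batch idea can be sketched in pure Python (an illustration of the concept, not Spark Streaming code): the same batch function is applied to each small slice of the stream as it arrives, and the results are folded into a running total.

```python
def count_words(batch):
    """Batch logic: word counts for a list of lines.
    The same function works for a full batch job or one mini-batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(total, update):
    """Fold one mini-batch's result into the running totals."""
    for word, n in update.items():
        total[word] = total.get(word, 0) + n
    return total

# A "stream" arriving as small batches over time.
stream = [["spark streams data"], ["data arrives fast", "spark scales"]]

totals = {}
for mini_batch in stream:          # at each tick, process one mini-batch
    totals = merge(totals, count_words(mini_batch))
# totals["spark"] == 2, totals["data"] == 2
```

Because `count_words` has no idea whether it is processing a terabyte batch or a one-second slice, the batch and streaming paths share their logic, which is the productivity win the text describes.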


Spark SQL

Interactive Queries

Spark SQL is a distributed query engine that provides low-latency, interactive queries up to 100 times faster than MapReduce. It includes a cost-based optimizer, columnar storage, and code generation for fast queries. Business analysts can query data using either standard SQL or the Hive Query Language, while developers can use APIs available in Scala, Java, Python, and R. Data sources supported out-of-the-box include JDBC, ODBC, JSON, and HDFS. Other popular stores, such as Amazon Redshift, Amazon S3, Couchbase, Cassandra, MongoDB, Elasticsearch, and many more, are available through the Spark Packages ecosystem.
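The columnar-storage advantage mentioned above can be shown in pure Python (a sketch of the storage idea, not Spark SQL internals): when data is laid out one array per column, an analytic query only has to read the columns it references.

```python
# Row layout: every query touches whole records.
rows = [
    {"city": "Seattle", "year": 2017, "sales": 120},
    {"city": "Boston",  "year": 2017, "sales": 80},
    {"city": "Seattle", "year": 2018, "sales": 150},
]

# Columnar layout: one array per column. A query over two columns
# never has to read the third one at all.
columns = {
    "city":  [r["city"] for r in rows],
    "year":  [r["year"] for r in rows],
    "sales": [r["sales"] for r in rows],
}

# Equivalent of: SELECT sum(sales) WHERE city = 'Seattle'
# Only 2 of the 3 columns are scanned.
seattle_sales = sum(
    s for c, s in zip(columns["city"], columns["sales"]) if c == "Seattle"
)
# seattle_sales == 270
```

On wide tables with dozens of columns, skipping the unused columns cuts I/O dramatically, which is one reason columnar engines answer analytic queries with low latency.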


Graph Processing

Spark GraphX is a distributed graph processing framework built on top of Spark. GraphX provides ETL, exploratory analysis, and iterative graph computation, enabling users to interactively build and transform a graph data structure at scale. It comes with a flexible API and a selection of distributed graph algorithms.
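Iterative graph computation usually means repeating a local update at every vertex until nothing changes. A pure-Python sketch of that pattern (not GraphX code) is label propagation for connected components: each vertex repeatedly adopts the smallest label among itself and its neighbors.

```python
# Connected components by label propagation: iterate until stable,
# the same iterate-over-edges pattern distributed graph engines use.
edges = [(1, 2), (2, 3), (4, 5)]          # two components: {1,2,3} and {4,5}
vertices = {v for e in edges for v in e}
label = {v: v for v in vertices}          # each vertex starts with its own id

changed = True
while changed:                            # repeat until no label moves
    changed = False
    for a, b in edges:
        low = min(label[a], label[b])
        if label[a] != low or label[b] != low:
            label[a] = label[b] = low     # propagate the smaller label
            changed = True
# label == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

In a distributed engine, each pass over the edges runs in parallel across the cluster, with vertices exchanging labels as messages; the in-memory caching discussed earlier is what makes the many iterations cheap.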


Who is using Apache Spark?

According to surveys conducted in 2016, Spark is being used in production by more than 1,000 organizations, some of which are listed on the Powered By Spark page. Examples include:

The Yelp ad targeting team uses predictive models to determine how likely a user is to interact with an advertisement. With Apache Spark on Amazon EMR, Yelp was able to increase revenue and advertising click-through rates by processing large amounts of data for machine learning models.

Zillow owns and operates one of the largest online real estate websites. It uses machine learning algorithms from Spark on Amazon EMR to process large data sets in near real time to calculate Zestimates, a home valuation tool that gives buyers and sellers the estimated market value for a specific home.

In order to prevent data breaches, CrowdStrike provides endpoint protection. They use Amazon EMR and Spark to process hundreds of terabytes of event data and create higher-level descriptions of the hosts' behavior from it. CrowdStrike can gather event data and detect malicious activity using that information.

Customers of Hearst Corporation, a large media and information conglomerate, can access its content on more than 200 websites. With Apache Spark Streaming on Amazon EMR, Hearst's editorial staff is able to keep an eye on which articles are performing well and which themes are gaining momentum in real-time.

bigfinite uses AWS to store and analyze massive amounts of data from the pharmaceutical manufacturing industry. It uses Spark on Amazon EMR to run its proprietary algorithms, which are developed in Python and Scala.

GumGum, an advertising platform for in-image and in-screen advertising, uses Spark on Amazon EMR for inventory forecasting, clickstream log processing, and on-the-fly analysis of unstructured data in Amazon S3. Spark's performance improvements made these processes more efficient, saving GumGum both time and money.

Intent Media uses Spark and MLlib to train and deploy machine learning models at scale, helping travel companies increase revenue on their websites and mobile apps through data science.

By moving from on-premises SQL batch processes to Apache Spark in the cloud, FINRA aims to gain real-time insights into billions of time-ordered market events. Its testing of Apache Spark on Amazon EMR has bolstered FINRA's ability to protect investors and promote market integrity.

A few real-world applications for Apache Spark

Spark is a general-purpose distributed processing system used for large-scale data processing. It has been used across a wide range of big data use cases to detect patterns and provide real-time insight. Example use cases include:

Financial Services

Spark is used in banking to predict customer churn and recommend new financial products. In investment banking, Spark is used to analyze stock prices and predict future trends.


Healthcare

Spark gives front-line health workers access to patient data for every patient encounter. It can also be used to predict and recommend treatment plans for patients.


Manufacturing

Spark is used to minimize downtime of internet-connected equipment by recommending when preventive maintenance should be performed.



Retail

Spark is used to attract and retain customers through personalized services and offers tailored to their needs.

Using Apache Spark in a cloud environment

Using Spark in the cloud is a great way to benefit from the cloud's scalability and reliability, as well as significant cost savings. According to ESG research, 43% of respondents consider the cloud their primary deployment target for Spark. Customers favor Spark in the cloud for faster deployment times, better availability, more frequent feature and functionality updates, increased elasticity, broader geographic reach, and usage-based cost models.


Read more:

Apache Spark