Nitendra Gautam

Apache Spark Streaming

Apache Spark provides a unified engine that natively supports both batch and streaming workloads. Spark Streaming’s execution model is advantageous over traditional streaming systems for its fast recovery from failures, dynamic load balancing, s treaming and interactive analytics, and native integration.

Checkpointing in Spark

It is mechanism available in spark that enables us to save an RDD to a reliable storage system like S3 or HDFS so that spark does not forgets the RDD’s lineage completely.When RDD’s lineage dependencies becomes too long it is suggested to keep track of it using checkpoints. It is similar to the process where Hadoop stores intermediate results in disk so that it can recover from failures .Since checkpointing is on external file location it can be reused again by other applications.

Checkpointing in SPark Streaming

For a streaming application to operate continuously it mush be resilient to any kind of failures. A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault- tolerant storage system such that it can recover from failures.

There are two types of checkpointing available

  • Metadata Checkpointing

  • Data checkpointing

To summarize metadata checkpointing is primarily needed for recovery from driver failures, whereas data or RDD checkpointing is necessary even for basic functioning if stateful transformations are used.

Spark Checkpointing

Preventing Driver Failures in Spark

In order to have an automatic recover mechanism for driver failure ,different cluster manager provides different techniques

  • Spark Standalone - Use the -- supervise command while deploying the spark Job so that driver can be restarted upon failures
  • YARN - To make sure that Spark streaming runs continuously in Yarn Cluster ,increase the default value of two properties namely spark.yarn.maxAppAttempts and spark.yarn.max.executor.failures [from reference 2 &3]
  • Mesos - Marathon has been used to achieve this with Mesos.

Significance of Sliding Window operation

Sliding Window controls transmission of data packets between various computer networks. Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.

DStream in Spark

Discretized Stream is a sequence of Resilient Distributed Databases that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations – • Transformations that produce a new DStream. • Output operations that write data to an external system.

Different types of transformations on DStreams

  • Stateless Transformations- Processing of the batch does not depend on the output of the previous batch. Examples – map (), reduceByKey (), filter ().
  • Stateful Transformations- Processing of the batch depends on the intermediary results of the previous batch. Examples –Transformations that depend on sliding windows.


[1] Apache Spark Streaming

[2] Spark Streaming on Yarn in Production

[3] Spark Streaming on Yarn