Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training | Edureka
What to expect?
• What is Streaming?
• Spark Ecosystem
• Why Spark Streaming?
• Spark Streaming Overview
• DStreams
• DStream Transformations
• Caching/Persistence
• Accumulators, Broadcast Variables and Checkpoints
• Use Case – Twitter Sentiment Analysis
What is Streaming?
• Data Streaming is a technique for transferring data so that it can be processed as a steady and continuous stream.
• Streaming technologies are becoming increasingly important with the growth of the Internet.
“Without stream processing there’s no big data and no Internet of Things” – Dana Sandu, SQLstream
Figure: Streaming sources delivering live stream data to the user
Spark Ecosystem
The Spark ecosystem is built on the Spark Core Engine, with the following components on top:
• Spark Core Engine – the core engine for the entire Spark framework; provides utilities and architecture for the other components
• Spark SQL (SQL) – used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
• Spark Streaming (Streaming) – enables analytical and interactive apps for live streaming data
• MLlib (Machine Learning) – machine learning libraries built on top of Spark
• GraphX (Graph Computation) – graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
• SparkR (R on Spark) – package for the R language that enables R users to leverage Spark's power from the R shell
Why Spark Streaming?
We will be using Spark Streaming to perform Twitter Sentiment Analysis, which is used by companies around the world. We will explore this use case after we learn all the concepts of Spark Streaming.
Spark Streaming is used to stream real-time data from various sources like Twitter, stock markets and geographical systems, and to perform powerful analytics to help businesses.
Spark Streaming Overview
• Spark Streaming is used for processing real-time streaming data.
• It is a useful addition to the core Spark API.
• Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams.
• The fundamental stream unit is the DStream, which is essentially a series of RDDs used to process the real-time data.
Figure: Streams In Spark Streaming
Spark Streaming Features
• Scaling – scales to hundreds of nodes
• Speed – achieves low latency
• Fault Tolerance – efficiently recovers from failures
• Integration – integrates with batch and real-time processing
• Business Analysis – used to track the behaviour of customers
Spark Streaming Workflow
Figure: Overview of Spark Streaming – streaming data sources and static data sources feed Spark Streaming, which works with Spark SQL (SQL + DataFrames) and MLlib (Machine Learning) and writes results to data storage systems
Figure: Data from a variety of sources (Kafka, HDFS/S3, Flume, Twitter, Kinesis) is streamed to various storage and consumption systems (databases, HDFS, dashboards)
Figure: Incoming streams of data divided into batches – the input data stream enters the streaming engine as batches of input data and leaves as batches of processed data
Figure: Input data stream divided into discrete chunks of data – data from time 0 to 1, 1 to 2, 2 to 3 and 3 to 4 become the RDDs @ time 1, 2, 3 and 4 of the DStream
Figure: Extracting words from an InputStream
The flatMap operation applied to each batch of the lines DStream (data from time 0 to 1, 1 to 2, 2 to 3 and 3 to 4) produces the corresponding batches of the words DStream (words from time 0 to 1, 1 to 2, 2 to 3 and 3 to 4).
Streaming Fundamentals
The following is the flow of Spark Streaming fundamentals that we will discuss in the coming slides:
1. Streaming Context
2. DStream
   2.1 Input DStream
   2.2 DStream Transformations
   2.3 Output DStream
3. Caching
4. Accumulators, Broadcast Variables and Checkpoints
Streaming Context
StreamingContext:
• Consumes a stream of data in Spark.
• Registers an InputDStream to produce a Receiver object.
• It is the main entry point for Spark Streaming functionality.
• Spark provides a number of default implementations of sources, such as Twitter, Akka Actor and ZeroMQ, that are accessible from the context.
Figure: Spark Streaming Context – the input data stream is divided into batches of input data
Figure: Default implementation sources
Streaming Context – Initialization
• A StreamingContext object can be created from a SparkContext object.
• A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster.

import org.apache.spark._
import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(1))
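A StreamingContext can also be built directly from a SparkConf when there is no existing SparkContext. A minimal sketch, assuming a local run (the app name, master URL and 1-second batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]" keeps one thread for the receiver and at least one for processing
// when running locally; on a cluster this would be the cluster's master URL.
val conf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")

// Incoming data is grouped into batches every 1 second.
val ssc = new StreamingContext(conf, Seconds(1))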
DStream
• Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming.
• It is a continuous stream of data.
• It is either received from a source or produced by transforming another input stream.
• Internally, a DStream is represented by a continuous series of RDDs, and each RDD contains data from a certain interval.
Figure: Input data stream divided into discrete chunks of data – data from time 0 to 1, 1 to 2, 2 to 3 and 3 to 4 become the RDDs @ time 1, 2, 3 and 4 of the DStream
DStream Operation
• Any operation applied on a DStream translates to operations on the underlying RDDs.
• For example, when converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.
Figure: Extracting words from an InputStream – flatMap applied to each RDD of the lines DStream produces the corresponding RDD of the words DStream
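A minimal sketch of this lines-to-words example, assuming a StreamingContext ssc as created earlier (the socket host and port are placeholders):

val lines = ssc.socketTextStream("localhost", 9999)   // lines DStream: one record per line

// flatMap splits each line into words, so each input record can map to
// zero or more output records in the words DStream.
val words = lines.flatMap(_.split(" "))

words.print()            // output operation that triggers execution
ssc.start()
ssc.awaitTermination()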
Input DStreams
Input DStreams are DStreams representing the stream of input data received from streaming sources. They fall into two categories:
• Basic sources – file systems and socket connections
• Advanced sources – Kafka, Flume and Kinesis
Figure: Input data stream divided into discrete chunks of data
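A minimal sketch of creating basic-source input DStreams, assuming an existing StreamingContext ssc (the directory path, host and port are placeholders):

val fileStream   = ssc.textFileStream("hdfs:///data/incoming")   // file-system source
val socketStream = ssc.socketTextStream("localhost", 9999)       // socket source
// Advanced sources such as Kafka, Flume and Kinesis need extra dependencies
// (e.g. the spark-streaming-kafka artifact) and their own utility classes.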
Receiver
Every input DStream is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing.
Figure: The Receiver sends data onto the DStream, where each batch contains RDDs
Transformations on DStreams
Transformations allow the data from the input DStream to be modified, just as with RDDs. DStreams support many of the transformations available on normal Spark RDDs. The most popular Spark Streaming transformations are map, flatMap, filter, reduce and groupBy.
Figure: DStream Transformations – applying a transform to DStream 1 produces DStream 2
map(func)
Returns a new DStream by passing each element of the source DStream through a function func.
Figure: Input DStream being converted through map(func)
flatMap(func)
Similar to map(func), but each input item can be mapped to 0 or more output items; returns a new DStream by passing each source element through a function func.
Figure: Input DStream being converted through flatMap(func)
filter(func)
Returns a new DStream by selecting only the records of the source DStream on which func returns true.
Figure: Input DStream being converted through filter(func)
reduce(func)
Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func.
Figure: Input DStream being converted through reduce(func)
groupBy(func)
Returns a new grouped RDD for each batch, where each group is made up of a key and the corresponding list of items for that key (for example, grouping words by their first letter).
Figure: Input DStream being converted through groupBy(func)
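A minimal sketch of these transformations, assuming `lines` is an existing DStream[String] (for example, the socket stream sketched earlier):

val words    = lines.flatMap(_.split(" "))         // flatMap: each line yields 0 or more words
val upper    = words.map(_.toUpperCase)            // map: exactly one output per input element
val hashtags = words.filter(_.startsWith("#"))     // filter: keep only records where func returns true
val total    = words.map(_ => 1L).reduce(_ + _)    // reduce: one single-element RDD per batch
// groupBy itself is an RDD operation; transform applies it to each RDD of the DStream
val byInitial = words.transform(rdd => rdd.groupBy(_.take(1)))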
DStream Window
• Spark Streaming also provides windowed computations, which allow us to apply transformations over a sliding window of data.
• The following figure illustrates this sliding window:
Figure: DStream Window Transformation
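A minimal sketch of a windowed computation, assuming `words` is an existing DStream[String]; the 30-second window length and 10-second slide interval are illustrative and must both be multiples of the batch interval:

val pairs = words.map(word => (word, 1))

// Count each word over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()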
Output Operations on DStreams
• Output operations allow a DStream's data to be pushed out to external systems like databases or file systems.
• Output operations trigger the actual execution of all the DStream transformations.
Figure: Output Operations on DStreams – output operations push the transformed DStream to external systems such as file systems and databases
Currently, the following output operations are defined:
• print()
• saveAsTextFiles(prefix, [suffix])
• saveAsObjectFiles(prefix, [suffix])
• saveAsHadoopFiles(prefix, [suffix])
• foreachRDD(func)
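A minimal sketch of two of these operations, assuming `wordCounts` is an existing DStream (the output prefix and suffix are placeholders):

wordCounts.print()                                       // prints the first ten elements of every batch
wordCounts.saveAsTextFiles("hdfs:///out/counts", "txt")  // writes one directory per batch, named <prefix>-<time>.<suffix>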
Output Operations Example - foreachRDD
• dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems.
• Creating the connection lazily, once per partition, and reusing it across the records of that partition makes the transfer of data efficient.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    // Return the connection to the pool for future reuse
    ConnectionPool.returnConnection(connection)
  }
}
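As a note on the design choice in this snippet: the connection is created inside foreachPartition rather than once per record or on the driver, because connection objects are generally not serializable and opening one per record would be wasteful; drawing one connection per partition from a pool amortizes the setup cost across all records in that partition.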
Caching/Persistence
• DStreams allow developers to cache/persist the stream's data in memory. This is useful if the data in the DStream will be computed multiple times.
• This can be done using the persist() method on a DStream.
• For input streams that receive data over the network (such as Kafka, Flume, sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault tolerance.
Figure: Caching Into 2 Nodes
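A minimal sketch, assuming `words` is an existing DStream that is reused by more than one downstream computation (the explicit storage level is illustrative):

import org.apache.spark.storage.StorageLevel

words.persist()                                   // cache with the DStream default level
// An explicit level can also be given, e.g. serialized in memory with 2x replication:
// words.persist(StorageLevel.MEMORY_ONLY_SER_2)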
Accumulators
• Accumulators are variables that are only added to through an associative and commutative operation.
• They are used to implement counters or sums.
• Tracking accumulators in the Spark UI can be useful for understanding the progress of running stages.
• Spark natively supports numeric accumulators, and we can create named or unnamed accumulators.
Figure: Accumulators In Spark Streaming
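A minimal sketch of a named accumulator used inside a stream, assuming Spark 2.x (where SparkContext.longAccumulator is available), an existing StreamingContext ssc and a DStream[String] `words`:

val blankWords = ssc.sparkContext.longAccumulator("BlankWords")   // named, so it shows up in the UI

words.foreachRDD { rdd =>
  rdd.foreach { word =>
    if (word.isEmpty) blankWords.add(1L)                // tasks may only add to the accumulator
  }
  println(s"Blank words so far: ${blankWords.value}")   // the value is read back on the driver
}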
Broadcast Variables
• Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
• They can be used to give every node a copy of a large input dataset in an efficient manner.
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Figure: Broadcasting a value to executors
Figure: SparkContext and broadcasting – SparkContext.broadcast(value) asks the BroadcastManager for newBroadcast[T](value, isLocal) and registers the broadcast with the ContextCleaner for cleanup
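A minimal sketch, assuming an existing StreamingContext ssc and a DStream[String] `tweets`; the stop-word set is a stand-in for a larger read-only dataset:

val stopWords = ssc.sparkContext.broadcast(Set("a", "an", "the", "and"))   // cached once per executor

val interestingWords = tweets
  .flatMap(_.split(" "))
  .filter(word => !stopWords.value.contains(word.toLowerCase))   // read-only access via .value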
Checkpoints
Checkpoints are similar to checkpoints in gaming: they allow the streaming application to run 24/7 and make it resilient to failures unrelated to the application logic. There are two kinds of checkpoints:
• Metadata checkpoints – saving the information defining the streaming computation.
• Data checkpoints – saving the generated RDDs to reliable storage.
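A minimal sketch of enabling checkpointing, assuming an existing StreamingContext ssc (the HDFS directory is a placeholder); the directory must be on fault-tolerant storage:

ssc.checkpoint("hdfs:///checkpoints/twitter-app")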
Use Case – Twitter Sentiment Analysis
• Sentiment refers to the emotion behind a social media mention online.
• Sentiment Analysis is categorising the tweets related to a particular topic and performing data mining using sentiment automation analytics tools.
• We will be performing Twitter Sentiment Analysis as our use case for Spark Streaming.
Trending topics can be used to create campaigns and attract a larger audience, while sentiment analytics helps in crisis management, service adjusting and target marketing.
Figure: Facebook and Twitter trending topics
Use Case – Problem Statement
Problem Statement: To design a Twitter Sentiment Analysis system where we populate real-time sentiments for crisis management, service adjusting and target marketing.
Sentiment Analysis is used to:
• Predict the success of a movie
• Predict political campaign success
• Decide whether to invest in a certain company
• Target advertising
• Review products and services
Figure: Twitter Sentiment Analysis for Adidas
Figure: Twitter Sentiment Analysis for Nike
Use Case – Performing Sentiment Analysis
Use Case – Importing Packages
//Import the necessary packages into the Spark program
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
Use Case – Twitter Token Authorization
object mapr {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: TwitterPopularTags <consumer key> <consumer secret> " +
        "<access token> <access token secret> [<filters>]")
      System.exit(1)
    }
    StreamingExamples.setStreamingLogLevels()

    //Passing our Twitter keys and tokens as arguments for authorization
    val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = args.take(4)
    val filters = args.takeRight(args.length - 4)
Use Case – DStream Transformation
    // Set the system properties so that the Twitter4j library used by the
    // Twitter stream can use them to generate OAuth credentials
    System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
    System.setProperty("twitter4j.oauth.accessToken", accessToken)
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

    val sparkConf = new SparkConf().setAppName("Sentiments").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val stream = TwitterUtils.createStream(ssc, None, filters)

    //Input DStream transformation using flatMap to extract hashtags
    val tags = stream.flatMap { status => status.getHashtagEntities.map(_.getText) }
Use Case – Generating Tweet Data
    //RDD transformation using sortBy and then the map function
    tags.countByValue()
      .foreachRDD { rdd =>
        val now = org.joda.time.DateTime.now()
        rdd
          .sortBy(_._2)
          .map(x => (x, now))
          //Saving our output in the ~/twitter/ directory
          .saveAsTextFile(s"~/twitter/$now")
      }

    //DStream transformation using filter and map functions
    val tweets = stream.filter { t =>
      val tags = t.getText.split(" ").filter(_.startsWith("#")).map(_.toLowerCase)
      tags.exists { x => true }
    }
Use Case – Extracting Sentiments
    val data = tweets.map { status =>
      val sentiment = SentimentAnalysisUtils.detectSentiment(status.getText)
      val tagss = status.getHashtagEntities.map(_.getText.toLowerCase)
      (status.getText, sentiment.toString, tagss.toString())
    }

    data.print()
    //Saving our output at ~/ with filenames starting with "twitterss"
    data.saveAsTextFiles("~/twitterss", "20000")

    ssc.start()
    ssc.awaitTermination()
  }
}
Use Case - Results
Use Case – Output In Eclipse
Figure: Sentiment Analysis output in the Eclipse IDE, with tweets labelled Positive, Neutral or Negative
Use Case – Output Directory
Figure: Output folders inside our ‘twitter’ project directory – one output folder contains tags and their sentiments, another contains the Twitter profiles used for analysis
Use Case – Output Usernames
Figure: Output file containing Twitter username and timestamp
Use Case – Output Tweets and Sentiments
Figure: Output file containing each tweet and its sentiment (Positive, Neutral or Negative)
Use Case – Sentiment For Trump
Figure: Performing Sentiment Analysis on tweets containing the ‘Trump’ keyword, classified as Positive, Neutral or Negative
Use Case – Applying Sentiment Analysis
• As we have seen from our Sentiment Analysis demonstration, we can extract the sentiments of particular topics, just as we did for ‘Trump’.
• Hence, Sentiment Analytics can be used by companies around the world for crisis management, service adjusting and target marketing.
Companies using Spark Streaming for Sentiment Analysis have applied the same approach to achieve the following:
1. Enhancing the customer experience
2. Gaining competitive advantage
3. Gaining business intelligence
4. Revitalizing a losing brand
Summary
In this session we covered: Why Spark Streaming?, Spark Streaming Overview, Streaming Context, DStreams & Transformations, Caching/Persistence, and the Use Case – Twitter Sentiment Analysis.
Spark Streaming is used to stream real-time data from various sources like Twitter, stock markets and geographical systems, and to perform powerful analytics to help businesses.
Conclusion
Congrats! We have demonstrated the power of Spark Streaming in real-time data analytics. The hands-on examples should give you the confidence required to work on any future Apache Spark projects you encounter.
Thank You …
Questions/Queries/Feedback
Más contenido relacionado

La actualidad más candente

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemDatabricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 

La actualidad más candente (20)

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Apache spark
Apache sparkApache spark
Apache spark
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Data Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its EcosystemData Quality With or Without Apache Spark and Its Ecosystem
Data Quality With or Without Apache Spark and Its Ecosystem
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 

Similar a Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training | Edureka

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCMark Smith
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!Edureka!
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkEren Avşaroğulları
 
Solution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorSolution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorBlueData, Inc.
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics Franco Ucci
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3Databricks
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache sparkAmine Sagaama
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkDatabricks
 

Similar a Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training | Edureka (20)

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Jumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on DatabricksJumpstart on Apache Spark 2.2 on Databricks
Jumpstart on Apache Spark 2.2 on Databricks
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup Talk
 
Solution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline AcceleratorSolution Brief: Real-Time Pipeline Accelerator
Solution Brief: Real-Time Pipeline Accelerator
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
 

Más de Edureka!

What to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaWhat to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaEdureka!
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaEdureka!
 
Top 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaTop 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaEdureka!
 
Tableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaTableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaEdureka!
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaEdureka!
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaEdureka!
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaEdureka!
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaEdureka!
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaEdureka!
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaEdureka!
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | EdurekaEdureka!
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEdureka!
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEdureka!
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaEdureka!
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaEdureka!
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaEdureka!
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Edureka!
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaEdureka!
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaEdureka!
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | EdurekaEdureka!
 

Más de Edureka! (20)

What to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | EdurekaWhat to learn during the 21 days Lockdown | Edureka
What to learn during the 21 days Lockdown | Edureka
 
Top 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | EdurekaTop 10 Dying Programming Languages in 2020 | Edureka
Top 10 Dying Programming Languages in 2020 | Edureka
 
Top 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | EdurekaTop 5 Trending Business Intelligence Tools | Edureka
Top 5 Trending Business Intelligence Tools | Edureka
 
Tableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | EdurekaTableau Tutorial for Data Science | Edureka
Tableau Tutorial for Data Science | Edureka
 
Python Programming Tutorial | Edureka
Python Programming Tutorial | EdurekaPython Programming Tutorial | Edureka
Python Programming Tutorial | Edureka
 
Top 5 PMP Certifications | Edureka
Top 5 PMP Certifications | EdurekaTop 5 PMP Certifications | Edureka
Top 5 PMP Certifications | Edureka
 
Top Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | EdurekaTop Maven Interview Questions in 2020 | Edureka
Top Maven Interview Questions in 2020 | Edureka
 
Linux Mint Tutorial | Edureka
Linux Mint Tutorial | EdurekaLinux Mint Tutorial | Edureka
Linux Mint Tutorial | Edureka
 
How to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| EdurekaHow to Deploy Java Web App in AWS| Edureka
How to Deploy Java Web App in AWS| Edureka
 
Importance of Digital Marketing | Edureka
Importance of Digital Marketing | EdurekaImportance of Digital Marketing | Edureka
Importance of Digital Marketing | Edureka
 
RPA in 2020 | Edureka
RPA in 2020 | EdurekaRPA in 2020 | Edureka
RPA in 2020 | Edureka
 
Email Notifications in Jenkins | Edureka
Email Notifications in Jenkins | EdurekaEmail Notifications in Jenkins | Edureka
Email Notifications in Jenkins | Edureka
 
EA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | EdurekaEA Algorithm in Machine Learning | Edureka
EA Algorithm in Machine Learning | Edureka
 
Cognitive AI Tutorial | Edureka
Cognitive AI Tutorial | EdurekaCognitive AI Tutorial | Edureka
Cognitive AI Tutorial | Edureka
 
AWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | EdurekaAWS Cloud Practitioner Tutorial | Edureka
AWS Cloud Practitioner Tutorial | Edureka
 
Blue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | EdurekaBlue Prism Top Interview Questions | Edureka
Blue Prism Top Interview Questions | Edureka
 
Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka Big Data on AWS Tutorial | Edureka
Big Data on AWS Tutorial | Edureka
 
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | EdurekaA star algorithm | A* Algorithm in Artificial Intelligence | Edureka
A star algorithm | A* Algorithm in Artificial Intelligence | Edureka
 
Kubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | EdurekaKubernetes Installation on Ubuntu | Edureka
Kubernetes Installation on Ubuntu | Edureka
 
Introduction to DevOps | Edureka
Introduction to DevOps | EdurekaIntroduction to DevOps | Edureka
Introduction to DevOps | Edureka
 

Último

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Último (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training | Edureka

  • 2. www.edureka.co/apache-spark-scala-trainingEDUREKA SPARK CERTIFICATION TRAINING What to expect?  What is Streaming?  Spark Ecosystem  Why Spark Streaming?  Spark Streaming Overview  DStreams  DStream Transformations  Caching/ Persistence  Accumulators, Broadcast Variables and Checkpoints  Use Case – Twitter Sentiment Analysis
  • 4. www.edureka.co/apache-spark-scala-trainingEDUREKA SPARK CERTIFICATION TRAINING What is Streaming?  Data Streaming is a technique for transferring data so that it can be processed as a steady and continuous stream.  Streaming technologies are becoming increasingly important with the growth of the Internet. “Without stream processing there’s no big data and no Internet of Things” – Dana Sandu, SQLstream User Streaming Sources Live Stream Data
  • 6. www.edureka.co/apache-spark-scala-trainingEDUREKA SPARK CERTIFICATION TRAINING Used for structured data. Can run unmodified hive queries on existing Hadoop deployment Spark Core Engine Spark SQL (SQL) Spark Streaming (Streaming) MLlib (Machine Learning) GraphX (Graph Computation) SparkR (R on Spark) Enables analytical and interactive apps for live streaming data Package for R language to enable R-users to leverage Spark power from R shell Machine learning libraries being built on top of Spark The core engine for entire Spark framework. Provides utilities and architecture for other components Graph Computation engine (Similar to Giraph). Combines data- parallel and graph- parallel concepts Spark Ecosystem
  • 7. www.edureka.co/apache-spark-scala-trainingEDUREKA SPARK CERTIFICATION TRAINING Used for structured data. Can run unmodified hive queries on existing Hadoop deployment Spark Core Engine Spark SQL (SQL) Spark Streaming (Streaming) MLlib (Machine Learning) GraphX (Graph Computation) SparkR (R on Spark) Package for R language to enable R-users to leverage Spark power from R shell Machine learning libraries being built on top of Spark The core engine for entire Spark framework. Provides utilities and architecture for other components Graph Computation engine (Similar to Giraph). Combines data- parallel and graph- parallel concepts Enables analytical and interactive apps for live streaming data Spark Ecosystem
• 9. Why Spark Streaming? Spark Streaming is used to stream real-time data from various sources like Twitter, the stock market and geographical systems, and to perform powerful analytics that help businesses. We will be using Spark Streaming to perform Twitter Sentiment Analysis, which is used by companies around the world. We will explore it after we cover the concepts of Spark Streaming.
• 11. Spark Streaming Overview. Spark Streaming is used for processing real-time streaming data. It is a useful addition to the core Spark API. Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data. (Figure: Streams in Spark Streaming)
• 12. Spark Streaming Features. Speed – achieves low latency; Integration – integrates with batch and real-time processing; Business Analysis – used to track the behaviour of customers; Scaling – scales to hundreds of nodes; Fault Tolerance – efficiently recovers from failures.
• 13. Spark Streaming Workflow. (Figure: Overview of Spark Streaming – streaming and static data sources feed Spark Streaming, which works alongside Spark SQL (SQL + DataFrames) and MLlib (machine learning) and writes results to data storage systems)
• 14. Spark Streaming Workflow. (Figure: Data from a variety of sources – Kafka, Flume, HDFS/S3, Twitter, Kinesis – flows through Spark Streaming to various storage systems such as databases, HDFS and dashboards)
• 15. Spark Streaming Workflow. (Figure: The streaming engine receives an input data stream, divides it into batches of input data and emits batches of processed data)
• 16. Spark Streaming Workflow. (Figure: The input data stream is divided into discrete chunks – data from time 0 to 1 becomes the RDD @ time 1, data from time 1 to 2 becomes the RDD @ time 2, and so on – and together these RDDs form a DStream)
• 17. Spark Streaming Workflow. (Figure: Extracting words from an input stream – a flatMap operation turns each batch of the lines DStream into the corresponding batch of the words DStream)
• 19. Streaming Fundamentals. The flow of fundamentals we will discuss in the coming slides: 1. Streaming Context; 2. DStream (2.1 Input DStream, 2.2 DStream Transformations, 2.3 Output DStream); 3. Caching; 4. Accumulators, Broadcast Variables and Checkpoints.
• 20. Streaming Fundamentals – 1. Streaming Context
• 21. Streaming Context. A StreamingContext consumes a stream of data in Spark and registers an InputDStream to produce a Receiver object. It is the main entry point for Spark Streaming functionality. Spark provides a number of default implementations of sources like Twitter, Akka Actor and ZeroMQ that are accessible from the context. (Figure: Spark Streaming Context – sources feed the Streaming Context, which turns the input data stream into batches of input data)
• 22. Streaming Context – Initialization. A StreamingContext object can be created from an existing SparkContext object. A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster.
    import org.apache.spark._
    import org.apache.spark.streaming._

    val ssc = new StreamingContext(sc, Seconds(1))
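In a standalone application, where no sc is predefined as in the shell, the StreamingContext is usually built from a SparkConf instead. A minimal sketch, assuming a hypothetical application name, a local two-thread master and a 5-second batch interval chosen purely for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical app name and local master, for illustration only
    val conf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")

    // The batch interval controls how incoming data is grouped into RDDs (one per interval)
    val ssc = new StreamingContext(conf, Seconds(5))

    // ... define input DStreams, transformations and output operations here ...

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until the streaming computation is stopped

The batch interval is a latency/throughput trade-off: a smaller interval gives fresher results but adds scheduling overhead per batch.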
• 23. Streaming Fundamentals – 2. DStream
• 24. DStream. Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming. It is a continuous stream of data, either received from a source or produced by transforming another DStream. Internally, a DStream is represented by a continuous series of RDDs, and each RDD contains data from a certain interval. (Figure: Input data stream divided into discrete chunks of data)
• 25. DStream Operation. Any operation applied on a DStream translates to operations on the underlying RDDs. For example, when converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. (Figure: Extracting words from an InputStream)
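A minimal sketch of this lines-to-words pattern, assuming the StreamingContext created earlier and a text stream read from a socket at localhost:9999 (host and port are placeholders):

    // Each batch of this DStream is an RDD of the text lines received in that interval
    val lines = ssc.socketTextStream("localhost", 9999)

    // flatMap on the DStream is applied to every underlying RDD, one RDD per batch
    val words = lines.flatMap(line => line.split(" "))

    words.print()   // prints a sample of each batch's words on the driver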
• 26. Streaming Fundamentals – 2.1 Input DStream
• 27. Input DStreams. Input DStreams are DStreams representing the stream of input data received from streaming sources. Basic sources: file systems and socket connections. Advanced sources: Kafka, Flume and Kinesis. (Figure: Input data stream divided into discrete chunks of data)
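A short sketch of creating input DStreams from the two basic sources; the socket address and directory path below are placeholders:

    // Socket source: one element per line of text received over the TCP connection
    val socketLines = ssc.socketTextStream("localhost", 9999)

    // File source: processes new files as they appear in the monitored directory
    val fileLines = ssc.textFileStream("hdfs:///streaming/input")

The advanced sources are not part of the core API; they need the matching spark-streaming connector dependency and their own utility classes (for example KafkaUtils), so they are left out of this sketch.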
• 28. Receiver. Every input DStream is associated with a Receiver object, which receives the data from a source and stores it in Spark’s memory for processing. (Figure: The Receiver sends data onto the DStream, where each batch contains RDDs)
• 29. Streaming Fundamentals – 2.2 DStream Transformations
• 30. Transformations on DStreams. Transformations allow the data from the input DStream to be modified, similar to RDDs, and DStreams support many of the transformations available on normal Spark RDDs. The most popular Spark Streaming transformations covered here are map, flatMap, filter, reduce and groupBy. (Figure: DStream Transformations – DStream 1 is transformed into DStream 2)
• 31. Transformations on DStreams – map. map(func) returns a new DStream by passing each element of the source DStream through a function func. (Figures: Map function; input DStream being converted through map(func))
• 32. Transformations on DStreams – flatMap. flatMap(func) is similar to map(func), but each input item can be mapped to 0 or more output items; it returns a new DStream by passing each source element through a function func. (Figures: flatMap function; input DStream being converted through flatMap(func))
• 33. Transformations on DStreams – filter. filter(func) returns a new DStream by selecting only the records of the source DStream on which func returns true. (Figures: Filter function for even numbers; input DStream being converted through filter(func))
• 34. Transformations on DStreams – reduce. reduce(func) returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func. (Figures: Reduce function to get a cumulative sum; input DStream being converted through reduce(func))
• 35. Transformations on DStreams – groupBy. groupBy(func) returns a new RDD made up of keys, each paired with the list of items that fall into that group. (Figures: Grouping by first letters; input DStream being converted through groupBy(func)) A combined sketch of these transformations follows below.
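A combined sketch of these transformations on the words DStream from the earlier socket example. Since groupBy itself is an RDD operation rather than a DStream method, it is applied here through transform, which exposes the RDD behind each batch; the grouping key and filters are illustrative:

    val upperWords = words.map(_.toUpperCase)             // map: transform each element
    val nonEmpty   = upperWords.filter(_.nonEmpty)        // filter: keep only matching elements
    val wordCount  = nonEmpty.map(_ => 1L).reduce(_ + _)  // reduce: one aggregated value per batch

    // groupBy via transform: group each batch's words by their first letter
    val byFirstLetter = nonEmpty.transform(rdd => rdd.groupBy(word => word.head))

    wordCount.print()
    byFirstLetter.print()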
• 37. DStream Window. Spark Streaming also provides windowed computations, which allow us to apply transformations over a sliding window of data. (Figure: DStream Window Transformation)
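A sketch of a windowed word count built on the words DStream above; the 30-second window and 10-second slide are illustrative values, and both must be multiples of the batch interval:

    // Pair each word with a count of 1, then count occurrences over a sliding window
    val pairs = words.map(word => (word, 1))

    // Counts over the last 30 seconds of data, recomputed every 10 seconds
    val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()

    // window() alone returns a new DStream holding all elements seen in the window
    val last30Seconds = words.window(Seconds(30), Seconds(10))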
• 38. Streaming Fundamentals – 2.3 Output DStream
• 39. Output Operations on DStreams. Output operations allow a DStream’s data to be pushed out to external systems like databases or file systems. Output operations trigger the actual execution of all the DStream transformations. (Figure: Output operations push the transformed DStream to external systems such as file systems and databases)
• 40. Output Operations on DStreams. Currently, the following output operations are defined: print(), saveAsTextFiles(prefix, [suffix]), saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]) and foreachRDD(func).
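A brief sketch of the simpler output operations applied to the windowed counts above; the output prefix is a placeholder, and each batch is written under a path of the form prefix-batchTime.suffix:

    // Debug output: prints the first elements of every batch on the driver
    windowedCounts.print()

    // Persist each batch as text files under the given prefix (placeholder path)
    windowedCounts.saveAsTextFiles("hdfs:///streaming/output/counts", "txt")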
• 41. Output Operations Example – foreachRDD. dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. The supplied function runs on the driver for every batch, while the RDD actions inside it run on the workers, so connections should be created per partition rather than per record, and ideally reused from a pool, as in the example below.
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // ConnectionPool is a static, lazily initialized pool of connections
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach(record => connection.send(record))
        // Return to the pool for future reuse
        ConnectionPool.returnConnection(connection)
      }
    }
• 42. Streaming Fundamentals – 3. Caching
• 43. Caching/Persistence. DStreams allow developers to cache/persist the stream’s data in memory, which is useful if the data in the DStream will be computed multiple times. This can be done using the persist() method on a DStream. For input streams that receive data over the network (such as Kafka, Flume, sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault tolerance. (Figure: Caching into two nodes)
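A small sketch of explicit persistence, assuming the pairs DStream from the windowing example is reused by two different computations; the serialized in-memory storage level shown is just one reasonable choice:

    import org.apache.spark.storage.StorageLevel

    // Cache the pairs once so both windowed computations reuse the same data
    pairs.persist(StorageLevel.MEMORY_ONLY_SER)

    val countsPerMinute = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
    val countsPerWindow = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(10))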
• 44. Streaming Fundamentals – 4. Accumulators, Broadcast Variables and Checkpoints
• 45. Accumulators. Accumulators are variables that are only added to through an associative and commutative operation, and they are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators, and we can create named or unnamed accumulators. (Figure: Accumulators in Spark Streaming)
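A sketch of a named accumulator counting blank lines across batches, assuming the lines DStream from the earlier sketch and the Spark 2.x longAccumulator API; the "blank line" criterion is purely illustrative:

    // Driver side: a named accumulator shows up under its name in the Spark UI
    val blankLines = ssc.sparkContext.longAccumulator("blank lines")

    lines.foreachRDD { rdd =>
      // Executor side: tasks may only add to the accumulator
      rdd.foreach(line => if (line.trim.isEmpty) blankLines.add(1L))
      // Driver side: only the driver reads the accumulated value
      println(s"Blank lines so far: ${blankLines.value}")
    }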
• 46. Broadcast Variables. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner, and Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost. (Figures: Broadcasting a value to executors; SparkContext creates a broadcast through the BroadcastManager and registers it with the ContextCleaner for cleanup)
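A sketch of broadcasting a small read-only lookup set and using it inside a streaming transformation; the stop-word list is a made-up example:

    // Driver side: ship the read-only set to every executor once
    val stopWords = ssc.sparkContext.broadcast(Set("the", "a", "an", "and", "of"))

    // Executor side: tasks read the cached broadcast value instead of a per-task copy
    val meaningfulWords = words.filter(word => !stopWords.value.contains(word.toLowerCase))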
• 47. Checkpoints. Checkpoints are similar to checkpoints in gaming: they let a streaming application run 24/7 and make it resilient to failures unrelated to the application logic. Metadata checkpointing saves the information defining the streaming computation, while data checkpointing saves the generated RDDs to reliable storage.
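A sketch of enabling checkpointing together with StreamingContext.getOrCreate, so a restarted driver can recover the computation from the checkpoint; the checkpoint directory is a placeholder on reliable storage:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///streaming/checkpoint"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("CheckpointedApp").setMaster("local[2]")
      val context = new StreamingContext(conf, Seconds(5))
      context.checkpoint(checkpointDir)   // enable metadata checkpointing (and allow RDD checkpointing)
      // ... define input DStreams, transformations and output operations here ...
      context
    }

    // Reuse the checkpointed context if one exists, otherwise build a fresh one
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()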
• 48. Use Case – Twitter Sentiment Analysis
• 49. Use Case – Twitter Sentiment Analysis. Sentiment refers to the emotion behind a social media mention online. Sentiment Analysis is categorising the tweets related to a particular topic and performing data mining using sentiment automation analytics tools. Trending topics can be used to create campaigns and attract a larger audience, and sentiment analytics helps in crisis management, service adjusting and target marketing. We will be performing Twitter Sentiment Analysis as our use case for Spark Streaming. (Figure: Facebook and Twitter trending topics)
• 50. Use Case – Problem Statement. To design a Twitter Sentiment Analysis system where we populate real-time sentiments for crisis management, service adjusting and target marketing. Sentiment Analysis is used to: predict the success of a movie, predict political campaign success, decide whether to invest in a certain company, target advertising, and review products and services. (Figures: Twitter Sentiment Analysis for Adidas; Twitter Sentiment Analysis for Nike)
• 51. Use Case – Performing Sentiment Analysis
• 52. Use Case – Importing Packages
    //Import the necessary packages into the Spark program
    import org.apache.spark._
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter._
    import scala.io.Source
    import scala.collection.mutable.HashMap
    import java.io.File
• 53. Use Case – Twitter Token Authorization
    object mapr {
      def main(args: Array[String]) {
        if (args.length < 4) {
          System.err.println("Usage: TwitterPopularTags <consumer key> <consumer secret> " +
            "<access token> <access token secret> [<filters>]")
          System.exit(1)
        }
        StreamingExamples.setStreamingLogLevels()
        //Passing our Twitter keys and tokens as arguments for authorization
        val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = args.take(4)
        val filters = args.takeRight(args.length - 4)
• 54. Use Case – DStream Transformation
        // Set the system properties so that the Twitter4j library used by the Twitter stream
        // can use them to generate OAuth credentials
        System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
        System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
        System.setProperty("twitter4j.oauth.accessToken", accessToken)
        System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

        val sparkConf = new SparkConf().setAppName("Sentiments").setMaster("local[2]")
        val ssc = new StreamingContext(sparkConf, Seconds(5))
        val stream = TwitterUtils.createStream(ssc, None, filters)

        //Input DStream transformation using flatMap
        val tags = stream.flatMap { status => status.getHashtagEntities.map(_.getText) }
• 55. Use Case – Generating Tweet Data
        //RDD transformation using sortBy and then the map function
        tags.countByValue()
          .foreachRDD { rdd =>
            val now = org.joda.time.DateTime.now()
            rdd
              .sortBy(_._2)
              .map(x => (x, now))
              //Saving our output under the ~/twitter/ directory
              .saveAsTextFile(s"~/twitter/$now")
          }

        //DStream transformation using filter and map functions
        val tweets = stream.filter { t =>
          val tags = t.getText.split(" ").filter(_.startsWith("#")).map(_.toLowerCase)
          tags.exists { x => true }   // keeps only tweets that contain at least one hashtag
        }
• 56. Use Case – Extracting Sentiments
        val data = tweets.map { status =>
          val sentiment = SentimentAnalysisUtils.detectSentiment(status.getText)
          val tagss = status.getHashtagEntities.map(_.getText.toLowerCase)
          (status.getText, sentiment.toString, tagss.toString())
        }
        data.print()

        //Saving our output at ~/ with filenames starting with twitterss
        data.saveAsTextFiles("~/twitterss", "20000")

        ssc.start()
        ssc.awaitTermination()
      }
    }
• 58. Use Case – Output in Eclipse. (Figure: Sentiment Analysis output in the Eclipse IDE, with tweets classified as positive, neutral or negative)
• 59. Use Case – Output Directory. (Figure: Output folders inside our ‘twitter’ project folder – one folder contains tags and their sentiments, another contains the Twitter profiles used for analysis)
• 60. Use Case – Output Usernames. (Figure: Output file containing Twitter username and timestamp)
• 61. Use Case – Output Tweets and Sentiments. (Figure: Output file containing each tweet and its sentiment – positive, neutral or negative)
• 62. Use Case – Sentiment for Trump. (Figure: Performing Sentiment Analysis on tweets containing the ‘Trump’ keyword, classified as positive, neutral or negative)
• 63. Use Case – Applying Sentiment Analysis. As we have seen from our Sentiment Analysis demonstration, we can extract the sentiments of particular topics just as we did for ‘Trump’. Hence, sentiment analytics can be used for crisis management, service adjusting and target marketing by companies around the world. Companies using Spark Streaming for Sentiment Analysis have applied the same approach to: 1. enhance the customer experience; 2. gain competitive advantage; 3. gain business intelligence; 4. revitalize a losing brand.
• 65. Summary: Why Spark Streaming?; Spark Streaming Overview; Streaming Context; DStreams & Transformations; Caching/Persistence; Use Case – Twitter Sentiment Analysis. Spark Streaming is used to stream real-time data from various sources like Twitter, the stock market and geographical systems, and to perform powerful analytics that help businesses.
• 67. Conclusion. Congrats! We have demonstrated the power of Spark Streaming in real-time data analytics. The hands-on examples will give you the confidence required to work on any future Apache Spark projects you encounter.
• 68. Thank You … Questions/Queries/Feedback