© 2015 MapR Technologies 1
Overview of Apache Spark Streaming
© 2015 MapR Technologies 2
Agenda
• Why Apache Spark Streaming?
• What is Apache Spark Streaming?
– Key Concepts and Architecture
• How it works by Example
© 2015 MapR Technologies 3
Why Spark Streaming?
• Process Time Series data:
– Results in near-real-time
• Use Cases
– Social network trends
– Website statistics, monitoring
– Fraud detection
– Advertising click monetization
[Diagram: time-stamped data is continuously put into storage as data for real-time monitoring]
Time-stamped data:
• Sensor, system metrics, events, log files
• Stock ticker, user activity
• High volume, high velocity
© 2015 MapR Technologies 4
What is time series data?
• Stuff with timestamps
– Sensor data
– log files
– Phones…
Examples: credit card transactions, web user behaviour, social media, log files, geodata, sensors
© 2015 MapR Technologies 5
Why Spark Streaming?
What if you want to analyze data as it arrives?
For example, time series data: sensors, clicks, logs, stats
© 2015 MapR Technologies 6
Batch Processing
It's 6:01 and 72 degrees
It's 6:02 and 75 degrees
It's 6:03 and 77 degrees
It's 6:04 and 85 degrees
It's 6:05 and 90 degrees
It's 6:06 and 85 degrees
It's 6:07 and 77 degrees
It's 6:08 and 75 degrees
It was hot at 6:05 yesterday!
Batch processing may be too late for some events
© 2015 MapR Technologies 7
Event Processing
It's 6:05 and 90 degrees
Someone should open a window!
It's becoming important to process events as they arrive
© 2015 MapR Technologies 8
What is Spark Streaming?
• Extension of the core Spark API
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
[Diagram: data sources feed Spark Streaming, which writes results to data sinks]
© 2015 MapR Technologies 9
Stream Processing Architecture
[Diagram: streaming sources/apps → data ingest (MapR-FS topics) → stream processing → data storage (MapR-DB, MapR-FS) → apps]
© 2015 MapR Technologies 10
Key Concepts
• Data Sources:
– File based: HDFS
– Network based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor
• Transformations
• Output Operations
© 2015 MapR Technologies 11
Spark Streaming Architecture
• Divide the data stream into batches of X seconds
– Called a DStream: a sequence of RDDs
[Diagram: the input data stream enters Spark Streaming and is divided at each batch interval into DStream RDD batches: data from time 0 to 1 becomes RDD @ time 1, data from time 1 to 2 becomes RDD @ time 2, data from time 2 to 3 becomes RDD @ time 3]
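A minimal sketch of this batching model, assuming Spark 1.x-era APIs and an illustrative source directory:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("StreamingExample")
val ssc = new StreamingContext(sparkConf, Seconds(2)) // batch interval: 2 seconds

// each batch interval yields one RDD in the DStream
val linesDStream = ssc.textFileStream("/mapr/stream")
linesDStream.foreachRDD { (rdd, time) =>
  println(s"RDD @ $time has ${rdd.count()} lines")
}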
© 2015 MapR Technologies 12
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Read-only collection of elements
© 2015 MapR Technologies 13
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Read-only collection of elements
• operated on in parallel
• Cached in memory
– Or on disk
• Fault tolerant
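A small sketch of these properties, assuming an existing SparkContext sc:

val numbers = sc.parallelize(1 to 1000000) // read-only, distributed collection of elements
numbers.cache()                            // keep in memory after the first computation
val evens = numbers.filter(_ % 2 == 0)     // transformations run in parallel; lineage makes them fault tolerant
println(evens.count())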
© 2015 MapR Technologies 14
Working With RDDs
[Diagram: transformations produce new RDDs from RDDs; an action produces a value]

linesRDD = sc.textFile("SomeFile.txt")
linesWithErrorRDD = linesRDD.filter(lambda line: "ERROR" in line)

linesWithErrorRDD.count()   # 6
linesWithErrorRDD.first()   # Error line
© 2015 MapR Technologies 15
Process DStream
• Process using transformations
– Each transformation creates new RDDs
[Diagram: transformations such as map, reduceByKey, and count are applied to every RDD of the DStream: the RDD for time 0 to 1 is transformed into a new RDD @ time 1, and likewise for times 1 to 2 and 2 to 3]
© 2015 MapR Technologies 16
Key Concepts
• Data Sources
• Transformations: create new DStream
– Standard RDD operations: map, filter, union, reduce, join, …
– Stateful operations: updateStateByKey(function), countByValueAndWindow, … (see the sketch below)
• Output Operations
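A hedged sketch of one stateful operation, updateStateByKey, which keeps a running count per key across batches; it assumes the ssc and linesDStream from the earlier sketch, and the checkpoint path is illustrative:

ssc.checkpoint("/mapr/checkpoints") // stateful operations require checkpointing

// add each batch's new counts for a word to its previous running total
val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val runningCounts = linesDStream
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey(updateCount)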
© 2015 MapR Technologies 17
Spark Streaming Architecture
• Processed results are pushed out in batches
[Diagram: the input data stream is divided into DStream RDD batches (data from time 0 to 1 as RDD @ time 1, and so on); Spark processes each batch and pushes out batches of processed results]
© 2015 MapR Technologies 18
Key Concepts
• Data Sources
• Transformations
• Output Operations: trigger computation
– saveAsHadoopFiles: save to HDFS
– saveAsHadoopDataset: save to HBase
– saveAsTextFiles
– foreach: do anything with each batch of RDDs
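Continuing the running-count sketch above, a hedged example of two output operations (the output path is illustrative):

// one output directory is written per batch interval
runningCounts.saveAsTextFiles("/mapr/output/counts")

// foreachRDD gives full access to each batch as an RDD
runningCounts.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}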
© 2015 MapR Technologies 19
Learning Goals
• How it works by example
© 2015 MapR Technologies 20
Use Case: Time Series Data
[Diagram: oil pump sensor data is read by Spark Streaming and processed by Spark, producing data for real-time monitoring]
© 2015 MapR Technologies 21
Convert Line of CSV data to Sensor Object
case class Sensor(resid: String, date: String, time: String,
hz: Double, disp: Double, flo: Double, sedPPM: Double,
psi: Double, chlPPM: Double)
def parseSensor(str: String): Sensor = {
val p = str.split(",")
Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
p(6).toDouble, p(7).toDouble, p(8).toDouble)
}
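For example, applied to one (illustrative) line of the CSV feed:

val line = "COHUTTA,3/10/14,1:01,10.37,1.44,987.7,11.9,84.0,0.32"
val sensor = parseSensor(line)
// sensor.resid == "COHUTTA", sensor.hz == 10.37, sensor.psi == 84.0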
© 2015 MapR Technologies 22
Schema
• All events stored; the data CF could be set to expire data
• Filtered alerts put in the alerts CF
• Daily summaries put in the stats CF

Row key              | CF data (hz … psi) | CF alerts (psi …) | CF stats (hz_avg … psi_min)
COHUTTA_3/10/14_1:01 | 10.37 … 84         | 0                 |
COHUTTA_3/10/14      |                    |                   | 10 … 0
© 2015 MapR Technologies 23
Basic Steps for Spark Streaming code
These are the basic steps for Spark Streaming code:
1. Create a DStream, then:
   1. Apply transformations
   2. Apply output operations
2. Start receiving data and processing it
   – using streamingContext.start()
3. Wait for the processing to be stopped
   – using streamingContext.awaitTermination()
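Put together, a minimal sketch of these steps, assuming the sparkConf and parseSensor defined earlier (the directory path is illustrative):

val ssc = new StreamingContext(sparkConf, Seconds(2))

// 1. create a DStream, apply transformations and output operations
val linesDStream = ssc.textFileStream("/mapr/stream")
val sensorDStream = linesDStream.map(parseSensor)
sensorDStream.foreachRDD(rdd => rdd.take(5).foreach(println))

// 2. start receiving data and processing it
ssc.start()
// 3. wait for the processing to be stopped
ssc.awaitTermination()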
© 2015 MapR Technologies 24
Create a DStream
val ssc = new StreamingContext(sparkConf, Seconds(2))
val linesDStream = ssc.textFileStream("/mapr/stream")

DStream: a sequence of RDDs representing a stream of data
[Diagram: linesDStream holds one batch per interval (time 0-1, time 1-2, time 2-3), each stored in memory as an RDD]
© 2015 MapR Technologies 25
Process DStream
val linesDStream = ssc.textFileStream("directory path")
val sensorDStream = linesDStream.map(parseSensor)

[Diagram: map is applied to each batch (time 0-1, time 1-2, time 2-3) of linesDStream RDDs, creating new sensorDStream RDDs for every batch]
© 2015 MapR Technologies 26
Process DStream
// for each RDD
sensorDStream.foreachRDD { rdd =>
// filter sensor data for low psi
val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
. . .
}
© 2015 MapR Technologies 27
DataFrame and SQL Operations
// for each RDD: parse into a sensor object, filter
sensorDStream.foreachRDD { rdd =>
  . . .
  alertRdd.toDF().registerTempTable("alert")
  // join alert data with pump maintenance info
  val alertViewDF = sqlContext.sql(
    """select s.resid, s.psi, p.pumpType
       from alert s join pump p on s.resid = p.resid
       join maint m on p.resid = m.resid""")
  . . .
}
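The snippet above assumes a SQLContext and its implicits are in scope so that toDF() compiles; a hedged sketch of that setup in the Spark 1.x style:

import org.apache.spark.sql.SQLContext

sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._ // enables rdd.toDF()

  val alertRdd = rdd.filter(_.psi < 5.0)
  alertRdd.toDF().registerTempTable("alert")
  // ... run the sqlContext.sql(...) query shown above
}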
© 2015 MapR Technologies 28
Save to HBase
// for each RDD: parse into a sensor object, filter
sensorDStream.foreachRDD { rdd =>
  . . .
  // convert each alert to a Put object and write to the HBase alerts CF
  rdd.map(Sensor.convertToPutAlert)
    .saveAsHadoopDataset(jobConfig)
}
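convertToPutAlert itself is not shown in the deck; a hypothetical sketch that writes the psi reading to the alerts column family (the row-key format and column qualifier are assumptions based on the schema slide):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

def convertToPutAlert(sensor: Sensor): (ImmutableBytesWritable, Put) = {
  val rowkey = sensor.resid + "_" + sensor.date + "_" + sensor.time // assumed key format
  val put = new Put(Bytes.toBytes(rowkey))
  put.add(Bytes.toBytes("alerts"), Bytes.toBytes("psi"), Bytes.toBytes(sensor.psi))
  (new ImmutableBytesWritable(Bytes.toBytes(rowkey)), put)
}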
© 2015 MapR Technologies 29
Save to HBase
rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig)

Output operation: persist data to external storage
[Diagram: for each batch (time 0-1, time 1-2, time 2-3), map converts the sensorDStream RDDs to Put objects and save writes them to HBase]
© 2015 MapR Technologies 30
Start Receiving Data
sensorDStream.foreachRDD { rdd =>
. . .
}
// Start the computation
ssc.start()
// Wait for the computation to terminate
ssc.awaitTermination()
© 2015 MapR Technologies 31
Using HBase as a Source and Sink
[Diagram: a Spark application reads from and writes to an HBase database]
EXAMPLE: calculate and store summaries as a pre-computed, materialized view
© 2015 MapR Technologies 32
HBase Read and Write

val hBaseRDD = sc.newAPIHadoopRDD(
  conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

keyStatsRDD.map { case (k, v) =>
  convertToPut(k, v)
}.saveAsHadoopDataset(jobConfig)

[Diagram: newAPIHadoopRDD scans HBase and returns (row key, Result) pairs; saveAsHadoopDataset writes (key, Put) pairs back to HBase]
© 2015 MapR Technologies 33
Read HBase
// Load an RDD of (rowkey, Result) tuples from HBase table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
// get Result
val resultRDD = hBaseRDD.map(tuple => tuple._2)
// transform into an RDD of (RowKey, ColumnValue)s
val keyValueRDD = resultRDD.map(
result => (Bytes.toString(result.getRow()).split(" ")(0),
Bytes.toDouble(result.value)))
// group by row key, get statistics for the column value
val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list =>
StatCounter(list))
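The read assumes conf already names the table and scan; a hedged sketch of that setup (the table name and scanned column are illustrative):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "/user/sensor_table") // table to scan
conf.set(TableInputFormat.SCAN_COLUMNS, "data:psi")          // restrict the scan to one column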
© 2015 MapR Technologies 34
Write HBase
// save to HBase table CF data
val jobConfig: JobConf = new JobConf(conf, this.getClass)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// convert psi stats to Put objects and write to the HBase table stats column family
keyStatsRDD.map { case (k, v) =>
convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
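convertToPut is likewise not shown; a hypothetical sketch that writes a StatCounter's summaries to the stats column family (psi_min follows the schema slide; the other qualifiers are assumptions):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.util.StatCounter

def convertToPut(key: String, stats: StatCounter): (ImmutableBytesWritable, Put) = {
  val put = new Put(Bytes.toBytes(key))
  val statsCF = Bytes.toBytes("stats")
  put.add(statsCF, Bytes.toBytes("psi_min"), Bytes.toBytes(stats.min))
  put.add(statsCF, Bytes.toBytes("psi_max"), Bytes.toBytes(stats.max))   // assumed qualifier
  put.add(statsCF, Bytes.toBytes("psi_mean"), Bytes.toBytes(stats.mean)) // assumed qualifier
  (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
}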
© 2015 MapR Technologies 35
MapR Blog: Spark Streaming with HBase
• https://www.mapr.com/blog/spark-streaming-hbase
© 2015 MapR Technologies 36
Free HBase On Demand Training
(includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
© 2015 MapR Technologies 37
Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
© 2015 MapR Technologies 38
References
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
– http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark
© 2015 MapR Technologies 40
Q&A
Engage with us!
@mapr | maprtech | MapR | mapr-technologies
Más contenido relacionado

La actualidad más candente

NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Getting started with HBase
Getting started with HBaseGetting started with HBase
Getting started with HBaseCarol McDonald
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...DataWorks Summit/Hadoop Summit
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRyan Bosshart
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...DataWorks Summit
 

La actualidad más candente (20)

NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Getting started with HBase
Getting started with HBaseGetting started with HBase
Getting started with HBase
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
 

Destacado

Data-Ed Online Webinar: Metadata Strategies
Data-Ed Online Webinar: Metadata StrategiesData-Ed Online Webinar: Metadata Strategies
Data-Ed Online Webinar: Metadata StrategiesDATAVERSITY
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Edureka!
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...DataWorks Summit/Hadoop Summit
 
HBaseとSparkでセンサーデータを有効活用 #hbasejp
HBaseとSparkでセンサーデータを有効活用 #hbasejpHBaseとSparkでセンサーデータを有効活用 #hbasejp
HBaseとSparkでセンサーデータを有効活用 #hbasejpFwardNetwork
 
Apache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to UnderstandApache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to UnderstandJosh Elser
 
Apache HBase 入門 (第2回)
Apache HBase 入門 (第2回)Apache HBase 入門 (第2回)
Apache HBase 入門 (第2回)tatsuya6502
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtMichael Stack
 
Apache HBase 入門 (第1回)
Apache HBase 入門 (第1回)Apache HBase 入門 (第1回)
Apache HBase 入門 (第1回)tatsuya6502
 

Destacado (11)

Data-Ed Online Webinar: Metadata Strategies
Data-Ed Online Webinar: Metadata StrategiesData-Ed Online Webinar: Metadata Strategies
Data-Ed Online Webinar: Metadata Strategies
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
 
HBaseとSparkでセンサーデータを有効活用 #hbasejp
HBaseとSparkでセンサーデータを有効活用 #hbasejpHBaseとSparkでセンサーデータを有効活用 #hbasejp
HBaseとSparkでセンサーデータを有効活用 #hbasejp
 
Apache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to UnderstandApache HBase Internals you hoped you Never Needed to Understand
Apache HBase Internals you hoped you Never Needed to Understand
 
Apache HBase 入門 (第2回)
Apache HBase 入門 (第2回)Apache HBase 入門 (第2回)
Apache HBase 入門 (第2回)
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
 
Spark + HBase
Spark + HBase Spark + HBase
Spark + HBase
 
Apache HBase 入門 (第1回)
Apache HBase 入門 (第1回)Apache HBase 入門 (第1回)
Apache HBase 入門 (第1回)
 

Similar a Free Code Friday - Spark Streaming with HBase

TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Spark Streaming Data Pipelines
Spark Streaming Data PipelinesSpark Streaming Data Pipelines
Spark Streaming Data PipelinesMapR Technologies
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streamingTao Li
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Tugdual Grall
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Sri Ambati
 
Statsd introduction
Statsd introductionStatsd introduction
Statsd introductionRick Chang
 

Similar a Free Code Friday - Spark Streaming with HBase (20)

TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Spark Streaming Data Pipelines
Spark Streaming Data PipelinesSpark Streaming Data Pipelines
Spark Streaming Data Pipelines
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Data Science
Data ScienceData Science
Data Science
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and CapabilitiesMATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
Statsd introduction
Statsd introductionStatsd introduction
Statsd introduction
 

Más de MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

Más de MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Último

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Free Code Friday - Spark Streaming with HBase

  • 1. © 2015 MapR Technologies 1© 2014 MapR Technologies Overview of Apache Spark Streaming
  • 2. © 2015 MapR Technologies 2 Agenda • Why Apache Spark Streaming ? • What is Apache Spark Streaming? – Key Concepts and Architecture • How it works by Example
  • 3. © 2015 MapR Technologies 3 Why Spark Streaming? • Process Time Series data : – Results in near-real-time • Use Cases – Social network trends – Website statistics, monitoring – Fraud detection – Advertising click monetization put put put put Time stamped data data • Sensor, System Metrics, Events, log files • Stock Ticker, User Activity • Hi Volume, Velocity Data for real-time monitoring
  • 4. © 2015 MapR Technologies 4 What is time series data? • Stuff with timestamps – Sensor data – log files – Phones.. Credit Card Transactions Web user behaviour Social media Log files Geodata Sensors
  • 5. © 2015 MapR Technologies 5 Why Spark Streaming ? What If? • You want to analyze data as it arrives? For Example Time Series Data: Sensors, Clicks, Logs, Stats
  • 6. © 2015 MapR Technologies 6 Batch Processing It's 6:01 and 72 degrees It's 6:02 and 75 degrees It's 6:03 and 77 degrees It's 6:04 and 85 degrees It's 6:05 and 90 degrees It's 6:06 and 85 degrees It's 6:07 and 77 degrees It's 6:08 and 75 degrees It was hot at 6:05 yesterday! Batch processing may be too late for some events
  • 7. © 2015 MapR Technologies 7 Event Processing It's 6:05 and 90 degrees Someone should open a window! Streaming Its becoming important to process events as they arrive
  • 8. © 2015 MapR Technologies 8 What is Spark Streaming? • extension of the core Spark AP • enables scalable, high-throughput, fault-tolerant stream processing of live data Data Sources Data Sinks
  • 9. © 2015 MapR Technologies 9 Stream Processing Architecture Streaming Sources/Apps MapR-FS Data Ingest Topics MapR-DB Data Storage MapR-FS Apps Stream Processing
  • 10. © 2015 MapR Technologies 10 Key Concepts • Data Sources: – File Based: HDFS – Network Based: TCP sockets, Twitter, Kafka, Flume, ZeroMQ, Akka Actor • Transformations • Output Operations MapR-FS Topics
  • 11. © 2015 MapR Technologies 11 Spark Streaming Architecture • Divide data stream into batches of X seconds – Called DStream = sequence of RDDs Spark Streaming input data stream DStream RDD batches Batch interval data from time 0 to 1 data from time 1 to 2 RDD @ time 2 data from time 2 to 3 RDD @ time 3RDD @ time 1
  • 12. © 2015 MapR Technologies 12 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • read only collection of elements
  • 13. © 2015 MapR Technologies 13 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • read only collection of elements • operated on in parallel • Cached in memory – Or on disk • Fault tolerant
  • 14. © 2015 MapR Technologies 14 Working With RDDs RDD RDD RDD RDD Transformations Action Value linesWithErrorRDD.count() 6 linesWithErrorRDD.first() # Error line textFile = sc.textFile(”SomeFile.txt”) linesWithErrorRDD = linesRDD.filter(lambda line: “ERROR” in line)
  • 15. © 2015 MapR Technologies 15 Process DStream transform Transform map reduceByValue count DStream RDDs Dstream RDDs transformtransform • Process using transformations – creates new RDDs data from time 0 to 1 data from time 1 to 2 RDD @ time 2 data from time 2 to 3 RDD @ time 3RDD @ time 1 RDD @ time 1 RDD @ time 2 RDD @ time 3
  • 16. © 2015 MapR Technologies 16 Key Concepts • Data Sources • Transformations: create new DStream – Standard RDD operations: map, filter, union, reduce, join, … – Stateful operations: UpdateStateByKey(function), countByValueAndWindow, … • Output Operations
  • 17. © 2015 MapR Technologies 17 Spark Streaming Architecture • processed results are pushed out in batches Spark batches of processed results Spark Streaming input data stream DStream RDD batches data from time 0 to 1 data from time 1 to 2 RDD @ time 2 data from time 2 to 3 RDD @ time 3RDD @ time 1
  • 18. © 2015 MapR Technologies 18 Key Concepts • Data Sources • Transformations • Output Operations: trigger Computation – saveAsHadoopFiles – save to HDFS – saveAsHadoopDataset – save to Hbase – saveAsTextFiles – foreach – do anything with each batch of RDDs MapR-DB MapR-FS
  • 19. © 2015 MapR Technologies 19 Learning Goals • How it works by example
  • 20. © 2015 MapR Technologies 20 Use Case: Time Series Data Data for real-time monitoring read Spark Processing Spark Streaming Oil Pump Sensor data
  • 21. © 2015 MapR Technologies 21 Convert Line of CSV data to Sensor Object case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double, flo: Double, sedPPM: Double, psi: Double, chlPPM: Double) def parseSensor(str: String): Sensor = { val p = str.split(",") Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble, p(6).toDouble, p(7).toDouble, p(8).toDouble) }
  • 22. © 2015 MapR Technologies 22 Schema • All events stored, data CF could be set to expire data • Filtered alerts put in alerts CF • Daily summaries put in Stats CF Row key CF data CF alerts CF stats hz … psi psi … hz_avg … psi_min COHUTTA_3/10/14_1:01 10.37 84 0 COHUTTA_3/10/14 10 0
  • 23. © 2015 MapR Technologies 23 Basic Steps for Spark Streaming code These are the basic steps for Spark Streaming code: 1. create a Dstream 1. Apply transformations 2. Apply output operations 2. Start receiving data and processing it – using streamingContext.start(). 3. Wait for the processing to be stopped – using streamingContext.awaitTermination().
  • 24. © 2015 MapR Technologies 24 Create a DStream val ssc = new StreamingContext(sparkConf, Seconds(2)) val linesDStream = ssc.textFileStream(“/mapr/stream") batch time 0-1 linesDStream batch time 1-2 batch time 1-2 DStream: a sequence of RDDs representing a stream of data stored in memory as an RDD
  • 25. © 2015 MapR Technologies 25 Process DStream val linesDStream = ssc.textFileStream(”directory path") val sensorDStream = linesDStream.map(parseSensor) map new RDDs created for every batch batch time 0-1 linesDStream RDDs sensorDstream RDDs batch time 1-2 mapmap batch time 1-2
  • 26. © 2015 MapR Technologies 26 Process DStream // for Each RDD sensorDStream.foreachRDD { rdd => // filter sensor data for low psi val alertRDD = rdd.filter(sensor => sensor.psi < 5.0) . . . }
  • 27. © 2015 MapR Technologies 27 DataFrame and SQL Operations // for Each RDD parse into a sensor object filter sensorDStream.foreachRDD { rdd => . . . alertRdd.toDF().registerTempTable(”alert”) // join alert data with pump maintenance info val alertViewDF = sqlContext.sql( "select s.resid,s.psi, p.pumpType from alert s join pump p on s.resid = p.resid join maint m on p.resid=m.resid") . . . }
  • 28. © 2015 MapR Technologies 28 Save to HBase // for Each RDD parse into a sensor object filter sensorDStream.foreachRDD { rdd => . . . // convert alert to put object write to HBase alerts rdd.map(Sensor.convertToPutAlert) .saveAsHadoopDataset(jobConfig) }
  • 29. © 2015 MapR Technologies 29 Save to HBase rdd.map(Sensor.convertToPut).saveAsHadoopDataset(jobConfig) map Put objects written To HBase batch time 0-1 linesRDD DStream sensorRDD Dstream batch time 1-2 mapmap batch time 1-2 HBase save save save output operation: persist data to external storage
  • 30. © 2015 MapR Technologies 30 Start Receiving Data sensorDStream.foreachRDD { rdd => . . . } // Start the computation ssc.start() // Wait for the computation to terminate ssc.awaitTermination()
  • 31. © 2015 MapR Technologies 31 Using HBase as a Source and Sink read write Spark applicationHBase database EXAMPLE: calculate and store summaries, Pre-Computed, Materialized View
  • 32. © 2015 MapR Technologies 32 HBase HBase Read and Write val hBaseRDD = sc.newAPIHadoopRDD( conf,classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]) keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig) newAPIHadoopRDD Row key Result saveAsHadoopDataset Key Put HBase Scan Result
  • 33. © 2015 MapR Technologies 33 Read HBase
  // Load an RDD of (rowkey, Result) tuples from the HBase table
  val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
    classOf[org.apache.hadoop.hbase.client.Result])
  // get the Result
  val resultRDD = hBaseRDD.map(tuple => tuple._2)
  // transform into an RDD of (RowKey, ColumnValue)s
  val keyValueRDD = resultRDD.map(result =>
    (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toDouble(result.value)))
  // group by rowkey, get statistics for the column values
  val keyStatsRDD = keyValueRDD.groupByKey().mapValues(list => StatCounter(list))
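  StatCounter exposes count, mean, min, max, and more per key. A small sketch of inspecting the results at the driver; the output format is illustrative:

    // collect to the driver and print summary statistics per row key prefix
    keyStatsRDD.collect().foreach { case (key, stats) =>
      println(s"$key -> count=${stats.count}, mean=${stats.mean}, min=${stats.min}, max=${stats.max}")
    }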
  • 34. © 2015 MapR Technologies 34 Write HBase
  // job configuration for writing to the HBase table
  val jobConfig: JobConf = new JobConf(conf, this.getClass)
  jobConfig.setOutputFormat(classOf[TableOutputFormat])
  jobConfig.set(TableOutputFormat.OUTPUT_TABLE, tableName)
  // convert psi stats to Puts and write to the HBase table stats column family
  keyStatsRDD.map { case (k, v) => convertToPut(k, v) }.saveAsHadoopDataset(jobConfig)
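  convertToPut itself is not shown on the slide. A plausible sketch consistent with the stats column family in the schema; the qualifier names and the (ImmutableBytesWritable, Put) pair shape for TableOutputFormat are assumptions:

    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.util.StatCounter

    // Hypothetical converter: (row key, psi StatCounter) -> (key, Put) for CF stats.
    def convertToPut(key: String, stats: StatCounter): (ImmutableBytesWritable, Put) = {
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("stats"), Bytes.toBytes("psi_min"), Bytes.toBytes(stats.min))
      put.add(Bytes.toBytes("stats"), Bytes.toBytes("psi_max"), Bytes.toBytes(stats.max))
      (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
    }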
  • 35. © 2015 MapR Technologies 35 MapR Blog: Spark Streaming with HBase • https://www.mapr.com/blog/spark-streaming-hbase
  • 36. © 2015 MapR Technologies 36 Free HBase On Demand Training (includes Hive and MapReduce with HBase) • https://www.mapr.com/services/mapr-academy/big-data-hadoop- online-training
  • 37. © 2015 MapR Technologies 37 Soon to Come • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/
  • 38. © 2015 MapR Technologies 38 References • Spark web site: http://spark.apache.org/ • https://databricks.com/ • Spark on MapR: – http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark - O'Reilly Book • Apache Spark
  • 39. © 2015 MapR Technologies 39 References (same as slide 38)
  • 40. © 2015 MapR Technologies 40 Q&A @mapr maprtech Engage with us! MapR maprtech mapr-technologies

Editor's notes

  1. Spark is really cool… This is NOT a silver bullet… This is another GREAT tool to have… Stop using hammers to put in screws.
  2. A significant amount of data needs to be processed quickly, in near real time, to gain insights. Common examples include activity stream data from a web or mobile application, time-stamped log data, transactional data, and event streams from sensor or device networks. The stream-based approach applies processing to the data stream as it is generated, which allows near real-time processing. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data. It is for use cases which require a significant amount of data to be processed as soon as it arrives. Example real-time use cases are: website monitoring, network monitoring, fraud detection, web clicks, advertising, and the Internet of Things (sensors).
  3. Having data in log files is not good for real-time processing.
  4. It's becoming important to process events as they arrive.
  5. Spark Streaming brings Spark's API to stream processing, letting you write streaming jobs the same way you write batch jobs. Spark Streaming supports data sources such as HDFS directories, TCP sockets, Twitter, message queues, and distributed stream and log transfer frameworks such as Flume, Kafka, and Amazon Kinesis. Data streams can be processed with Spark's core APIs, DataFrames, SQL, or machine learning APIs, and can be persisted to a filesystem (HDFS, MapR-FS), MapR-DB, HBase, or any data source offering a Hadoop OutputFormat or Spark connector. DStreams can be created from various input sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of operations: transformations, which yield a new DStream, and output operations, which write data to an external system.
  6. A data stream is a continuous sequence of records or events. Common examples include activity stream data from a web or mobile application, time-stamped log data, transactional data, and event streams from sensor or device networks. The stream-based approach applies processing to the data stream as it is generated, which allows near real-time processing. A stream processing architecture is typically made up of the following components: data sources, the origin of the data streams (examples are a sensor network, a mobile application, a web client, a log from a server, or even a thing from the Internet of Things); a message bus (messaging systems such as Kafka, Flume, ActiveMQ, RabbitMQ, etc.); a stream processing system, the framework for processing the data streams; a NoSQL store for the processed data, which must be capable of fast reads and writes (HBase, Cassandra, and MongoDB are popular choices); and end applications, the dashboards and other applications which use the processed data. DStreams can be created from various input sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of operations: transformations, which yield a new DStream, and output operations, which write data to an external system.
  7. Spark Streaming supports various input sources, including file-based sources and network-based sources such as socket-based sources, the Twitter API stream, Akka actors, or message queues and distributed stream and log transfer frameworks such as Flume, Kafka, and Amazon Kinesis. Spark Streaming provides a set of transformations available on DStreams; these transformations are similar to those available on RDDs and include map, flatMap, filter, join, and reduceByKey. Streaming also provides operators such as reduce and count, which return a DStream made up of a single element. Unlike the reduce and count operators on RDDs, these do not trigger computation on DStreams; they are not actions, they return another DStream. Stateful transformations maintain state across batches: they use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time. updateStateByKey() is used to track state across events for each key; for example, this could be used to maintain stats (state) by key, as in accessLogDStream.reduceByKey(SUM_REDUCER). Actions are output operators that, when invoked, trigger computation on the DStream: print prints the first 10 elements of each batch to the console and is typically used for debugging and testing; saveAsObjectFiles, saveAsTextFiles, and saveAsHadoopFiles output each batch to a Hadoop-compatible filesystem; foreachRDD applies arbitrary processing to the RDDs within each batch of a DStream.
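  A hedged sketch of updateStateByKey maintaining a running count per key; keyedDStream (a DStream of (String, Int) pairs) and the checkpoint path are illustrative, and stateful transformations require a checkpoint directory:

    ssc.checkpoint("/tmp/checkpoint")   // required for stateful transformations

    // Hypothetical keyedDStream: DStream[(String, Int)] of (key, increment) pairs.
    val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    val runningCounts = keyedDStream.updateStateByKey[Int](updateCount)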
  8. Streaming data is continuous and needs to be batched for processing. Spark Streaming divides the data stream into batches of X seconds called DStreams (discretized streams). Internally, each DStream is represented as a sequence of RDDs, one arriving at each time step: the stream is broken up into time periods equal to the batch interval, and each RDD in the stream contains the records received by the Spark Streaming application during that batch interval.
  9. Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
  10. An RDD is simply a distributed collection of elements. You can think of a distributed collection like an array or list in your single-machine program, except that it's spread out across multiple nodes in the cluster. In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them. So Spark gives you APIs and functions that let you do something on the whole collection in parallel using all the nodes.
  11. Actions return a value to the driver program after running a computation on the dataset. The chain of transformations from RDD1 to RDDn is logged and can be repeated in the event of data loss or the failure of a cluster node.
  12. Your Spark application processes the DStream's RDDs using Spark transformations like map, reduce, and join, which create new RDDs. Any operation applied on a DStream translates to operations on the underlying RDDs, which in turn apply the transformation to the elements of each RDD. Streaming data is continuous and needs to be batched for processing: Spark Streaming divides the data stream into batches of X seconds called DStreams (discretized streams), each a sequence of RDDs. The Spark application processes the RDDs using the Spark APIs, and the processed results of the RDD operations are returned in batches.
  13. (Same as note 7: input sources, DStream transformations, stateful transformations, and output operators.)
  14. Streaming data is continuous and needs to be batched for processing. Spark Streaming divides the data stream into batches of X seconds called DStreams (discretized streams), each a sequence of RDDs. The Spark application processes the RDDs using the Spark APIs, and the processed results of the RDD operations are returned in batches. Output operations are similar to RDD actions in that they write data to an external system, but in Spark Streaming they run periodically on each time step, producing output in batches.
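  A small sketch of the common output operations mentioned here, using the sensorDStream from the slides; the output path prefix is a placeholder:

    sensorDStream.print()                              // prints the first 10 elements of each batch, for debugging
    sensorDStream.saveAsTextFiles("/mapr/out/sensor")  // writes one set of text files per batch interval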
  15. (Same as note 7.)
  16. These are the basic steps for Spark Streaming code: Initialize a Spark StreamingContext object. Apply transformations and output operations to DStreams. Start receiving data and processing it using streamingContext.start(). Wait for the processing to be stopped using streamingContext.awaitTermination().
  17. We want to store every single event as it comes in, in HBase. We also want to filter for and store alarms. Daily Spark processing will store aggregated summary statistics. The Spark Streaming example does the following: reads streaming data, processes it, and writes the processed data to an HBase table. The (non-streaming) Spark code does the following: reads the HBase table data written by the streaming code, calculates daily summary statistics, and writes the summary statistics to the HBase table's stats column family.
  18. The oil pump sensor data comes in as comma-separated value (CSV) files dropped in a directory. Spark Streaming will monitor the directory and process any files created in it. (As stated before, Spark Streaming supports different streaming data sources; for simplicity, this example uses files.) We use a Scala case class to define the Sensor schema corresponding to the sensor data CSV files, and a parseSensor function to parse the comma-separated values into the Sensor case class.
  19. The HBase table schema for the streaming data is as follows: a composite row key of the pump name, date, and timestamp; a column family data with columns corresponding to the input data fields; and a column family alerts with columns corresponding to any filters for alarming values. Note that the data and alerts column families could be set to expire values after a certain amount of time. The schema for the daily statistics summary rollups is as follows: a composite row key of the pump name and date, and a column family stats.
  20. These are the basic steps for Spark Streaming code: initialize a Spark StreamingContext object; using this context, create a DStream which represents streaming data from a source; apply transformations and/or output operations to the DStreams; start receiving data and processing it using streamingContext.start(); wait for the processing to be stopped using streamingContext.awaitTermination(). We will go through these steps showing code from our use case example. Spark Streaming programs are best run as standalone applications built using Maven or sbt.
  21. First we create a StreamingContext, the main entry point for streaming functionality, with a 2-second batch interval. Using this context, we can create a DStream that represents streaming data from a source. In this example we use the StreamingContext textFileStream(directory) method to create an input stream that monitors a Hadoop-compatible file system for new files and processes any files created in that directory. This ingestion type supports a workflow where new files are written to a landing directory and Spark Streaming is used to detect them, ingest them, and process the data. Only use this ingestion type with files that are moved or copied into a directory. The linesDStream represents the stream of data; each record is a line of text. Internally a DStream is a sequence of RDDs, one RDD per batch interval.
  22. Next we parse the lines of data into Sensor objects with the map operation on the linesDStream. The map operation applies the Sensor.parseSensor function to the RDDs in the linesDStream, resulting in RDDs of Sensor objects. Any operation applied on a DStream translates to operations on the underlying RDDs: the map operation is applied to each RDD in the linesDStream to generate the RDDs of the sensorDStream.
  23. Next we use the DStream foreachRDD method to apply processing to each RDD in this DStream. We filter the sensor objects for low psi to create an RDD of alert sensor objects.
  24. Here we join the alert data with pump vendor and maintenance information (the pump vendor and maintenance information was read in and cached before streaming). Each RDD is converted to a DataFrame, registered as a temporary table, and then queried using SQL: rdd.toDF().registerTempTable("sensor")
  25. Next we use the DStream foreachRDD method to apply processing to each RDD in this DStream. We filter the sensor objects for low psi to create alerts, then we write the sensor and alert data to HBase by converting them to Put objects and using the PairRDDFunctions saveAsHadoopDataset method, which outputs the RDD to any Hadoop-supported storage system using a Hadoop Configuration object for that storage system (see the Hadoop configuration for HBase above).
  26. The sensorRDD objects are converted to Put objects and then written to HBase.
  27. To start receiving data, we must explicitly call start() on the StreamingContext, then call awaitTermination to wait for the streaming computation to finish.
  28. The last way to leverage MapReduce with HBase is to use HBase as both a source and a sink in our data flow. One example of this use case is to calculate summaries across the HBase data and then store those summaries back in HBase. For summary jobs where HBase is used as a source and a sink, the writes come from the Reducer step.
  29. This shows the input and output for the TableMapper class map method and the TableReducer class reduce method when reading from and writing to HBase. A scan Result object and row key are sent to the Mapper's map method one row at a time; one or more key-value pairs are output by the map method. The Reducer's reduce method receives a key and an iterable list of corresponding values, and outputs a Put object.
  30. We register the DataFrame as a table. Registering it as a table allows us to use it in subsequent SQL statements. Now we can inspect the data.
  31. (Same as note 30.)
  32. This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox. The new Spark DataFrames API is designed to make big data processing on tabular data easier. A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.