Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. It is particularly useful when data needs to be processed in real-time. Carol McDonald, HBase Hadoop Instructor at MapR, will cover:
+ What is Spark Streaming and what is it used for?
+ How does Spark Streaming work?
+ Example code to read, process, and write the processed data
Spark is really cool… This is NOT a silver bullet… This is another GREAT tool to have… Stop using hammers for putting in screws.
A significant amount of data needs to be processed quickly, in near real time, to gain insights. Common examples include activity stream data from a web or mobile application, time-stamped log data, transactional data, and event streams from sensor or device networks.
The stream-based approach applies processing to the data stream as it is generated, which allows near real-time processing.
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data. Spark Streaming is for use cases that require a significant amount of data to be processed quickly, as soon as it arrives. Example real-time use cases are:
+ Website monitoring, network monitoring
+ Fraud detection
+ Web clicks
+ Advertising
+ Internet of Things: sensors
Having data in log files is not good for real-time processing.
It's becoming important to process events as they arrive.
Spark Streaming brings Spark's API to stream processing, letting you write streaming jobs the same way you write batch jobs.
Spark Streaming supports data sources such as HDFS directories, TCP sockets, Twitter, message queues, and distributed stream and log transfer frameworks such as Flume, Kafka, and Amazon Kinesis.
Data streams can be processed with Spark's core APIs, DataFrames and SQL, or machine learning APIs, and can be persisted to a filesystem, HDFS, MapR-FS, MapR-DB, HBase, or any data source offering a Hadoop OutputFormat or Spark connector.
DStreams can be created from various input sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of operations: transformations, which yield a new DStream, and output operations, which write data to an external system.
A data stream is a continuous sequence of records or events. Common examples include activity stream data from a web or mobile application, time-stamped log data, transactional data, and event streams from sensor or device networks. The stream-based approach applies processing to the data stream as it is generated, which allows near real-time processing.
A stream processing architecture is typically made up of the following components:
Data sources – the source of the data streams. Examples are a sensor network, a mobile application, a web client, a server log, or even a thing from the Internet of Things.
Message bus – messaging systems such as Kafka, Flume, ActiveMQ, and RabbitMQ.
Stream processing system – a framework for processing the data streams.
NoSQL store – stores the processed data and must be capable of fast reads and writes. HBase, Cassandra, and MongoDB are popular choices.
End applications – dashboards and other applications that use the processed data.
Spark Streaming supports various input sources, including file-based sources and network-based sources such as socket-based sources, the Twitter API stream, Akka actors, or message queues and distributed stream and log transfer frameworks such as Flume, Kafka, and Amazon Kinesis.
Spark Streaming provides a set of transformations available on DStreams; these transformations are similar to those available on RDDs and include map, flatMap, filter, join, and reduceByKey. Streaming also provides operators such as reduce and count, which return a DStream made up of a single element per batch. Unlike the reduce and count operators on RDDs, these do not trigger computation on DStreams; they are not actions, they return another DStream.
Stateful transformations maintain state across batches: they use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time. updateStateByKey() is used to track state across events for each key; for example, it could maintain running statistics (state) per key on top of a per-batch aggregation such as accessLogDStream.reduceByKey(SUM_REDUCER).
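As an illustration, here is a minimal sketch (not from the original example) of a running count kept per key across batches; pairDStream is an assumed DStream[(String, Int)], and ssc and the checkpoint path are placeholders:

// updateStateByKey requires a checkpoint directory for the state
ssc.checkpoint("/tmp/checkpoints")

// running total per key across all batches seen so far
def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newValues.sum)

val runningCounts = pairDStream.updateStateByKey[Int](updateCount _)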
Actions are output operators that, when invoked, trigger computation on the DStream. They are as follows:
print: This prints the first 10 elements of each batch to the console and is typically used for debugging and testing.
saveAsObjectFiles, saveAsTextFiles, and saveAsHadoopFiles: These functions output each batch to a Hadoop-compatible filesystem.
foreachRDD: This operator allows arbitrary processing to be applied to the RDDs within each batch of a DStream.
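As a rough sketch, assuming a wordCounts DStream of (word, count) pairs, the output operators look like this:

wordCounts.print()                      // prints the first 10 elements of each batch
wordCounts.saveAsTextFiles("counts")    // writes one output directory per batch interval
wordCounts.foreachRDD { rdd =>
  // arbitrary processing on the RDD for each batch,
  // e.g. pushing records to an external system
  rdd.take(5).foreach(println)
}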
Streaming data is continuous and needs to be batched to be processed. Spark Streaming divides the data stream into batches of X seconds called DStreams (discretized streams). Internally, each DStream is represented as a sequence of RDDs arriving at each time step.
A DStream is a sequence of mini-batches, where each mini-batch is represented as a Spark RDD. The stream is broken up into time periods equal to the batch interval.
Each RDD in the stream will contain the records that are received by the Spark Streaming application during a given time window called the batch interval.
Spark Streaming receives data from various input sources and groups it into small batches over a regular time interval called the batch interval. Each input batch forms an RDD.
A DStream is a sequence of RDDs, where each RDD has one time slice of the data in the stream.
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
An RDD is simply a distributed collection of elements. You can think of a distributed collection like an array or list in a single-machine program, except that it's spread out across multiple nodes in the cluster.
In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
So, Spark gives you APIs and functions that let you operate on the whole collection in parallel using all the nodes.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
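A minimal sketch, assuming an existing SparkContext sc and a placeholder file path:

val lines = sc.textFile("data.txt")        // create an RDD from a file
val lengths = lines.map(_.length)          // transformation: returns a new RDD, evaluated lazily
val totalLength = lengths.reduce(_ + _)    // action: triggers computation, returns a value to the driver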
The chain of transformations from RDD1 to RDDn is logged and can be repeated in the event of data loss or the failure of a cluster node.
Your Spark application processes the DStream RDDs using Spark transformations like map, reduce, and join, which create new RDDs.
Any operation applied on a DStream translates to operations on the underlying RDDs, which, in turn, apply the transformation to the elements of each RDD.
Streaming data is continuous and needs to be batched to be processed.
Spark Streaming divides the data stream into batches of X seconds called DStreams (discretized streams), which are sequences of RDDs.
The Spark application processes the RDDs using the Spark APIs.
Finally, the processed results of the RDD operations are returned in batches.
Output operations are similar to RDD actions in that they write data to an external system, but in Spark Streaming they run periodically on each time step, producing output in batches.
We want to store every single event as it comes in, in an HBase table. We also want to filter for and store alarms. Daily Spark processing will store aggregated summary statistics.
The Spark Streaming example does the following:
Reads streaming data.
Processes the streaming data.
Writes the processed data to an HBase Table.
(Non-streaming) Spark code does the following:
Reads the HBase table data written by the streaming code.
Calculates daily summary statistics.
Writes the summary statistics to the HBase table column family stats.
The oil pump sensor data comes in as comma-separated value (CSV) files dropped in a directory. Spark Streaming will monitor the directory and process any files created in that directory. (As stated before, Spark Streaming supports different streaming data sources; for simplicity, this example will use files.) Below is an example of the CSV file with some sample data:
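(The original sample is not reproduced here; the rows below are purely illustrative, using assumed fields of pump name, date, time, and sensor readings such as psi:)

pump1,3/14/14,1:01:00,4.5,60.2,40.1,4.8,254.1,1.0
pump2,3/14/14,1:01:00,5.0,30.3,20.0,3.9,240.5,0.8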
We use a Scala case class to define the Sensor schema corresponding to the sensor data CSV files, and a parseSensor function to parse the comma-separated values into the Sensor case class.
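A minimal sketch of what this can look like (the field names are assumptions based on the sensor data described here, not necessarily the original schema):

case class Sensor(resid: String, date: String, time: String, hz: Double, disp: Double,
                  flo: Double, sedPPM: Double, psi: Double, chlPPM: Double)

object Sensor {
  // parse one comma-separated line into a Sensor object
  def parseSensor(str: String): Sensor = {
    val p = str.split(",")
    Sensor(p(0), p(1), p(2), p(3).toDouble, p(4).toDouble, p(5).toDouble,
      p(6).toDouble, p(7).toDouble, p(8).toDouble)
  }
}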
The HBase Table Schema for the streaming data is as follows:
Composite row key of the pump name, date, and time stamp
Column Family data with columns corresponding to the input data fields
Column Family alerts with columns corresponding to any filters for alarming values
Note that the data and alerts column families could be set to expire values after a certain amount of time.
The Schema for the daily statistics summary rollups is as follows:
Composite row key of the pump name and date
Column Family stats
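As an illustration, using the assumed Sensor fields from above, the composite row keys could be built like this:

// row key for the raw data and alerts: pump name + date + time stamp
val dataRowKey  = sensor.resid + "_" + sensor.date + " " + sensor.time
// row key for the daily statistics rollup: pump name + date
val statsRowKey = sensor.resid + "_" + sensor.date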
These are the basic steps for Spark Streaming code:
Initialize a Spark StreamingContext object.
Using this context, create a DStream that represents streaming data from a source.
Apply transformations and/or output operations to DStreams.
Start receiving data and processing it using streamingContext.start().
Wait for the processing to be stopped using streamingContext.awaitTermination().
We will go through these steps, showing code from our use case example.
Spark Streaming programs are best run as standalone applications built using Maven or sbt.
First we create a StreamingContext, the main entry point for streaming functionality, with a 2 second batch interval.
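A sketch of this step (the application name is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("SensorStream")
// create a StreamingContext with a 2 second batch interval
val ssc = new StreamingContext(sparkConf, Seconds(2))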
Using this context, we can create a DStream that represents streaming data from a source.
In this example we use the StreamingContext textFileStream(directory) method to create an input stream that monitors a Hadoop-compatible file system for new files and processes any files created in that directory.
This ingestion type supports a workflow where new files are written to a landing directory and Spark Streaming is used to detect them, ingest them, and process the data. Only use this ingestion type with files that are moved or copied into a directory.
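A sketch of this step, with a placeholder directory path:

// create a DStream of text lines from new files appearing in the monitored directory
val linesDStream = ssc.textFileStream("/path/to/sensor/data")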
The linesDStream represents the stream of data; each record is a line of text. Internally, a DStream is a sequence of RDDs, one RDD per batch interval.
Next we parse the lines of data into Sensor objects, with the map operation on the linesDStream.
The map operation applies the Sensor.parseSensor function on the RDDs in the linesDStream, resulting in RDDs of Sensor objects.
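A sketch of this step:

// parse each line of each batch into a Sensor object
val sensorDStream = linesDStream.map(Sensor.parseSensor)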
Any operation applied on a DStream translates to operations on the underlying RDDs. The map operation is applied on each RDD in the linesDStream to generate the RDDs of the sensorDStream.
Next we use the DStream foreachRDD method to apply processing to each RDD in this DStream. We filter the sensor objects for low psi to create an RDD of alert sensor objects.
Here we join the alert data with pump vendor and maintenance information (the pump vendor and maintenance information was read in and cached before streaming started). Each RDD is converted to a DataFrame, registered as a temporary table, and then queried using SQL.
rdd.toDF().registerTempTable("sensor")
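A sketch of this processing, with an illustrative psi threshold and query (the vendor/maintenance join is omitted here):

import org.apache.spark.sql.SQLContext

sensorDStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // filter the sensor objects for low psi to create an RDD of alert sensor objects
  val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)   // 5.0 is an illustrative threshold

  // convert to a DataFrame, register it as a temporary table, and query it with SQL
  alertRDD.toDF().registerTempTable("sensor")
  sqlContext.sql("SELECT resid, date, count(*) AS total FROM sensor GROUP BY resid, date").show()
}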
Next we use the DStream foreachRDD method to apply processing to each RDD in this DStream. We filter the sensor objects for low psi to create alerts, then we write the sensor and alert data to HBase by converting them to Put objects and using the PairRDDFunctions saveAsHadoopDataset method, which outputs the RDD to any Hadoop-supported storage system using a Hadoop Configuration object for that storage system (see Hadoop Configuration for HBase above).
The sensor RDD objects are converted to Put objects and then written to HBase.
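A sketch of the configuration, conversion, and write; the table and column names are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf

// Hadoop/HBase configuration for writes (the table name is illustrative)
val jobConfig = new JobConf(HBaseConfiguration.create())
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, "/user/user01/sensor")

// convert a Sensor object to an HBase Put on the data column family
def convertToPut(sensor: Sensor): (ImmutableBytesWritable, Put) = {
  val rowKey = sensor.resid + "_" + sensor.date + " " + sensor.time
  val put = new Put(Bytes.toBytes(rowKey))
  put.add(Bytes.toBytes("data"), Bytes.toBytes("psi"), Bytes.toBytes(sensor.psi))
  (new ImmutableBytesWritable, put)
}

sensorDStream.foreachRDD { rdd =>
  rdd.map(convertToPut).saveAsHadoopDataset(jobConfig)
}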
To start receiving data, we must explicitly call start() on the StreamingContext, then call awaitTermination to wait for the streaming computation to finish.
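A sketch of this final step:

ssc.start()
ssc.awaitTermination()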
The last way to leverage MapReduce in HBase is to use the HBase database as both a source and a sink in our data flow. One example of this use case is to calculate summaries across the HBase data and then store those summaries back in HBase. For summary jobs where HBase is used as a source and a sink, the writes come from the Reducer step.
This shows the input and output for the TableMapper class map method and the TableReducer class reduce method for reading from and writing to HBase.
A scan Result object and row key are sent to the Mapper map method one row at a time; the map method outputs one or more key-value pairs. The Reducer reduce method receives a key and an iterable list of corresponding values, and outputs a Put object.
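As a rough illustration of the same read-summarize-write pattern, here is a sketch using Spark rather than a TableMapper/TableReducer job; the table and column names are assumptions, and sc is an existing SparkContext:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// read the sensor table as an RDD of (row key, Result) pairs
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "/user/user01/sensor")   // table name is illustrative
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// roll the raw readings up to a daily maximum psi per pump (the aggregation is illustrative)
val dailyMaxPsi = hbaseRDD
  .map { case (_, result) =>
    val rowKey = Bytes.toString(result.getRow)                        // "pumpName_date time"
    val psi = Bytes.toDouble(result.getValue(Bytes.toBytes("data"), Bytes.toBytes("psi")))
    (rowKey.split(" ")(0), psi)                                       // key by "pumpName_date"
  }
  .reduceByKey((a, b) => math.max(a, b))

// dailyMaxPsi could then be converted to Put objects on the stats column family and
// written back with saveAsHadoopDataset, as in the streaming write above.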
We register the DataFrame as a table. Registering it as a table allows us to use it in subsequent SQL statements.
Now we can inspect the data.
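A sketch, assuming a sensorDF DataFrame and a sqlContext from the earlier steps:

// register the DataFrame as a temporary table and query it with SQL
sensorDF.registerTempTable("sensor")
sqlContext.sql("SELECT resid, date, avg(psi) AS avgPsi FROM sensor GROUP BY resid, date").show()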
This post will help you get started using Apache Spark DataFrames with Scala on the MapR Sandbox. The new Spark DataFrames API is designed to make big data processing on tabular data easier. A Spark DataFrame is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.