Streaming Analytics Tutorial
Ashish Gupta (LinkedIn)
Neera Agarwal
Streaming Analytics
Before doing this tutorial please read the main presentation.
http://www.slideshare.net/NeeraAgarwal2/streaming-analytics
Technical Requirements
● OS: Mac OS X
● Programming Language: Scala 2.10.x
● Open source software used in tutorial: Kafka and Spark 1.6.2
Tutorials
[Architecture diagram: producers send Wiki edit events, ad events, and click events to the Kafka topics Wiki, Ads, and Clicks. Spark Streaming consumers read these topics and compute Wikipedia article edit metrics (Tutorial 1) and impression and click metrics (Tutorial 2).]
Step 1: Installation
At the conference we provided a USB stick with the environment ready for the
tutorials. Here are the instructions to create your own environment:
1. Check Java: java -version
java version "1.8.0_92" (Note: Java 1.7+ should work.)
2. Check Maven: mvn -v
If Maven is not installed, see the Additional Notes at the end.
3. Install Scala
curl -O http://downloads.lightbend.com/scala/2.10.6/scala-2.10.6.tgz
tar -xzf scala-2.10.6.tgz
Set SCALA_HOME to the path of the scala-2.10.6 folder. For example, on a Mac:
export SCALA_HOME=/Users/<username>/scala-2.10.6
export PATH=$PATH:$SCALA_HOME/bin
4. Install Spark
curl -O http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz
tar -xzf spark-1.6.2-bin-hadoop2.6.tgz
5. Install Kafka
curl -O http://apache.claz.org/kafka/0.10.0.0/kafka_2.10-0.10.0.0.tgz
tar -xzf kafka_2.10-0.10.0.0.tgz
6. Download tutorial
https://github.com/NeeraAgarwal/kdd2016-streaming-tutorial
Step 2: Start Kafka
Open a terminal window and start ZooKeeper:
> cd kafka_2.10-0.10.0.0
> bin/zookeeper-server-start.sh config/zookeeper.properties
Wait for ZooKeeper to start, then open another terminal window and start Kafka:
> cd kafka_2.10-0.10.0.0
> bin/kafka-server-start.sh config/server.properties
Tutorial 1:
Bot and Human edit counts on Wikipedia Edit stream
Step 1: Listening to WikiPedia Edit Stream
Open a new terminal window and run the WikiPedia connector:
> cd kdd2016-streaming-tutorial
> java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.WikiPediaConnector
After a few messages appear, stop it with CTRL-C. We will run it again after writing
the streaming code.
If it does not run, build the package and try again:
> mvn package
Step 1: WikiPedia Stream message structure
[[-acylglycerol O-acyltransferase]] MB https://en.wikipedia.org/w/index.php?diff=733783045&oldid=721976415 * BU RoBOT * (-1) /* References */Sort into more specific stub template based on presence in [[Category:EC 2.3]] or subcategories (Task 25)
[[City Building]] https://en.wikipedia.org/w/index.php?diff=733783047&oldid=732314994 * Hmains * (+9) refine category structures
[[Wikipedia:Articles for deletion/Log/2016 August 10]] B https://en.wikipedia.org/w/index.php?diff=733783051&oldid=733783026 * Cyberbot
Fields: title, flags, diffUrl, user, byteDiff, summary
Flags (2nd field): ‘M’=Minor, ‘N’ = New, ‘!’ = Unpatrolled, ‘B’ = Bot Edit
WikiPedia Stream pattern = "\\[\\[(.*)\\]\\]\\s(.*)\\s(.*)\\s\\*\\s(.*)\\s\\*\\s\\(\\+?(.\\d*)\\)\\s(.*)".r
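As a minimal sketch, the pattern can be tried on a single message in plain Scala, without Kafka or Spark. The escaped form of the regular expression is an assumption reconstructed from the message layout shown above. The WikiEdit case class is a hypothetical stand-in whose fields follow the "Fields" line, and WikiEditParser, parse, and isBotEdit are illustrative names that may not match the tutorial repository.

```scala
import scala.util.Try

// Hypothetical stand-in for the tutorial's WikiEdit class; field names follow
// the "Fields" line above.
case class WikiEdit(title: String, flags: String, diffUrl: String,
                    user: String, byteDiff: Int, summary: String)

object WikiEditParser {
  // Triple-quoted string, so each backslash is written only once.
  // Assumed reconstruction of the pattern, not copied from the repository.
  val pattern =
    """\[\[(.*)\]\]\s(.*)\s(.*)\s\*\s(.*)\s\*\s\(\+?(.\d*)\)\s(.*)""".r

  // Returns Some(edit) when the whole line matches and byteDiff is numeric,
  // None otherwise.
  def parse(line: String): Option[WikiEdit] = line match {
    case pattern(title, flags, diffUrl, user, byteDiff, summary) =>
      Try(byteDiff.toInt).toOption.map { n =>
        WikiEdit(title, flags, diffUrl, user, n, summary)
      }
    case _ => None
  }

  // A 'B' anywhere in the flags field marks a bot edit.
  def isBotEdit(edit: WikiEdit): Boolean = edit.flags.contains("B")
}
```

Calling WikiEditParser.parse on a message returns None for lines that do not match, which avoids the placeholder record used later in the streaming code; that is a design choice of this sketch only.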
Spark: RDD
An RDD is an immutable distributed collection of objects. Each RDD is split into
multiple partitions, which may be computed on different nodes of the cluster.
RDDs can contain any type of Python, Java, or Scala objects, including
user-defined classes.
RDDs offer two types of operations:
● Transformations construct a new RDD from a previous one.
● Actions compute a result based on an RDD, and either return it to the driver program or
save it to an external storage system.
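The difference between the two kinds of operations can be seen without a cluster. The sketch below is only an analogy built on a plain Scala collection view, not the Spark API: map and filter on the view describe a new collection (like transformations), and nothing is computed until sum asks for a result (like an action). Spark likewise evaluates transformations lazily, when an action runs. All names here are hypothetical.

```scala
object LazyPipeline {
  // Counts how many times the mapping function has actually run.
  var mapCalls = 0

  val byteDiffs = List(-1, 9, 120, -40)

  // "Transformations": each step describes a new collection; nothing runs yet.
  val absolute = byteDiffs.view.map { d => mapCalls += 1; math.abs(d) }
  val large = absolute.filter(_ >= 9)

  // "Action": forces the pipeline and returns a plain value to the caller.
  def totalLarge: Int = large.sum
}
```

Reading LazyPipeline.mapCalls before calling totalLarge shows that the mapping function has not run; it runs only once a result is requested.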
Spark: DStream
A DStream (discretized stream) is a sequence of data arriving over time.
Internally, each DStream is represented as a sequence of RDDs arriving at each time step.
DStreams offer two types of operations:
● Transformations yield a new DStream.
● Output operations write data to an external system.
Ref: https://spark.apache.org/docs/latest/streaming-programming-guide.html
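The idea of a stream as a sequence of per-interval batches can be modeled in plain Scala. This is a conceptual sketch, not Spark code: Event, toBatches, and countPerBatch are hypothetical names, and each inner Seq plays the role of one RDD. Unlike Spark, this model drops intervals in which no events arrived.

```scala
object MicroBatches {
  // Hypothetical event: an arrival time in milliseconds plus a payload.
  case class Event(timeMs: Long, value: String)

  // Group events into consecutive batches of batchMs milliseconds, in time order.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  // A "transformation" applied batch by batch yields a new sequence of batches,
  // here reduced to one count per batch.
  def countPerBatch(batches: Seq[Seq[Event]]): Seq[Int] = batches.map(_.size)
}
```

With a 10-second interval, events arriving at 0, 4 and 9.999 seconds land in the first batch, and an event at 10 seconds starts the second one.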
Step 2: Write Code
In kdd2016-streaming-tutorial, edit src/main/scala/example/WikiPediaStreaming.scala
in your favorite editor.
val lines = messages.foreachRDD { rdd =>
// ADD CODE HERE
}
Note: At the conference, participants wrote this code themselves; the GitHub repository contains the full code.
Step 2: Continued
rdd =>
  val linesDF = rdd.map(row => row._2 match {
    case pattern(title, flags, diffUrl, user, byteDiff, summary) =>
      WikiEdit(title, flags, diffUrl, user, byteDiff.toInt, summary)
    case _ => WikiEdit("title", "flags", "diffUrl", "user", 0, "summary")
  }).filter(row => row.title != "title").toDF()
Step 2 Continued
// Number of records in 10 second window.
val totalCnt = linesDF.count()
// Number of bot edited records in 10 second window.
val botEditCnt = linesDF.filter("flags like '%B%'").count()
// Number of human edited records in 10 second window.
val humanEditCnt = linesDF.filter("flags not like '%B%'").count()
val botEditPct = if (totalCnt > 0) 100 * botEditCnt / totalCnt else 0
val humanEditPct = if (totalCnt > 0) 100 * humanEditCnt / totalCnt else 0
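The same arithmetic can be checked on a local collection. This is a minimal sketch that takes only the flags field of each edit in one batch; EditShare and percentages are hypothetical names. It keeps the integer division used above, so the two percentages are truncated and may not add up to 100.

```scala
object EditShare {
  // Returns (botEditPct, humanEditPct) for one batch of flag strings,
  // using the same integer arithmetic as the streaming code.
  def percentages(flags: Seq[String]): (Long, Long) = {
    val totalCnt = flags.size.toLong
    // A 'B' in the flags marks a bot edit, as in "flags like '%B%'".
    val botEditCnt = flags.count(_.contains("B")).toLong
    val humanEditCnt = totalCnt - botEditCnt
    val botEditPct = if (totalCnt > 0) 100 * botEditCnt / totalCnt else 0L
    val humanEditPct = if (totalCnt > 0) 100 * humanEditCnt / totalCnt else 0L
    (botEditPct, humanEditPct)
  }
}
```

For one bot edit out of three, this yields 33 and 66 rather than 33.3 and 66.7; converting the counts to Double before dividing would keep the fraction.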
Step 3: Build Program
Start a new terminal window.
> cd kdd2016-streaming-tutorial
> mvn package
Step 4: Run Programs
Run WikiPediaConnector in a terminal window. It starts receiving data from the
Wikipedia IRC channel and writes it to Kafka.
> java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.WikiPediaConnector
Run WikiPediaStreaming in a new terminal window.
> cd kdd2016-streaming-tutorial
> ../spark-1.6.2-bin-hadoop2.6/bin/spark-submit --class example.WikiPediaStreaming target/streamingtutorial-1.0.0-jar-with-dependencies.jar
Output
Tutorial 2:
Impression Click metrics on Ad and Click streams
Step 1: Listening to Ads and Clicks stream
Run the program that replays Ad and Click events from a file:
> cd kdd2016-streaming-tutorial
> java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.AdClickEventReplay
After a few messages appear, stop it with CTRL-C. We will run it again after writing
the streaming code.
If it does not run, build the package and try again:
> mvn package
Step 1: Ad and Click Event message structure
Ad Event:
QueryID, AdId, TimeStamp: 6815, 48195, 1470632477761
Click Event:
QueryID, ClickId, TimeStamp: 6815, 93630, 1470632827088
Join on QueryId, show metrics by Ad Id.
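A minimal sketch of this join on local collections, without Spark. The case classes are hypothetical stand-ins: AdEvent is assumed to have an adId field because the streaming code reads it, while the ClickEvent field names are guesses. AdClickJoin and its methods are illustrative names only.

```scala
// Hypothetical stand-ins for the tutorial's event classes.
case class AdEvent(queryId: Int, adId: Int, timeStamp: Long)
case class ClickEvent(queryId: Int, clickId: Int, timeStamp: Long)

object AdClickJoin {
  // Split a comma-separated line and trim the spaces around each field.
  private def fields(line: String): Array[String] = line.split(",").map(_.trim)

  def parseAd(line: String): AdEvent = {
    val f = fields(line)
    AdEvent(f(0).toInt, f(1).toInt, f(2).toLong)
  }

  def parseClick(line: String): ClickEvent = {
    val f = fields(line)
    ClickEvent(f(0).toInt, f(1).toInt, f(2).toLong)
  }

  // Inner join on queryId: one output row per matching (ad, click) pair.
  def joinByQueryId(ads: Seq[AdEvent],
                    clicks: Seq[ClickEvent]): Seq[(Int, (AdEvent, ClickEvent))] =
    for {
      ad <- ads
      click <- clicks
      if ad.queryId == click.queryId
    } yield (ad.queryId, (ad, click))

  // Metric by ad id: number of joined rows for each adId.
  def countByAdId(joined: Seq[(Int, (AdEvent, ClickEvent))]): Map[Int, Int] =
    joined.groupBy(_._2._1.adId).map { case (adId, rows) => adId -> rows.size }
}
```

Parsing the two sample lines above and joining them gives one row keyed by query id 6815, which countByAdId reports as one click for ad 48195.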
Step 2: Write Code
In kdd2016-streaming-tutorial, edit src/main/scala/example/AdEventJoiner.scala.
val adEventDStream = adStream.transform( rdd => {
  rdd.map(line => line._2.split(",")).
    map(row => (row(0).trim.toInt, AdEvent(row(0).trim.toInt, row(1).trim.toInt, row(2).trim.toLong)))
})
// ADD CODE HERE..
Note: At the conference, participants wrote this code themselves; the GitHub repository contains the full code.
Step 2 Continued
// Connect Spark Streaming to the Kafka topic and get a DStream of RDDs (click event messages)
val clickStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, clickStreamTopicSet)
// Create a new DStream by extracting the Kafka message and converting it to DStream[queryId, ClickEvent]
val clickEventDStream = clickStream.transform { rdd =>
  rdd.map(line => line._2.split(",")).
    map(row => (row(0).trim.toInt, ClickEvent(row(0).trim.toInt, row(1).trim.toInt, row(2).trim.toLong)))
}
Step 2 Continued
// Join adEvent and clickEvent DStreams and output DStream[queryId, (adEvent, clickEvent)]
val joinByQueryId = adEventDStream.join(clickEventDStream)
joinByQueryId.print()
// Transform to DStream[adId, count(adId)]: the number of joined rows per ad in each batch
val countByAdId = joinByQueryId.map { case (_, (adEvent, _)) => (adEvent.adId, 1) }.reduceByKey(_ + _)
Step 2 Continued
// Update the state [adId, cumulativeCount(adId)] with values from the next RDDs
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}
val countByAdIdCumm = countByAdId.updateStateByKey(updateFunc)
// Transform (key, value) pair to (adId, count(adId), cumulativeCount(adId))
val ad = countByAdId.join(countByAdIdCumm).map { case (adId, (count, cumCount)) => (adId, count, cumCount) }
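To see how the update function accumulates counts across batches, it can be applied by hand to a few local batches. This is a plain Scala simulation under stated assumptions, not the Spark implementation: RunningCounts, step, and run are hypothetical names, and each batch is a Seq of (adId, count) pairs. As in Spark, the function is called for every key already in the state, even when a batch has no new values for it.

```scala
object RunningCounts {
  // Same shape as the update function above.
  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.sum
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }

  // Apply one batch of (adId, count) pairs to the running state.
  def step(state: Map[Int, Int], batch: Seq[(Int, Int)]): Map[Int, Int] = {
    // New values per key in this batch.
    val newValues: Map[Int, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    // Keys already in the state plus keys first seen in this batch.
    val keys = (state.keySet ++ newValues.keySet).toSeq
    (for {
      k <- keys
      updated <- updateFunc(newValues.getOrElse(k, Seq.empty[Int]), state.get(k))
    } yield k -> updated).toMap
  }

  // Fold all batches, in order, into the final cumulative counts.
  def run(batches: Seq[Seq[(Int, Int)]]): Map[Int, Int] =
    batches.foldLeft(Map.empty[Int, Int])(step)
}
```

Note that Spark's updateStateByKey also requires a checkpoint directory to be set on the StreamingContext; this simulation has no equivalent because the state is just a local Map.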
Step 2 Continued
// Print report
ad.foreachRDD( rdd => {
  println("%5s %10s %12s".format("AdId", "AdCount", "AdCountCumm"))
  // collect() brings the rows to the driver so they print in the driver's console
  rdd.collect().foreach( row => println("%5s %10s %12s".format(row._1, row._2, row._3)))
})
Step 3: Build Program
> cd kdd2016-streaming-tutorial
> mvn package
Step 4: Run Programs
Run AdClickEventReplay in a terminal window. It reads data from the Ad and Click event
files and writes it to Kafka.
> cd kdd2016-streaming-tutorial
> java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.AdClickEventReplay
Run AdEventJoiner in a new terminal window.
> cd kdd2016-streaming-tutorial
> ../spark-1.6.2-bin-hadoop2.6/bin/spark-submit --class example.AdEventJoiner target/streamingtutorial-1.0.0-jar-with-dependencies.jar
Output
Contact Us
Ashish Gupta - ahgupta@linkedin.com
https://www.linkedin.com/in/guptash
Neera Agarwal - neera8work@gmail.com
https://www.linkedin.com/in/neera-agarwal-21b9473
Additional Notes - Install Java (Mac 10.11)
● java -version
java version "1.8.0_92"
If java does not exist, add this line to .bash_profile:
export JAVA_HOME=$(/usr/libexec/java_home)
(Install Java - https://java.com/en/download/help/mac_install.xml)
Additional Notes - Install Maven
● Check Maven on a terminal window
○ mvn -v
○ Apache Maven 3.2.5+
● Install Maven
○ brew install maven
Or, if you do not have brew:
1. curl -O http://mirror.nexcess.net/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
2. tar -xzvf apache-maven-3.3.9-bin.tar.gz

 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationJuha-Pekka Tolvanen
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Bert Jan Schrijver
 

Último (20)

%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 

KDD 2016 Streaming Analytics Tutorial

  • 1. Streaming Analytics Tutorial Ashish Gupta (LinkedIn) Neera Agarwal
  • 2. Streaming Analytics Before doing this tutorial please read the main presentation. http://www.slideshare.net/NeeraAgarwal2/streaming-analytics
  • 3. Technical Requirements ● OS: MAC OS X ● Programming Language: Scala 2.10.x ● Open source software used in tutorial: Kafka and Spark 1.6.2
  • 4. Tutorials [Architecture diagram] Producers emit Wiki Edit Events, Ad Events and Click Events into the Kafka topics Wiki, Ads and Clicks; Spark Streaming consumers read from them. Tutorial 1 computes WikiPedia Article Edit Metrics; Tutorial 2 computes Impression & Click Metrics.
  • 5. Step 1: Installation At the conference we provided a USB stick with the environment ready for the tutorials. Here are the instructions to create your own environment: 1. Check Java: java -version java version "1.8.0_92" (Note: Java 1.7+ should work.) 2. Check Maven: mvn -v If not installed, check the instructions in the Additional Notes slides at the end.
  • 6. Step 1: Installation 3. Install Scala curl -O http://downloads.lightbend.com/scala/2.10.6/scala-2.10.6.tgz tar -xzf scala-2.10.6.tgz Set scala home to the path pointing to scala-2.10.6 folder. For example on Mac: export SCALA_HOME=/Users/<username>/scala-2.10.6 export PATH=$PATH:$SCALA_HOME/bin 4. Install Spark curl -O http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz
  • 7. Step 1: Installation 5. Install Kafka curl -O http://apache.claz.org/kafka/0.10.0.0/kafka_2.10-0.10.0.0.tgz tar -xzf kafka_2.10-0.10.0.0.tgz 6. Download tutorial https://github.com/NeeraAgarwal/kdd2016-streaming-tutorial
  • 8. Step 2: Start Kafka Start a terminal window. Start Zookeeper > cd kafka_2.10-0.10.0.0 > bin/zookeeper-server-start.sh config/zookeeper.properties Wait for zookeeper to start. Start another terminal window and start Kafka > cd kafka_2.10-0.10.0.0 > bin/kafka-server-start.sh config/server.properties
  • 9. Tutorial 1: Bot and Human edit counts on Wikipedia Edit stream
  • 10. Step 1: Listening to WikiPedia Edit Stream Start a new terminal window. Run WikiPedia Connector > cd kdd2016-streaming-tutorial > java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.WikiPediaConnector After some messages, stop it using CTRL-C. We will run it again after writing the streaming code. If it does not run, build the package and try again: > mvn package
  • 11. Step 1: WikiPedia Stream message structure [[-acylglycerol O-acyltransferase]] MB https://en.wikipedia.org/w/index.php?diff=733783045&oldid=721976415 * BU RoBOT * (-1) /* References */Sort into more specific stub template based on presence in [[Category:EC 2.3]] or subcategories (Task 25) [[City Building]] https://en.wikipedia.org/w/index.php?diff=733783047&oldid=732314994 * Hmains * (+9) refine category structures [[Wikipedia:Articles for deletion/Log/2016 August 10]] B https://en.wikipedia.org/w/index.php?diff=733783051&oldid=733783026 * Cyberbot Fields: title, flags, diffUrl, user, byteDiff, summary Flags (2nd field): ‘M’ = Minor, ‘N’ = New, ‘!’ = Unpatrolled, ‘B’ = Bot Edit WikiPedia Stream pattern = "\\[\\[(.*)\\]\\]\\s(.*)\\s(.*)\\s\\*\\s(.*)\\s\\*\\s\\(\\+?(.\\d*)\\)\\s(.*)".r
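The message pattern from slide 11 can be tried outside Spark. The following is a minimal sketch in plain Scala, assuming the backslashes that were lost from the slide text are restored as shown; the object name `WikiEditParseSketch` and the `parse` helper are hypothetical and not part of the tutorial repository. The `WikiEdit` fields follow the field list on the slide.

```scala
object WikiEditParseSketch {
  // Field names follow the slide: title, flags, diffUrl, user, byteDiff, summary.
  case class WikiEdit(title: String, flags: String, diffUrl: String,
                      user: String, byteDiff: Int, summary: String)

  // Assumed reconstruction of the slide's pattern, written as a raw string
  // so that each regex escape needs only a single backslash.
  val pattern = """\[\[(.*)\]\]\s(.*)\s(.*)\s\*\s(.*)\s\*\s\(\+?(.\d*)\)\s(.*)""".r

  // Hypothetical helper: returns None when the line does not match the pattern.
  def parse(line: String): Option[WikiEdit] = line match {
    case pattern(title, flags, diffUrl, user, byteDiff, summary) =>
      Some(WikiEdit(title, flags, diffUrl, user, byteDiff.toInt, summary))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    // Shortened version of the first sample message on the slide.
    val sample = "[[-acylglycerol O-acyltransferase]] MB " +
      "https://en.wikipedia.org/w/index.php?diff=733783045&oldid=721976415 " +
      "* BU RoBOT * (-1) /* References */Sort into more specific stub template"
    println(parse(sample))
  }
}
```

Lines that do not fit the pattern yield `None` here, which plays the same role as the placeholder `WikiEdit("title", ...)` record that the tutorial code filters out afterwards.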
  • 12. Spark: RDD An RDD is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of objects, including Python, Java, Scala or user- defined classes. RDDs offer two types of operations: ● Transformations construct a new RDD from a previous one. ● Actions compute a result based on an RDD, and either return it to the driver program or save it to an external storage system.
  • 13. Spark: DStream DStream is a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs arriving at each time step. DStreams offer two types of operations: ● Transformations yield a new DStream. ● Output operations write data to an external system. Ref: https://spark.apache.org/docs/latest/streaming-programming-guide.html
  • 14. Step 2: Write Code In kdd2016-streaming-tutorial, change the code in src/main/scala/example/WikiPediaStreaming.scala. Use your favorite editor. val lines = messages.foreachRDD { rdd => // ADD CODE HERE } Note: At the conference, participants were asked to write the code themselves; the GitHub repository provides the full code.
  • 15. Step 2: Continued rdd => val linesDF = rdd.map(row => row._2 match { case pattern(title, flags, diffUrl, user, byteDiff, summary) => WikiEdit(title, flags, diffUrl, user, byteDiff.toInt, summary) case _ => WikiEdit("title", "flags", "diffUrl", "user", 0, "summary") }).filter(row => row.title != "title").toDF()
  • 16. Step 2 Continued // Number of records in 10 second window. val totalCnt = linesDF.count() // Number of bot edited records in 10 second window. val botEditCnt = linesDF.filter("flags like '%B%'").count() // Number of human edited records in 10 second window. val humanEditCnt = linesDF.filter("flags not like '%B%'").count() val botEditPct = if (totalCnt > 0) 100 * botEditCnt / totalCnt else 0 val humanEditPct = if (totalCnt > 0) 100 * humanEditCnt / totalCnt else 0
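The counting logic on slide 16 can be checked on a small in-memory batch. This is a sketch in plain Scala rather than the Spark DataFrame API; `EditCountSketch`, `Edit` and `percentages` are hypothetical names, and `flags.contains("B")` is assumed to stand in for the SQL filter `flags like '%B%'`.

```scala
object EditCountSketch {
  // Hypothetical, trimmed-down record: only the flags field matters for the counts.
  case class Edit(user: String, flags: String)

  // Returns (botEditPct, humanEditPct) for one batch, using the same
  // integer arithmetic as the slide.
  def percentages(edits: Seq[Edit]): (Long, Long) = {
    val totalCnt = edits.size.toLong
    val botEditCnt = edits.count(e => e.flags.contains("B")).toLong
    val humanEditCnt = edits.count(e => !e.flags.contains("B")).toLong
    val botEditPct = if (totalCnt > 0) 100 * botEditCnt / totalCnt else 0L
    val humanEditPct = if (totalCnt > 0) 100 * humanEditCnt / totalCnt else 0L
    (botEditPct, humanEditPct)
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(Edit("BU RoBOT", "MB"), Edit("Hmains", ""), Edit("Cyberbot", "B"))
    println(percentages(batch))
  }
}
```

Because the division is on integers, the two percentages are truncated and may not add up to 100; converting to `Double` before dividing would avoid that if exact shares matter.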
  • 17. Step 3: Build Program Start a new terminal window. > cd kdd2016-streaming-tutorial > mvn package
  • 18. Step 4: Run Programs Run WikiPediaConnector in a terminal window. It starts receiving data from the WikiPedia IRC channel and writes it to Kafka. > java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.WikiPediaConnector Run WikiPediaStreaming in a new terminal window. > cd kdd2016-streaming-tutorial > ../spark-1.6.2-bin-hadoop2.6/bin/spark-submit --class example.WikiPediaStreaming target/streamingtutorial-1.0.0-jar-with-dependencies.jar
  • 20. Tutorial 2: Impression Click metrics on Ad and Click streams
  • 21. Tutorials [Architecture diagram] Producers emit Wiki Edit Events, Ad Events and Click Events into the Kafka topics Wiki, Ads and Clicks; Spark Streaming consumers read from them. Tutorial 1 computes WikiPedia Edit Metrics; Tutorial 2 computes Impression & Click Metrics.
  • 22. Step 1: Listening to Ads and Clicks stream Run the program to replay Ad and Click Events from file > cd kdd2016-streaming-tutorial > java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.AdClickEventReplay After some messages, stop it using CTRL-C. We will run it again after writing the streaming code. If it does not run, build the package and try again: > mvn package
  • 23. Step 1: Ad and Click Event message structure Ad Event: QueryID, AdId, TimeStamp: 6815, 48195, 1470632477761 Click Event: QueryID, ClickId, TimeStamp: 6815, 93630, 1470632827088 Join on QueryId, show metrics by Ad Id.
  • 24. Step 2: Write Code In kdd2016-streaming-tutorial, change the code in src/main/scala/example/AdEventJoiner.scala. val adEventDStream = adStream.transform( rdd => { rdd.map(line => line._2.split(",")). map(row => (row(0).trim.toInt, AdEvent(row(0).trim.toInt, row(1).trim.toInt, row(2).trim.toLong))) }) // ADD CODE HERE.. Note: At the conference, participants were asked to write the code themselves; the GitHub repository provides the full code.
  • 25. Step 2 Continued //Connects Spark Streaming to Kafka Topic and gets DStream of RDDs (click event message) val clickStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, clickStreamTopicSet) //Create a new DStream by extracting kafka message and converting it to DStream[queryId, ClickEvent] val clickEventDStream = clickStream.transform{ rdd => rdd.map(line => line._2.split(",")). map(row => (row(0).trim.toInt, ClickEvent(row(0).trim.toInt, row(1).trim.toInt, row(2).trim.toLong))) }
  • 26. Step 2 Continued // Join adEvent and clickEvent DStreams and output DStream[queryId, (adEvent, clickEvent)] val joinByQueryId = adEventDStream.join(clickEventDStream) joinByQueryId.print() // Transform DStream to DStream[adId, count(adId)] for each RDD val countByAdId = joinByQueryId.map(rdd => (rdd._2._1.adId,1)).reduceByKey(_+_)
  • 27. Step 2 Continued // Update the state [adId, countCumulative(adId)] with values from the next RDDs val updateFunc = (values: Seq[Int], state: Option[Int]) => { val currentCount = values.sum val previousCount = state.getOrElse(0) Some(currentCount + previousCount) } val countByAdIdCumm = countByAdId.updateStateByKey(updateFunc) // Transform (key, value) pair to (adId, count(adId), countCumulative(adId)) val ad = countByAdId.join(countByAdIdCumm).map {case (adId, (count, cumCount)) => (adId, count, cumCount)}
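The join, per-batch count and cumulative update from slides 26 and 27 can be traced on in-memory collections. The sketch below uses plain Scala collections instead of DStreams, so it only imitates the semantics: `countByAdId` stands in for the per-batch `join` followed by `reduceByKey`, and the hypothetical `updateState` helper applies the slide's `updateFunc` to every key, which is assumed here to mirror what `updateStateByKey` does between batches. The object name and sample events in the usage are illustrative.

```scala
object AdCountSketch {
  case class AdEvent(queryId: Int, adId: Int, timestamp: Long)
  case class ClickEvent(queryId: Int, clickId: Int, timestamp: Long)

  // Same logic as the slide's updateFunc: add this batch's values to the previous state.
  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount = values.sum
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }

  // Inner-join ads and clicks of one batch on queryId, then count joined pairs per adId.
  def countByAdId(ads: Seq[AdEvent], clicks: Seq[ClickEvent]): Map[Int, Int] = {
    val joined = for {
      ad <- ads
      click <- clicks
      if ad.queryId == click.queryId
    } yield (ad.queryId, (ad, click))
    joined.groupBy { case (_, (ad, _)) => ad.adId }
      .map { case (adId, pairs) => (adId, pairs.size) }
  }

  // Hypothetical helper: fold one batch's counts into the running state, key by key.
  def updateState(state: Map[Int, Int], batch: Map[Int, Int]): Map[Int, Int] = {
    val keys = state.keySet ++ batch.keySet
    keys.map { adId =>
      adId -> updateFunc(batch.get(adId).toSeq, state.get(adId)).get
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batch1 = countByAdId(
      Seq(AdEvent(6815, 48195, 1470632477761L), AdEvent(6816, 48195, 1470632477762L)),
      Seq(ClickEvent(6815, 93630, 1470632827088L), ClickEvent(6816, 93631, 1470632827089L)))
    val batch2 = countByAdId(
      Seq(AdEvent(6900, 48195, 1470632487761L)),
      Seq(ClickEvent(6900, 93700, 1470632837088L)))
    val state1 = updateState(Map.empty[Int, Int], batch1)
    val state2 = updateState(state1, batch2)
    println(state2)
  }
}
```

In this sketch, as in the per-batch DStream join, an ad and its click are matched only when both fall in the same batch; a click that arrives in a later batch than its ad is not counted.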
  • 28. Step 2 Continued //Print report ad.foreachRDD( ad => { println("%5s %10s %12s".format("AdId", "AdCount", "AdCountCumm")) ad.foreach( row => println("%5s %10s %12s".format(row._1, row._2, row._3))) })
  • 29. Step 3: Build Program > cd kdd2016-streaming-tutorial > mvn package
  • 30. Step 4: Run Programs Run AdClickEventReplay in a terminal window. It reads data from the Ad and Click event files and writes it to Kafka. > cd kdd2016-streaming-tutorial > java -cp target/streamingtutorial-1.0.0-jar-with-dependencies.jar example.AdClickEventReplay Run AdEventJoiner in a new terminal window. > cd kdd2016-streaming-tutorial > ../spark-1.6.2-bin-hadoop2.6/bin/spark-submit --class example.AdEventJoiner target/streamingtutorial-1.0.0-jar-with-dependencies.jar
  • 32. Contact Us Ashish Gupta - ahgupta@linkedin.com https://www.linkedin.com/in/guptash Neera Agarwal - neera8work@gmail.com https://www.linkedin.com/in/neera-agarwal-21b9473
  • 33. Additional Notes - Install Java (Mac 10.11) ● java -version java version "1.8.0_92" If java is not found, add the following line to .bash_profile: export JAVA_HOME=$(/usr/libexec/java_home) (Install Java - https://java.com/en/download/help/mac_install.xml)
  • 34. Additional Notes - Install Maven ● Check Maven on a terminal window ○ mvn -v ○ Apache Maven 3.2.5+ ● Install Maven ○ brew install maven OR if you do not have brew then do: 1. curl -O http://mirror.nexcess.net/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz 2. tar -xzvf apache-maven-3.3.9-bin.tar.gz

Editor's notes

  1. Walk through the code.
  2. Transformations: map, filter, join, groupByKey, reduceByKey.
  3. Now explain the Wikipedia stream and the processing code.
  4. xcode-select --install