Spark Streaming 
and Friends 
Chris Fregly 
Global Big Data Conference 
Sept 2014 
Kinesis 
Streaming
Who am I? 
Former Netflix’er:
netflix.github.io 
Spark Contributor: 
github.com/apache/spark 
Founder: 
fluxcapacitor.com 
Author: 
effectivespark.com 
sparkinaction.com
Spark Streaming-Kinesis Jira
Quick Poll 
• Hadoop, Hive, Pig?
• Spark, Spark Streaming?
• EMR, Redshift?
• Flume, Kafka, Kinesis, Storm?
• Lambda Architecture?
• Bloom Filters, HyperLogLog?
“Streaming”
Kinesis 
Streaming 
Video 
Streaming 
Piping 
Big Data 
Streaming
Agenda 
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations
Spark Overview (1/2) 
• Berkeley AMPLab ~2009
• Part of Berkeley Data Analytics Stack (BDAS, aka “badass”)
Spark Overview (2/2) 
• Based on Microsoft Dryad paper ~2007
• Written in Scala
• Supports Java, Python, SQL, and R
• In-memory when possible, not required
• Improved efficiency over MapReduce
  – 100x in-memory, 2-10x on-disk
• Compatible with Hadoop
  – File formats, SerDes, and UDFs
Spark Use Cases 
• Ad hoc, exploratory, interactive analytics
• Real-time + Batch Analytics
  – Lambda Architecture
• Real-time Machine Learning
• Real-time Graph Processing
• Approximate, Time-bound Queries
Explosion of Specialized Systems
Unified Spark Libraries 
• Spark SQL (Data Processing)
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Processing)
• BlinkDB (Approximate Queries)
• Statistics (Correlations, Sampling, etc.)
• Others
  – Shark (Hive on Spark)
  – Spork (Pig on Spark)
Unified Benefits 
• Advancements in higher-level libraries pushed down into core and vice versa
• Examples
  – Spark Streaming: GC and memory management improvements
  – Spark GraphX: IndexedRDD for random, hashed access within a partition versus scanning entire partition
Spark API
Resilient Distributed Dataset (RDD) 
• Core Spark abstraction
• Represents partitions across the cluster nodes
• Enables parallel processing on data sets
• Partitions can be in-memory or on-disk
• Immutable, recomputable, fault tolerant
• Contains transformation lineage on data set
RDD Lineage
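One rough way to picture lineage: each dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt by replaying those steps. The sketch below is plain Python on a single machine, not the Spark API; `MiniRDD` and all of its methods are hypothetical names used only for illustration.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, partitions=None, parent=None, fn=None, name="source"):
        self._source = partitions   # only set for the root dataset
        self.parent = parent        # lineage: where this dataset came from
        self.fn = fn                # lineage: per-partition function to replay
        self.name = name
        self._cache = {}            # partition index -> computed rows

    def map(self, f):
        return MiniRDD(parent=self, fn=lambda rows: [f(r) for r in rows], name="map")

    def filter(self, pred):
        return MiniRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)], name="filter")

    def num_partitions(self):
        return len(self._source) if self.parent is None else self.parent.num_partitions()

    def compute(self, i):
        """Return partition i, recomputing it from the parent if it is not cached."""
        if i not in self._cache:
            if self.parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self.fn(self.parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a worker failure by dropping one cached partition."""
        self._cache.pop(i, None)

    def lineage(self):
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return " <- ".join(chain)

    def collect(self):
        return [row for i in range(self.num_partitions()) for row in self.compute(i)]


source = MiniRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.collect()
evens_squared.lose_partition(1)   # the "worker" holding partition 1 dies
after = evens_squared.collect()   # partition 1 is rebuilt by replaying its lineage
```

Only the lost partition is recomputed here; partition 0 stays cached, which is the per-partition recovery idea the RDD slide describes.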
Spark API Overview 
• Richer, more expressive than MapReduce
• Native support for Java, Scala, Python, SQL, and R (mostly)
• Unified API across all libraries
• Operations = Transformations + Actions
Transformations
Actions
Spark Execution Model
Spark Execution Model Overview 
• Parallel, distributed
• DAG-based
• Lazy evaluation
• Allows optimizations
  – Reduce disk I/O
  – Reduce shuffle I/O
  – Parallel execution
  – Task pipelining
• Data locality and rack awareness
• Worker node fault tolerance using RDD lineage graphs per partition
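Lazy evaluation and task pipelining can be illustrated with Python generators. This is only an analogy on one machine, not how Spark schedules tasks; `lazy_map`, `lazy_filter`, and `trace` are hypothetical names.

```python
trace = []  # records the order in which work actually happens


def lazy_map(f, rows):
    for r in rows:
        trace.append(f"map({r})")
        yield f(r)


def lazy_filter(pred, rows):
    for r in rows:
        trace.append(f"filter({r})")
        if pred(r):
            yield r


partition = [1, 2, 3]

# "Transformations": only a plan is built here, nothing is computed yet.
plan = lazy_filter(lambda x: x > 10, lazy_map(lambda x: x * 10, partition))
work_before_action = list(trace)

# "Action": pulling results runs map and filter back to back for each element,
# with no intermediate collection materialized between the two steps.
result = list(plan)
```

After the action, `trace` alternates between map and filter entries, showing that both steps ran in a single pass over the partition.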
Execution Optimizations
Spark Cluster Deployment
Spark Cluster Deployment
Master High Availability 
• Multiple Master Nodes
• ZooKeeper maintains current Master
• Existing applications and workers will be notified of new Master election
• New applications and workers need to explicitly specify current Master
• Alternatives (Not recommended)
  – Local filesystem
  – NFS Mount
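As a hedged illustration of the ZooKeeper-based setup: Spark's standalone-mode documentation describes the `spark.deploy.recoveryMode`, `spark.deploy.zookeeper.url`, and `spark.deploy.zookeeper.dir` properties (passed through `SPARK_DAEMON_JAVA_OPTS`) and a comma-separated master URL. The helper functions and host names below are hypothetical and only assemble those strings; check the documentation for your Spark version before relying on them.

```python
def daemon_java_opts(zookeeper_hosts, zookeeper_dir="/spark"):
    """Build a candidate value for SPARK_DAEMON_JAVA_OPTS on each Master node."""
    opts = {
        "spark.deploy.recoveryMode": "ZOOKEEPER",
        "spark.deploy.zookeeper.url": ",".join(zookeeper_hosts),
        "spark.deploy.zookeeper.dir": zookeeper_dir,
    }
    return " ".join(f"-D{key}={value}" for key, value in opts.items())


def master_url(master_hosts, port=7077):
    """Build a master URL that lists every Master, active or standby."""
    return "spark://" + ",".join(f"{host}:{port}" for host in master_hosts)


# Hypothetical host names, for illustration only.
opts = daemon_java_opts(["zk1:2181", "zk2:2181", "zk3:2181"])
url = master_url(["master1", "master2"])
```

Listing every Master in the URL is one way for new applications and workers to find whichever Master is currently active.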
Spark Streaming
Spark Streaming Overview 
• Low latency, high throughput, fault tolerance (mostly)
• Long-running Spark application
• Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc.
• Graceful shutdown, in-flight message draining
• Uses Spark Core, DAG Execution Model, and Fault Tolerance
Spark Streaming Use Cases 
• ETL on streaming data during ingestion
• Anomaly, malware, and fraud detection
• Operational dashboards
• Lambda architecture
  – Unified batch and streaming
  – e.g., different machine learning models for different time frames
• Predictive maintenance
  – Sensors
• NLP analysis
  – Twitter firehose
Discretized Stream (DStream) 
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD
• Fault tolerance using DStream/RDD lineage
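The micro-batch idea can be sketched without Spark: a continuous stream of timestamped events is cut into consecutive fixed-length intervals, and each interval becomes one small batch. `discretize` is a hypothetical helper, and the sketch assumes non-negative timestamps in seconds.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive micro-batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    last = max(batches)
    # Emit every interval in time order, including empty ones.
    return [batches.get(i, []) for i in range(last + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (3.5, "c")]
micro_batches = discretize(events, batch_interval=1.0)
counts_per_batch = [len(batch) for batch in micro_batches]
```

Each element of `micro_batches` plays the role of one RDD in the DStream, so the usual batch operations can be applied to it.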
Spark Streaming API
Spark Streaming API Overview 
• Rich, expressive API similar to core
• Operations
  – Transformations
  – Actions
• Window and State Operations
• Requires checkpointing to snip long-running DStream lineage
• Register DStream as a Spark SQL table for querying!
DStream Transformations
DStream Actions
Window and State DStream Operations
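A minimal sketch of the two ideas, in plain Python rather than the DStream API: a window operation recomputes over the last few micro-batches, while a state operation carries a running value from batch to batch. Window length and slide interval are expressed here in numbers of batches; the function names are hypothetical.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count words over the last `window_length` batches, every `slide_interval` batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        window = batches[max(0, end - window_length):end]
        results.append(Counter(word for batch in window for word in batch))
    return results


def running_counts(batches):
    """Carry a per-key running total from batch to batch (a stateful operation)."""
    state = Counter()
    snapshots = []
    for batch in batches:
        state.update(batch)
        snapshots.append(dict(state))
    return snapshots


batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]
windows = windowed_counts(batches, window_length=2, slide_interval=1)
totals = running_counts(batches)
```

The running state depends on every batch since the start of the stream, which is why stateful operations need periodic checkpoints to keep the lineage from growing without bound.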
DStream Example
Spark Streaming Cluster 
Deployment
Spark Streaming Cluster Deployment
Scaling Receivers
Scaling Processors
Spark Streaming 
+ 
Kinesis
Spark Streaming + Kinesis Architecture 
[Figure: Spark Streaming Kinesis Architecture. Kinesis Producers write to a Kinesis Stream (Shards 1 to 3). The Kinesis Spark Streaming application runs Kinesis Receiver DStream 1 and Kinesis Receiver DStream 2, each using the Kinesis Client Library. Receiver DStream 1 runs Kinesis Record Processor Threads 1 and 2; Receiver DStream 2 runs Kinesis Record Processor Thread 1.]
Throughput and Pricing 
[Figure: Spark Streaming Kinesis Throughput and Pricing. A Kinesis Producer writes to a Kinesis Stream (Shard 1), which is read by the Spark Kinesis Receiver in the Spark Streaming application.]
• < 10 second delay
• Writes: 1 MB/sec per shard, 1000 PUTs/sec, 50K/PUT
• Reads: 2 MB/sec per shard
• Shard cost: $0.36 per day per shard
• PUT cost: $2.50 per day per shard
• Network transfer cost: free within region!
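The per-shard figures on this slide lend themselves to a quick sizing calculation. The sketch below uses only the slide's 2014 numbers and assumes that 1 MB/sec and 1000 PUTs/sec are write limits, that 2 MB/sec is the read limit, and that the PUT cost applies as a flat daily charge per shard; the function names and the example workload are hypothetical.

```python
import math

# Per-shard figures as quoted on the slide (2014 limits and pricing).
WRITE_MB_PER_SEC = 1.0
WRITE_PUTS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0
SHARD_COST_PER_DAY = 0.36
PUT_COST_PER_DAY_PER_SHARD = 2.50


def shards_needed(ingest_mb_per_sec, puts_per_sec, read_mb_per_sec):
    """Smallest shard count that satisfies every per-shard limit."""
    return max(
        1,
        math.ceil(ingest_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(puts_per_sec / WRITE_PUTS_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
    )


def max_daily_cost(shards):
    """Upper bound: treats the slide's PUT cost as a flat per-shard daily charge."""
    return round(shards * (SHARD_COST_PER_DAY + PUT_COST_PER_DAY_PER_SHARD), 2)


# Hypothetical workload: 4.5 MB/sec in, 2500 PUTs/sec, 9 MB/sec read by consumers.
shards = shards_needed(ingest_mb_per_sec=4.5, puts_per_sec=2500, read_mb_per_sec=9.0)
cost = max_daily_cost(shards)
```

Whichever limit is tightest decides the shard count, so a read-heavy application may need more shards than its write rate alone suggests.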
Demo! 
Kinesis 
Streaming 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/…
Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
Spark Streaming 
Fault Tolerance
Fault Tolerance 
• Points of Failure
  – Receiver
  – Driver
  – Worker/Processor
• Solutions
  – Data Replication
  – Secondary/Backup Nodes
  – Checkpoints
Streaming Receiver Failure 
• Use a backup receiver
• Use multiple receivers pulling from multiple shards
  – Use checkpoint-enabled, sharded streaming source (e.g., Kafka and Kinesis)
• Data is replicated to 2 nodes immediately upon ingestion
• Possible data loss
• Possible at-least-once semantics
• Use buffered sources (e.g., Kafka and Kinesis)
Streaming Driver Failure 
• Use a backup Driver
  – Use DStream metadata checkpoint info to recover
• Single point of failure: interrupts stream processing
• Streaming Driver is a long-running Spark application
  – Schedules long-running stream receivers
• State and Window RDD checkpoints help avoid data loss (mostly)
Stream Worker/Processor Failure 
• No problem!
• DStream RDD partitions will be recalculated from lineage
Types of Checkpoints 
Spark 
1. Spark checkpointing of StreamingContext 
DStreams and metadata 
2. Lineage of state and window DStream 
operations 
Kinesis 
3. Kinesis Client Library (KCL) checkpoints current position within shard
  – Checkpoint info is stored in DynamoDB per Kinesis application, keyed by shard
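To see why position checkpoints give at-least-once rather than exactly-once behavior, consider this small simulation. It is not the Kinesis Client Library; `CheckpointStore` is a hypothetical in-memory stand-in for the per-shard checkpoint table, and `consume` is a hypothetical record processor.

```python
class CheckpointStore:
    """Hypothetical in-memory stand-in for a checkpoint table keyed by shard."""

    def __init__(self):
        self._positions = {}

    def save(self, shard_id, position):
        self._positions[shard_id] = position

    def load(self, shard_id):
        return self._positions.get(shard_id, 0)


def consume(shard_id, records, store, processed, checkpoint_every, crash_after=None):
    """Process records starting at the last checkpoint; optionally crash part-way."""
    start = store.load(shard_id)
    for offset in range(start, len(records)):
        if crash_after is not None and offset - start == crash_after:
            return False                      # simulated receiver failure
        processed.append(records[offset])
        if (offset + 1) % checkpoint_every == 0:
            store.save(shard_id, offset + 1)  # next position to read
    store.save(shard_id, len(records))
    return True


records = ["r0", "r1", "r2", "r3", "r4"]
store, processed = CheckpointStore(), []

# The first attempt fails after three records; the last checkpoint is at position 2.
finished_first = consume("shard-1", records, store, processed, checkpoint_every=2, crash_after=3)
# The restart resumes from the checkpoint, so r2 is processed a second time.
finished_second = consume("shard-1", records, store, processed, checkpoint_every=2)
```

No record is lost, but anything processed after the last checkpoint is replayed, so downstream logic should tolerate duplicates.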
Spark Streaming 
Monitoring and Tuning
Monitoring 
• Monitor driver, receiver, worker nodes, and streams
• Alert upon failure or unusually high latency
• Spark Web UI
  – Streaming tab
• Ganglia, CloudWatch
• StreamingListener callback
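A listener-style latency alert might look roughly like the sketch below. It is loosely modelled on the idea of a batch-completed callback and is not Spark's `StreamingListener` interface; the class, method, and parameter names are hypothetical.

```python
class LatencyAlertListener:
    """Hypothetical callback object that alerts when a batch falls behind."""

    def __init__(self, batch_interval_ms, alert):
        self.batch_interval_ms = batch_interval_ms
        self.alert = alert  # any callable that delivers the alert message

    def on_batch_completed(self, batch_id, scheduling_delay_ms, processing_time_ms):
        total_ms = scheduling_delay_ms + processing_time_ms
        if total_ms > self.batch_interval_ms:
            self.alert(f"batch {batch_id}: {total_ms} ms exceeds "
                       f"{self.batch_interval_ms} ms batch interval")


alerts = []
listener = LatencyAlertListener(batch_interval_ms=2000, alert=alerts.append)
listener.on_batch_completed(batch_id=1, scheduling_delay_ms=50, processing_time_ms=900)
listener.on_batch_completed(batch_id=2, scheduling_delay_ms=1500, processing_time_ms=1200)
```

In practice the `alert` callable could publish a metric or page an operator; here it only appends to a list so the behavior is easy to inspect.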
Spark Web UI
Tuning 
• Batch interval
  – High: reduces overhead of submitting new tasks for each batch
  – Low: keeps latencies low
  – Sweet spot: DStream job time (scheduling + processing) is steady and less than batch interval
• Checkpoint interval
  – High: reduces checkpointing overhead
  – Low: reduces amount of data lost on failure
  – Recommendation: 5-10x sliding window interval
• Use DStream.repartition() to increase parallelism of processing DStream jobs across cluster
• Use spark.streaming.unpersist=true to let the Streaming Framework figure out when to unpersist
• Use CMS GC for consistent processing times
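The two interval rules of thumb can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: "steady" is approximated by comparing the mean of the first and second half of recent job times, and the 10% drift tolerance is an arbitrary illustrative choice, as are the function names.

```python
from statistics import mean


def batch_interval_is_sustainable(job_times_sec, batch_interval_sec, drift_tolerance=0.1):
    """True if every job finishes within the interval and times are not trending upward."""
    if max(job_times_sec) >= batch_interval_sec:
        return False
    half = len(job_times_sec) // 2
    if half == 0:
        return True
    early, late = mean(job_times_sec[:half]), mean(job_times_sec[half:])
    return late <= early * (1 + drift_tolerance)


def checkpoint_interval_range(sliding_interval_sec):
    """The slide's rule of thumb: 5-10x the sliding window interval."""
    return 5 * sliding_interval_sec, 10 * sliding_interval_sec


steady = batch_interval_is_sustainable([1.0, 1.1, 1.0, 1.05], batch_interval_sec=2.0)
drifting = batch_interval_is_sustainable([1.0, 1.2, 1.5, 1.9], batch_interval_sec=2.0)
low, high = checkpoint_interval_range(sliding_interval_sec=10)
```

A job time that keeps growing signals a backlog even while each individual job still fits inside the batch interval, which is why the check looks at the trend as well as the maximum.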
Lambda Architecture
Lambda Architecture Overview 
• Batch Layer
  – Immutable, batch read, append-only write
  – Source of truth
  – e.g., HDFS
• Speed Layer
  – Mutable, random read/write
  – Most complex
  – Recent data only
  – e.g., Cassandra
• Serving Layer
  – Immutable, random read, batch write
  – e.g., ElephantDB
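The way the three layers cooperate at query time can be sketched in memory. This toy model counts events per key; every class and attribute name is hypothetical, and real deployments would back each layer with a separate system such as the ones listed above.

```python
from collections import Counter


class LambdaViews:
    """Toy in-memory model of the batch, speed, and serving layers."""

    def __init__(self):
        self.master_log = []          # batch layer: append-only source of truth
        self.batch_view = Counter()   # serving layer: replaced wholesale by each batch run
        self.speed_view = Counter()   # speed layer: mutable, recent data only

    def ingest(self, key):
        self.master_log.append(key)   # append-only write
        self.speed_view[key] += 1     # random write for low-latency reads

    def run_batch(self):
        """Recompute the batch view from the full log, then expire the speed data it covers."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        """Merge the precomputed batch view with the recent speed view."""
        return self.batch_view[key] + self.speed_view[key]


views = LambdaViews()
for key in ["a", "b", "a"]:
    views.ingest(key)
count_before_batch = views.query("a")   # served entirely from the speed view

views.run_batch()
views.ingest("a")
count_after_batch = views.query("a")    # batch view plus one recent event
```

Because the batch view is always rebuilt from the immutable log, errors in the speed layer are corrected the next time the batch job runs.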
Spark + AWS + Lambda
Spark + AWS + Lambda + ML
Approximations
Approximation Overview 
• Required for scaling
• Speed up analysis of large datasets
• Reduce size of working dataset
• Data is messy
• Collection of data is messy
• Exact isn’t always necessary
• “Approximate is the new Exact”
Some Approximation Methods 
• Approximate time-bound queries
  – BlinkDB
• Bernoulli and Poisson Sampling
  – RDD.sample(), RDD.takeSample()
• HyperLogLog
  – PairRDD: countApproxDistinctByKey()
• Count-min Sketch
• Spark Streaming and Twitter Algebird
• Bloom Filters
  – Everywhere!
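Two of the structures named above are small enough to sketch directly. The code below is a plain-Python illustration, not the Algebird or Spark implementation; the salted SHA-256 hashing scheme, the default sizes, and all names are assumptions chosen for readability rather than speed.

```python
import hashlib


def _hashes(item, count, modulus):
    """Derive `count` hash positions from salted SHA-256 digests."""
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def add(self, item):
        for pos in _hashes(item, self.num_hashes, self.num_bits):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in _hashes(item, self.num_hashes, self.num_bits))


class CountMinSketch:
    """Frequency estimates that can overcount but never undercount."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, pos in zip(self.rows, _hashes(item, self.depth, self.width)):
            row[pos] += count

    def estimate(self, item):
        return min(row[pos] for row, pos in zip(self.rows, _hashes(item, self.depth, self.width)))


seen = BloomFilter()
for user in ["alice", "bob"]:
    seen.add(user)

sketch = CountMinSketch()
for word in ["spark", "kinesis", "spark", "spark"]:
    sketch.add(word)
spark_estimate = sketch.estimate("spark")
```

Both structures use fixed memory regardless of how many items pass through them, trading a bounded error for that guarantee.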
Approximations In Action 
Figure: Memory Savings with Approximation Techniques 
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
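To make the memory trade-off concrete, here is a compact HyperLogLog sketch for distinct counting. It follows the standard published estimator (register maximum of leading-zero ranks, harmonic mean, small-range correction) but is an illustrative plain-Python version, not the implementation behind Spark's approximate-distinct operations; the class name and the choice of SHA-256 as the hash are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**precision small registers."""

    def __init__(self, precision=10):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")            # 64-bit hash value
        index = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1     # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)            # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)     # small-range correction
        return raw


hll = HyperLogLog(precision=10)
for i in range(20000):
    hll.add(f"user-{i}")
estimate = hll.count()
```

With 1024 registers the expected relative error is roughly 1.04 / sqrt(1024), about 3%, while an exact count would need to hold all 20,000 distinct values in a set.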
Spark Statistics Library 
• Correlations
  – Dependence between 2 random variables
  – Pearson, Spearman
• Hypothesis Testing
  – Measure of statistical significance
  – Chi-squared test
• Stratified Sampling
  – Sample separately from different sub-populations
  – Bernoulli and Poisson sampling
  – With and without replacement
• Random data generator
  – Uniform, standard normal, and Poisson distribution
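For readers who want to see what two of these statistics compute, the sketch below implements Pearson and Spearman correlation and a Bernoulli-style stratified sample in plain Python. It is not the Spark statistics library; the function names are hypothetical, and the ranking helper assumes no tied values to stay short.

```python
import math
import random


def pearson(xs, ys):
    """Linear correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Rank correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


def stratified_sample(rows, key, fractions, seed=0):
    """Bernoulli sampling without replacement, with a keep-probability per stratum."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key(row), 0.0)]


xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]            # monotonic but not linear in xs
linear = pearson(xs, ys)
monotonic = spearman(xs, ys)

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
only_a = stratified_sample(rows, key=lambda row: row[0], fractions={"a": 1.0, "b": 0.0})
```

The example shows the difference between the two measures: the relationship is perfectly monotonic, so Spearman reaches 1, while Pearson falls slightly short because the relationship is not a straight line.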
Summary 
• Spark, Spark Streaming Overview
• Use Cases
• API and Libraries
• Execution Model
• Fault Tolerance
• Cluster Deployment
• Monitoring
• Scaling and Tuning
• Lambda Architecture
• Approximations
Oct 2014 MEAP Early Access 
http://sparkinaction.com

More Related Content

What's hot

Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
ย 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
ย 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
ย 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
ย 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
ย 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
ย 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
ย 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
ย 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
ย 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
ย 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
ย 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
ย 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkEvan Chan
ย 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
ย 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng ShiDatabricks
ย 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
ย 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
ย 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
ย 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
ย 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
ย 

What's hot (20)

Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
ย 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
ย 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
ย 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
ย 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
ย 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
ย 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
ย 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
ย 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
ย 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
ย 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
ย 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
ย 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
ย 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
ย 
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
 Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi
ย 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
ย 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
ย 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
ย 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
ย 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
ย 

Similar to Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
ย 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache SparkVenkata Naga Ravi
ย 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark ComponentsGirish Khanzode
ย 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
ย 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
ย 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks
ย 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
ย 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
ย 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
ย 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Chris Fregly
ย 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
ย 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
ย 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
ย 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APIshareddatamsft
ย 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkRobert Sanders
ย 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Sparkclairvoyantllc
ย 
Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overviewAvi Levi
ย 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
ย 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
ย 

Similar to Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture (20)

Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
ย 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
ย 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
ย 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
ย 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
ย 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
ย 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
ย 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
ย 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
ย 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
ย 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015
ย 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
ย 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
ย 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
ย 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
ย 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
ย 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
ย 
Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overview
ย 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
ย 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
ย 

More from Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
ย 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfChris Fregly
ย 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupChris Fregly
ย 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedChris Fregly
ย 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
ย 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
ย 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketChris Fregly
ย 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-PersonChris Fregly
ย 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapChris Fregly
ย 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

  • 1. Spark Streaming and Friends Chris Fregly Global Big Data Conference Sept 2014 Kinesis Streaming
  • 2. Who am I?
      • Former Netflix’er: netflix.github.io
      • Spark Contributor: github.com/apache/spark
      • Founder: fluxcapacitor.com
      • Author: effectivespark.com, sparkinaction.com
  • 4. Quick Poll
      • Hadoop, Hive, Pig?
      • Spark, Spark streaming?
      • EMR, Redshift?
      • Flume, Kafka, Kinesis, Storm?
      • Lambda Architecture?
      • Bloom Filters, HyperLogLog?
  • 5. “Streaming” Kinesis Streaming Video Streaming Piping Big Data Streaming
  • 6. Agenda
      • Spark, Spark Streaming Overview
      • Use Cases
      • API and Libraries
      • Execution Model
      • Fault Tolerance
      • Cluster Deployment
      • Monitoring
      • Scaling and Tuning
      • Lambda Architecture
      • Approximations
  • 7. Spark Overview (1/2)
      • Berkeley AMPLab ~2009
      • Part of Berkeley Data Analytics Stack (BDAS, aka “badass”)
  • 8. Spark Overview (2/2)
      • Based on Microsoft Dryad paper ~2007
      • Written in Scala
      • Supports Java, Python, SQL, and R
      • In-memory when possible, not required
      • Improved efficiency over MapReduce
          – 100x in-memory, 2-10x on-disk
      • Compatible with Hadoop
          – File formats, SerDes, and UDFs
  • 9. Spark Use Cases
      • Ad hoc, exploratory, interactive analytics
      • Real-time + Batch Analytics
          – Lambda Architecture
      • Real-time Machine Learning
      • Real-time Graph Processing
      • Approximate, Time-bound Queries
  • 11. Unified Spark Libraries
      • Spark SQL (Data Processing)
      • Spark Streaming (Streaming)
      • MLlib (Machine Learning)
      • GraphX (Graph Processing)
      • BlinkDB (Approximate Queries)
      • Statistics (Correlations, Sampling, etc)
      • Others
          – Shark (Hive on Spark)
          – Spork (Pig on Spark)
  • 12. Unified Benefits
      • Advancements in higher-level libraries pushed down into core and vice-versa
      • Examples
          – Spark Streaming: GC and memory management improvements
          – Spark GraphX: IndexedRDD for random, hashed access within a partition versus scanning entire partition
  • 14. Resilient Distributed Dataset (RDD)
      • Core Spark abstraction
      • Represents partitions across the cluster nodes
      • Enables parallel processing on data sets
      • Partitions can be in-memory or on-disk
      • Immutable, recomputable, fault tolerant
      • Contains transformation lineage on data set
  • 16. Spark API Overview
      • Richer, more expressive than MapReduce
      • Native support for Java, Scala, Python, SQL, and R (mostly)
      • Unified API across all libraries
      • Operations = Transformations + Actions
  • 20. Spark Execution Model Overview
      • Parallel, distributed
      • DAG-based
      • Lazy evaluation
      • Allows optimizations
          – Reduce disk I/O
          – Reduce shuffle I/O
          – Parallel execution
          – Task pipelining
      • Data locality and rack awareness
      • Worker node fault tolerance using RDD lineage graphs per partition
  • 24. Master High Availability
      • Multiple Master Nodes
      • ZooKeeper maintains current Master
      • Existing applications and workers will be notified of new Master election
      • New applications and workers need to explicitly specify current Master
      • Alternatives (Not recommended)
          – Local filesystem
          – NFS Mount
  • 26. Spark Streaming Overview
      • Low latency, high throughput, fault-tolerance (mostly)
      • Long-running Spark application
      • Supports Flume, Kafka, Twitter, Kinesis, Socket, File, etc.
      • Graceful shutdown, in-flight message draining
      • Uses Spark Core, DAG Execution Model, and Fault Tolerance
  • 27. Spark Streaming Use Cases
      • ETL on streaming data during ingestion
      • Anomaly, malware, and fraud detection
      • Operational dashboards
      • Lambda architecture
          – Unified batch and streaming
          – e.g., different machine learning models for different time frames
      • Predictive maintenance
          – Sensors
      • NLP analysis
          – Twitter firehose
  • 28. Discretized Stream (DStream)
      • Core Spark Streaming abstraction
      • Micro-batches of RDDs
      • Operations similar to RDD
      • Fault tolerance using DStream/RDD lineage
  • 30. Spark Streaming API Overview
      • Rich, expressive API similar to core
      • Operations
          – Transformations
          – Actions
      • Window and State Operations
      • Requires checkpointing to snip long-running DStream lineage
      • Register DStream as a Spark SQL table for querying!
  • 33. Window and State DStream Operations
  • 39. Spark Streaming + Kinesis
  • 40. Spark Streaming + Kinesis Architecture
      [Diagram: Kinesis Producers feed a Kinesis Stream (Shard 1, Shard 2, Shard 3). A Spark Streaming, Kinesis Client Library Application consumes the stream through Kinesis Receiver DStream 1 (Kinesis Client Library, Kinesis Record Processor Threads 1 and 2) and Kinesis Receiver DStream 2 (Kinesis Client Library, Kinesis Record Processor Thread 1).]
  • 41. Throughput and Pricing
      [Diagram: Spark Kinesis Producer, Kinesis Stream (Shard 1), and Spark Kinesis Receiver inside a Spark Streaming Application, labeled as follows.]
      • < 10 second delay
      • 1 MB/sec per shard, 1000 PUTs/sec, 50K/PUT
      • 2 MB/sec per shard
      • Shard Cost: $0.36 per day per shard
      • PUT Cost: $2.50 per day per shard
      • Network Transfer Cost: Free within Region!!
  • 42. Demo! Kinesis Streaming
      https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/…
      Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
      Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
  • 44. Fault Tolerance
      • Points of Failure
          – Receiver
          – Driver
          – Worker/Processor
      • Solutions
          – Data Replication
          – Secondary/Backup Nodes
          – Checkpoints
  • 45. Streaming Receiver Failure
      • Use a backup receiver
      • Use multiple receivers pulling from multiple shards
          – Use checkpoint-enabled, sharded streaming source (e.g., Kafka and Kinesis)
      • Data is replicated to 2 nodes immediately upon ingestion
      • Possible data loss
      • Possible at-least once
      • Use buffered sources (e.g., Kafka and Kinesis)
  • 46. Streaming Driver Failure
      • Use a backup Driver
          – Use DStream metadata checkpoint info to recover
      • Single point of failure – interrupts stream processing
      • Streaming Driver is a long-running Spark application
          – Schedules long-running stream receivers
      • State and Window RDD checkpoints help avoid data loss (mostly)
  • 47. Stream Worker/Processor Failure
      • No problem!
      • DStream RDD partitions will be recalculated from lineage
  • 48. Types of Checkpoints
      Spark
      1. Spark checkpointing of StreamingContext DStreams and metadata
      2. Lineage of state and window DStream operations
      Kinesis
      3. Kinesis Client Library (KCL) checkpoints current position within shard
          – Checkpoint info is stored in DynamoDB per Kinesis application keyed by shard
  • 50. Monitoring
      • Monitor driver, receiver, worker nodes, and streams
      • Alert upon failure or unusually high latency
      • Spark Web UI
          – Streaming tab
      • Ganglia, CloudWatch
      • StreamingListener callback
  • 52. Tuning
      • Batch interval
          – High: reduce overhead of submitting new tasks for each batch
          – Low: keeps latencies low
          – Sweet spot: DStream job time (scheduling + processing) is steady and less than batch interval
      • Checkpoint interval
          – High: reduce load on checkpoint overhead
          – Low: reduce amount of data loss on failure
          – Recommendation: 5-10x sliding window interval
      • Use DStream.repartition() to increase parallelism of processing DStream jobs across cluster
      • Use spark.streaming.unpersist=true to let the Streaming Framework figure out when to unpersist
      • Use CMS GC for consistent processing times
  • 54. Lambda Architecture Overview
      • Batch Layer
          – Immutable, Batch read, Append-only write
          – Source of truth
          – e.g., HDFS
      • Speed Layer
          – Mutable, Random read/write
          – Most complex
          – Recent data only
          – e.g., Cassandra
      • Serving Layer
          – Immutable, Random read, Batch write
          – e.g., ElephantDB
  • 55. Spark + AWS + Lambda
  • 56. Spark + AWS + Lambda + ML
  • 58. Approximation Overview
      • Required for scaling
      • Speed up analysis of large datasets
      • Reduce size of working dataset
      • Data is messy
      • Collection of data is messy
      • Exact isn’t always necessary
      • “Approximate is the new Exact”
  • 59. Some Approximation Methods
      • Approximate time-bound queries
          – BlinkDB
      • Bernoulli and Poisson Sampling
          – RDD: sample(), RDD.takeSample()
      • HyperLogLog PairRDD: countApproxDistinctByKey()
      • Count-min Sketch
      • Spark Streaming and Twitter Algebird
      • Bloom Filters
          – Everywhere!
  • 60. Approximations In Action Figure: Memory Savings with Approximation Techniques (http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
  • 61. Spark Statistics Library
      • Correlations
          – Dependence between 2 random variables
          – Pearson, Spearman
      • Hypothesis Testing
          – Measure of statistical significance
          – Chi-squared test
      • Stratified Sampling
          – Sample separately from different sub-populations
          – Bernoulli and Poisson sampling
          – With and without replacement
      • Random data generator
          – Uniform, standard normal, and Poisson distribution
  • 62. Summary
      • Spark, Spark Streaming Overview
      • Use Cases
      • API and Libraries
      • Execution Model
      • Fault Tolerance
      • Cluster Deployment
      • Monitoring
      • Scaling and Tuning
      • Lambda Architecture
      • Approximations
      Oct 2014 MEAP Early Access http://sparkinaction.com
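
Slide 59 lists Bloom filters among the approximation methods, but the transcript does not show one. The block below is a minimal, illustrative sketch in plain Python, not Spark or Algebird code. The class name `BloomFilter`, its parameters, the `user-N` keys, and the choice of SHA-256 with double hashing are all assumptions made for this example; the sizing formulas are the standard ones for a target false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Hypothetical fixed-size Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Track 10,000 made-up user IDs at a target 1% false-positive rate.
seen = BloomFilter(expected_items=10_000, false_positive_rate=0.01)
for user_id in range(10_000):
    seen.add(f"user-{user_id}")
```

With these settings the filter occupies roughly 12 KB regardless of key length. That is the trade the approximation slides describe: a small, bounded error rate in exchange for a much smaller working set.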