Crowd Detector
Reza Asad
Insight Data Engineering June 2015
Motivation
● Avoid waiting time in crowded areas.
Data
● Let's imagine we had data about people's locations.
● This could be collected from people's cell phones.
● How can we use such data?
Naive Approach
Demo
Data
● But such data is not available to me ...
● Solution: engineer the data!
● Take data from Yelp
● Perform a random walk
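A minimal sketch of how a random walk could turn one seed location into a stream of synthetic positions. The slides do not say how the walk was implemented, so the function name `random_walk`, the four-direction step rule, the step size, and the seed coordinates are all assumptions made for illustration.

```python
import random

def random_walk(start, steps, step_size=0.0005, seed=None):
    """Return a list of (lat, lon) positions, beginning at `start`.

    Hypothetical helper: the step rule and step size are illustrative only.
    """
    rng = random.Random(seed)
    lat, lon = start
    path = [(lat, lon)]
    for _ in range(steps):
        # Move one step north, south, east, or west, chosen uniformly at random.
        dlat, dlon = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        lat += dlat * step_size
        lon += dlon * step_size
        path.append((lat, lon))
    return path

# Hypothetical seed location in San Francisco (not taken from the Yelp data).
path = random_walk((37.7749, -122.4194), steps=10, seed=42)
```

Each simulated person would start at a business location and emit one position per step, which gives the pipeline a continuous stream to consume.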
Pipeline
Data
Engineering Challenges
● Choosing k?
Engineering Challenges
● The area of SF: 46.87 mi²
● For the purpose of this project each cluster is 0.09 mi²
● This means k is roughly 500 (46.87 / 0.09 ≈ 521)
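The arithmetic behind k can be checked in a few lines. The two area figures come from the slide; the variable names and the choice to round up are assumptions.

```python
import math

SF_AREA_SQ_MI = 46.87       # area of San Francisco, from the slide
CLUSTER_AREA_SQ_MI = 0.09   # target area per cluster, from the slide

# Number of clusters needed to tile the city at the target cluster area.
k_exact = SF_AREA_SQ_MI / CLUSTER_AREA_SQ_MI
k = math.ceil(k_exact)

# Side length of a square cell with the same area, in miles.
side = math.sqrt(CLUSTER_AREA_SQ_MI)
```

The exact quotient is a little under 521, which the slide rounds to 500. A cluster of 0.09 mi² corresponds to a square about 0.3 mi on a side.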
Engineering Challenges
● Parameters to tune:
– Time it takes to produce the messages
– Processing time for k-means in Spark Streaming
– The update interval for a fixed data point in the database
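To make the k-means processing cost concrete, here is a plain-Python sketch of one mini-batch update. It is not the Spark Streaming code used in the project, which the slides do not show. The functions `nearest` and `update_centers` and the `decay` parameter are hypothetical; the update is an ordinary weighted average of the old center and the new points assigned to it.

```python
def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )

def update_centers(centers, counts, batch, decay=1.0):
    """One mini-batch update; returns new (centers, counts)."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    sizes = [0] * len(centers)
    for point in batch:
        i = nearest(point, centers)      # linear scan over all k centers
        sizes[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]
    new_centers, new_counts = [], []
    for c, n, s, m in zip(centers, counts, sums, sizes):
        weight = n * decay               # discounted weight of past data
        total = weight + m
        if m == 0:
            new_centers.append(list(c))  # no new points: keep the center
        else:
            new_centers.append([(c[d] * weight + s[d]) / total for d in range(dim)])
        new_counts.append(total)
    return new_centers, new_counts

centers, counts = update_centers(
    [[0.0, 0.0], [10.0, 10.0]], [0, 0], [(1, 1), (3, 3), (9, 9)]
)
```

Because every point is compared against every center, the work per batch grows with (points in the batch) × k, so a large k makes the processing time sensitive to the input rate.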
Goal
● Tune the parameters in order to have a stable system
● The total delay after processing each batch must be constant and comparable to the batch interval.
● You can check this in the Spark API
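The stability goal above can be expressed as a simple check over observed per-batch total delays. This is a heuristic sketch: the function name `is_stable`, the half-versus-half comparison, and the 50% tolerance are assumptions, and the delay values would have to be read from Spark's own reporting.

```python
from statistics import mean

def is_stable(total_delays, batch_interval, tolerance=0.5):
    """Heuristic check on per-batch total delays (all values in seconds).

    Stable means the delay is not growing over time and stays
    comparable to the batch interval. Thresholds are illustrative.
    """
    if len(total_delays) < 2:
        raise ValueError("need at least two batches")
    half = len(total_delays) // 2
    early = mean(total_delays[:half])
    late = mean(total_delays[half:])
    not_growing = late <= early * (1 + tolerance)
    comparable = late <= batch_interval * (1 + tolerance)
    return not_growing and comparable

steady = is_stable([2.1, 1.9, 2.0, 2.2], batch_interval=2.0)
falling_behind = is_stable([2.0, 4.0, 6.0, 8.0], batch_interval=2.0)
```

A delay that keeps climbing from batch to batch means batches are queuing up faster than they are processed, which is the unstable case the tuning is meant to avoid.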
Tackling Challenges
● Having multiple producers and consumers ✔
● Kafka sends messages quickly and is not the bottleneck
● Establishing some safe limits:
– Using spark.streaming.receiver.maxRate to control the input rate ✔
– Understanding the complexity of the process in Spark Streaming ✔
– Choosing the right batch interval ✔
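The interaction between the rate cap and the batch interval can be sketched as follows. `spark.streaming.receiver.maxRate` limits records per second per receiver, so it bounds how many records a batch can contain. The helper name and the numbers used here (1000 records/s, 2 s batches, 4 receivers) are hypothetical, not the project's settings.

```python
def max_records_per_batch(max_rate, batch_interval_s, num_receivers=1):
    """Upper bound on records in one batch under a per-receiver rate cap."""
    return max_rate * batch_interval_s * num_receivers

# Hypothetical setting; the slides do not give the value that was used.
settings = {"spark.streaming.receiver.maxRate": "1000"}

cap = max_records_per_batch(
    int(settings["spark.streaming.receiver.maxRate"]),
    batch_interval_s=2,
    num_receivers=4,
)
```

With a known upper bound on batch size, the batch interval can be chosen so that the worst-case k-means processing time still fits inside it.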
Raw Data
Data Process
● Data filtering in Spark Streaming
Data Process
About Me
● A long time ago - B.S. in pure math, University of Toronto
● More recently - M.S. in applied math, University of British Columbia
● The exciting now - A data engineer who wants to go camping with other data engineers