A REFERENCE ARCHITECTURE FOR INTERNET OF THINGS
SUJEE MANIYAM
FOUNDER / PRINCIPAL @ ELEPHANTSCALE
SUJEE@ELEPHANTSCALE.COM
(c) Elephant Scale 2015
HI, I’M SUJEE MANIYAM
•  Founder / Principal @ ElephantScale
•  Consulting & Training in Big Data
•  Training in Spark / Hadoop / NoSQL / Data Science
•  Author
    •  “Hadoop Illuminated” open source book
    •  “HBase Design Patterns”
•  Open Source contributor: github.com/sujee
•  sujee@elephantscale.com
•  http://sujee.net
Spark
Training
available!
INTERNET OF THINGS – A REALITY
DATA INFRASTRUCTURE?
DATA VOLUME ?
A NAPKIN CALCULATION
Say we have
•  Million sensors
•  Each sensor reports every minute
•  data size 1KB
This will result in:
•  1.44 billion events / day!
•  1.44 TB / day!!
SENSOR DATA WORKSHEET
IoT Temperature Sensor Projection
variable                     description            value
sensors                      1 million              1.00E+06
signal frequency             every minute           60 secs
event size                   1 KB                   1000 bytes
events per day per sensor                           1,440
total events per day         1.44 billion           1.44E+09
total events / sec                                  16,666.67
total data size per day      1.44 TB (1,440 GB)     1.44E+12 bytes
SENSOR DATA : TEXAS UTILITIES SMART METER DATA
Texas Smart Meter Projections
variable                     description             value
sensors                      10 million customers    1.00E+07
signal frequency             every 15 mins           900 secs
event size                   1.4 KB                  1400 bytes
events per day per sensor                            96
total events per day         0.96 billion            9.60E+08
total events / sec                                   11,111.11
total data size per day      1.344 TB (1,344 GB)     1.34E+12 bytes
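Both worksheets follow the same arithmetic, so it can be captured in one small function. The following is a minimal Python sketch, assuming 1 KB = 1000 bytes as the worksheets do; the function name `project_load` and its parameters are illustrative and do not come from any tool named in this deck.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def project_load(sensors, interval_secs, event_bytes):
    """Return daily event count, events per second, and daily bytes."""
    events_per_sensor_per_day = SECONDS_PER_DAY // interval_secs
    events_per_day = sensors * events_per_sensor_per_day
    return {
        "events_per_day": events_per_day,
        "events_per_sec": events_per_day / SECONDS_PER_DAY,
        "bytes_per_day": events_per_day * event_bytes,
    }


# Temperature sensors: 1 million sensors, one 1000-byte event per minute.
temperature = project_load(1_000_000, 60, 1000)
# Texas smart meters: 10 million meters, one 1400-byte event every 15 minutes.
smart_meters = project_load(10_000_000, 900, 1400)

for name, load in (("temperature", temperature), ("smart meters", smart_meters)):
    print(name, load["events_per_day"], round(load["events_per_sec"], 2),
          load["bytes_per_day"] / 1e12, "TB/day")
```

Changing the reporting interval or event size shows how quickly the daily volume moves, which is the point of the worksheet.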
BIG ‘DATA’
DATA VELOCITY
Say we have
•  Million sensors
•  Each sensor reports every minute
•  data size 1KB
This results in:
•  1 million events / minute
•  ~ 17,000 events / sec
DATA PROCESSING SPEED
•  Need (near) real time processing most of the time
•  E.g. Need to alert if temperature suddenly spikes
[Figure: temperature over time, with an alert at the spike]
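One way to picture the temperature-spike alert is a detector that compares each reading with the average of the last few readings. This is only a sketch under stated assumptions: the class name `SpikeDetector`, the window of 5 readings, and the 5-degree threshold are all hypothetical choices, not values from the deck.

```python
from collections import deque


class SpikeDetector:
    """Flags a reading that jumps well above the recent average."""

    def __init__(self, window=5, max_jump=5.0):
        self.recent = deque(maxlen=window)  # last `window` readings
        self.max_jump = max_jump            # allowed rise over the average

    def check(self, temperature):
        """Return True if this reading should raise an alert."""
        alert = False
        # Only judge a reading once a full window of history exists.
        if len(self.recent) == self.recent.maxlen:
            baseline = sum(self.recent) / len(self.recent)
            alert = temperature - baseline > self.max_jump
        self.recent.append(temperature)
        return alert


detector = SpikeDetector(window=5, max_jump=5.0)
readings = [70.0, 70.5, 69.8, 70.2, 70.1, 70.3, 82.0, 71.0]
alerts = [t for t in readings if detector.check(t)]
print(alerts)
```

In a real deployment one detector per sensor would be kept, and the check would run inside the stream processor so the alert fires within seconds of the reading.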
CHALLENGE = BIG DATA + REAL TIME
•  Don’t lose events!
    •  Any event could be important
    •  Most events are mundane (e.g. temperature stays between 68°F and 72°F)
•  Process them in near real time
•  Store the events for a long time
    •  Audit
    •  Diagnose
•  Support various queries
    •  Real time (what is the latest temperature for sensor id 123?)
    •  Aggregate (what is the avg. temp in zipcode 12345?)
HIGH LEVEL ARCHITECTURE
NEXT : (1) CAPTURE
(1) CAPTURE
REQUIREMENTS
Requirements:
•  Capture events coming in at high speed
    •  Tens of thousands of events / sec (sometimes millions / sec)
•  Don’t lose events
    •  Tolerate hardware / software failures
    •  Tolerate intermittent connectivity issues
•  Scale ‘easily’
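The "don't lose events" and "intermittent connectivity" requirements can be illustrated with a sender that keeps events locally until delivery succeeds. This is a rough Python sketch, not how any particular collector works; `BufferedSender` and `FlakyCollector` are hypothetical names, and the collector is an in-memory stand-in for a real network endpoint.

```python
from collections import deque


class BufferedSender:
    """Keeps events locally until the collector accepts them."""

    def __init__(self, send):
        self.send = send          # callable; raises ConnectionError on failure
        self.pending = deque()    # events not yet delivered, oldest first

    def submit(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        """Send pending events in order; stop at the first failure."""
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return False      # keep the event and retry on the next flush
            self.pending.popleft()
        return True


class FlakyCollector:
    """Hypothetical collector stand-in whose link can be switched off."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, event):
        if not self.up:
            raise ConnectionError("collector unreachable")
        self.received.append(event)


collector = FlakyCollector()
sender = BufferedSender(collector.send)
sender.submit({"sensor": 123, "temp": 70.1})
collector.up = False
sender.submit({"sensor": 123, "temp": 70.4})   # buffered, not lost
collector.up = True
sender.flush()                                  # retried after reconnect
print(len(collector.received), "events delivered")
```

A retry scheme like this can deliver the same event twice if a send succeeds but the sender never learns of it, which is why downstream processing usually has to tolerate duplicates.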
(1) CAPTURE
CHOICES
•  MQ (RabbitMQ, etc.)
    •  Good adoption in enterprises / durable
•  Fluentd
    •  Data collector for various sources
•  Flume
    •  Part of the Hadoop ecosystem
    •  Good for collecting logs from many sources
•  AWS Kinesis
    •  Queue system in the Amazon cloud
•  Kafka
    •  Distributed queue
(1) CAPTURE
MEET KAFKA
•  Apache Kafka is a distributed messaging system
•  Came out of LinkedIn; open sourced in 2011
•  Built to tolerate hardware / software / network failures
•  Built for high throughput and scale
    •  LinkedIn: 220 billion messages / day
    •  At peak: 3+ million messages / sec
(1) CAPTURE
KAFKA ARCHITECTURE
•  Publisher - subscriber / producer – consumer model
(1) CAPTURE
KAFKA ARCHITECTURE
•  Producers write data to brokers
•  Consumers read data from brokers
•  All of this is distributed / parallel
•  Failure tolerant
•  Data is stored as topics
    •  “sensor_data”
    •  “alerts”
    •  “emails”
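The producer / broker / consumer / topic model above can be mimicked with a toy in-memory broker. To be clear, this is not the Kafka client API: `ToyBroker` is a hypothetical class written only to show the idea of append-only topics with a separate read position per consumer. Real Kafka also partitions topics, replicates them across brokers, and persists them to disk.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self.topics = defaultdict(list)    # topic name -> list of messages
        self.offsets = defaultdict(int)    # (consumer, topic) -> next index

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, consumer, topic, max_messages=10):
        """Return unread messages for this consumer and advance its offset."""
        start = self.offsets[(consumer, topic)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(consumer, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
broker.produce("sensor_data", {"sensor": 123, "temp": 70.1})
broker.produce("sensor_data", {"sensor": 456, "temp": 91.5})
broker.produce("alerts", "sensor 456 over threshold")

# Two independent consumers each see every message in the topic.
storage_batch = broker.consume("storage-writer", "sensor_data")
dashboard_batch = broker.consume("dashboard", "sensor_data")
second_read = broker.consume("dashboard", "sensor_data")  # nothing new yet
print(len(storage_batch), len(dashboard_batch), len(second_read))
```

Because messages stay in the log after being read, a new consumer (or one recovering from a crash) can start from an earlier offset and replay events.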
(1) CAPTURE
KAFKA USERS
•  LinkedIn
    •  Track user activities
    •  Sending emails
•  Netflix
    •  Real time monitoring
•  Spotify
    •  Ship logs to Hadoop
•  Uber, Airbnb, …
(1) CAPTURE
ARCHITECTURE WITH KAFKA
NEXT : (2) PROCESSING
(2) PROCESSING
REQUIREMENTS
•  Process events in real time or near real time
•  High velocity
    •  Tens of thousands to millions of events / sec
•  Guaranteed processing
    •  Process an event at-least-once
    •  Exactly-once (harder to achieve)
•  Failure tolerant
•  Scale ‘easily’
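A common way to live with at-least-once delivery is to make processing idempotent, so that a redelivered event has no extra effect. The sketch below assumes every event carries a unique `id` field, which is an assumption of this example and not something the deck specifies; `IdempotentCounter` is likewise a hypothetical name.

```python
class IdempotentCounter:
    """Counts events per sensor, ignoring redelivered duplicates."""

    def __init__(self):
        self.seen_ids = set()   # ids of events already applied
        self.counts = {}        # sensor id -> number of distinct events

    def process(self, event):
        """Apply the event once; return False if it was a duplicate."""
        if event["id"] in self.seen_ids:
            return False
        self.seen_ids.add(event["id"])
        self.counts[event["sensor"]] = self.counts.get(event["sensor"], 0) + 1
        return True


# At-least-once delivery: event "e2" arrives twice after a simulated retry.
delivered = [
    {"id": "e1", "sensor": 123},
    {"id": "e2", "sensor": 123},
    {"id": "e2", "sensor": 123},   # redelivery
    {"id": "e3", "sensor": 456},
]
counter = IdempotentCounter()
results = [counter.process(e) for e in delivered]
print(results, counter.counts)
```

In practice the set of seen ids has to be bounded and stored durably, which is part of why exactly-once behavior is harder to achieve.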
(2) PROCESSING
CHOICES
•  Storm
    •  Process streams
    •  Event based
    •  Came out of Twitter
•  Apache Samza
    •  Stream processing framework based on Kafka + Hadoop YARN
•  Apache NiFi
    •  Data flow
    •  New project / incubating
•  Spark Streaming
(2) PROCESSING
MEET SPARK
•  Spark is the new darling of the ‘Big Data’ world
    •  Lots of activity and interest
•  Fast and expressive cluster compute engine
•  “First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com
[Figure: Hadoop, Spark]
(2) PROCESSING
SPARK ECO-SYSTEM
•  Spark Core
•  Spark SQL: schema / SQL
•  Spark Streaming: real time
•  MLlib: machine learning
•  GraphX: graph processing
•  Cluster managers: standalone, YARN, Mesos
(2) PROCESSING
SPARK DATA SOURCE
ABSTRACTION
[Figure: Spark (compute engine) reads data sources (HDFS, Amazon S3, Cassandra, ???) through RDD abstractions (RDD, Hadoop RDD, Cassandra RDD)]
(2) PROCESSING
SPARK : ‘UNIFIED’ STACK
Spark supports multiple programming models
•  MapReduce-style batch processing
•  Streaming / real time processing
•  Querying via SQL
•  Machine learning
All modules are tightly integrated
•  Facilitates rich applications
Spark can be the only stack you need!
Image: buymeposters.com
(2) PROCESSING
SPARK STREAMING
[Figure: streaming sources → Spark Streaming → storage]
(2) PROCESSING
SPARK STREAMING
•  Provides ‘high level’ operations in time windows
•  E.g. ‘calculate X for the past 10 seconds’
•  Good adoption
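To make "calculate X for the past 10 seconds" concrete, here is a plain Python sliding-window average. It is not the Spark API: Spark Streaming works on micro-batches and exposes window operations on DStreams (for example `window` and `reduceByKeyAndWindow`), whereas the hypothetical `SlidingWindowAverage` below processes one event at a time and assumes timestamps arrive in order.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the values seen in the last `window_secs` seconds."""

    def __init__(self, window_secs=10):
        self.window_secs = window_secs
        self.events = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record a reading and return the average over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict readings that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_secs:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)


window = SlidingWindowAverage(window_secs=10)
stream = [(0, 70.0), (4, 72.0), (9, 74.0), (12, 76.0)]
averages = [window.add(t, v) for t, v in stream]
print(averages)
```

At timestamp 12 the reading from timestamp 0 is more than 10 seconds old, so it drops out and the average is taken over the remaining three readings.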
(2) PROCESSING
ARCHITECTURE WITH SPARK STREAMING
NEXT : (3) STORAGE
(3) STORAGE
REQUIREMENTS
•  Handle ‘Big Data’ (1 TB / day!)
•  Traditional storage is not effective (or too expensive)
•  Need two types of storage
    1.  ‘Forever’ storage
        •  Store multiple terabytes of data for long periods
        •  Support batch queries
    2.  ‘Fast / real-time lookup’ storage
        •  Query in real time (milliseconds): “what is the latest reading for sensor-123?”
        •  Store latest / new data (e.g. last 3 months)
        •  Flexible schema for semi-structured data
•  Both need to scale
(3) STORAGE
REQUIREMENTS
(3) STORAGE
CHOICES
•  ‘Forever’ storage
    •  Scalable distributed file systems
    •  Hadoop! (HDFS, actually)
•  ‘Real time’ store
    •  Traditional RDBMS won’t work
        •  Doesn’t scale well (or too expensive)
        •  Rigid schema layout
    •  NoSQL!
(3) STORAGE
HDFS (IN 20 SECS)
•  Distributed file system
•  Runs on commodity servers
    •  → high ROI
•  Can keep ticking even when nodes go down
    •  → fault tolerant
•  Replicates data to prevent data loss in case of node failures
    •  → built-in backup :)
•  Scales to petabytes (horizontal scalability)
•  Proven in the field
(3) STORAGE
HDFS ARCHITECTURE
(3) STORAGE
COST OF BIG DATA
Source: Hortonworks
(3) STORAGE
HDFS
•  Can handle big data
•  Scales easily
•  Cost effective
•  “Source of Truth”
•  Files are immutable within HDFS (new data is ‘appended’ )
•  Audit friendly
(3) STORAGE
CAPACITY PLANNING (HADOOP)
Variables (tweak these)      description                    value
average daily ingest                                        1000 GB
raw data node storage        e.g. 12 disks x 3 TB           36 TB
replication                  default 3                      3
space allocated for HDFS     HDFS 75% + MapReduce 25%       75%
growth per month             (not calculated)               0

Calculation
effective data storage per node: 27 TB

growth               1 month    6 months    1 yr     2 yr
data size (TB)       90         540         1,080    2,160
# data nodes         3.33       20          40       80
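The capacity worksheet reduces to two formulas: stored data = days × daily ingest × replication, and nodes = stored data ÷ usable storage per node. A small Python sketch of that arithmetic follows; the function name `nodes_needed` and its parameter names are illustrative, it uses 30-day months as the worksheet appears to, and it rounds node counts up since a fraction of a node cannot be bought.

```python
import math


def nodes_needed(days, daily_ingest_tb=1.0, replication=3,
                 raw_tb_per_node=36.0, hdfs_fraction=0.75):
    """Return (stored TB, data nodes required) for `days` of ingest."""
    stored_tb = days * daily_ingest_tb * replication
    usable_tb_per_node = raw_tb_per_node * hdfs_fraction  # 27 TB here
    return stored_tb, math.ceil(stored_tb / usable_tb_per_node)


# The worksheet's horizons, using 30-day months.
plan = {label: nodes_needed(days)
        for label, days in (("1 month", 30), ("6 months", 180),
                            ("1 year", 360), ("2 years", 720))}

for label, (stored_tb, nodes) in plan.items():
    print(f"{label}: {stored_tb:.0f} TB stored, {nodes} data nodes")
```

The one-month figure comes out as 4 nodes instead of 3.33 because of the rounding; the other horizons match the table exactly.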
(3) STORAGE (REAL TIME)
CHOICES FOR NOSQL
•  Too many! :)
•  HBase
    •  Part of the Hadoop ecosystem
    •  Uses HDFS for storage
    •  Provides a consistent view of data
•  Cassandra
    •  Popular NoSQL store
    •  No single point of failure (SPOF): ring architecture
    •  No dependency on Hadoop
•  Accumulo
    •  Came out of the NSA!
    •  Uses HDFS for storage
    •  Provides very good security (naturally!)
(3) STORAGE
CAP THEOREM
(3) STORAGE
ARCHITECTURE SO FAR
NEXT : (4) QUERY
(4) QUERY
REQUIREMENTS
•  Real time queries
    •  “What is the latest reading for sensor id = 123?”
    •  Useful for building applications / dashboards
    •  Latency: milliseconds
•  Batch / aggregate queries
    •  “What is the average temperature in zip code = 12345?”
    •  May need to go through a large number of data points
    •  Latency: ‘batch’ (minutes / hours)
(4) QUERY
SOLUTIONS
•  Batch queries
    •  Query data in HDFS (and/or NoSQL)
    •  Hadoop MapReduce (Pig / Hive)
    •  Spark batch analytics
•  Real time queries
    •  Queries go to the NoSQL store
[Figure: batch queries go to HDFS; real time queries go to NoSQL]
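The difference between the two query types can be shown with plain Python data structures: a key lookup against a "latest reading" table versus a full scan that aggregates every event. This is only an illustration; the event field names (`sensor`, `zip`, `ts`, `temp`) and the sample values are made up, and a dictionary stands in for the NoSQL store while a list stands in for files in HDFS.

```python
from collections import defaultdict

# Hypothetical events; field names and values are illustrative.
events = [
    {"sensor": 123, "zip": "12345", "ts": 1, "temp": 70.0},
    {"sensor": 456, "zip": "12345", "ts": 1, "temp": 74.0},
    {"sensor": 123, "zip": "12345", "ts": 2, "temp": 71.0},
    {"sensor": 789, "zip": "54321", "ts": 2, "temp": 65.0},
]

# Real time store: keep only the newest reading per sensor (key lookup).
latest = {}
for e in events:
    if e["sensor"] not in latest or e["ts"] >= latest[e["sensor"]]["ts"]:
        latest[e["sensor"]] = e


def latest_reading(sensor_id):
    """Real time query: one key lookup, independent of history size."""
    return latest[sensor_id]["temp"]


def average_by_zip(all_events):
    """Batch query: scan every event and average per zip code."""
    sums = defaultdict(lambda: [0.0, 0])   # zip -> [sum of temps, count]
    for e in all_events:
        sums[e["zip"]][0] += e["temp"]
        sums[e["zip"]][1] += 1
    return {z: total / n for z, (total, n) in sums.items()}


print(latest_reading(123))
print(average_by_zip(events))
```

The lookup cost stays flat as history grows, while the scan grows with the data, which is why the two query types are served by different stores.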
(4) QUERY
ARCHITECTURE
FINAL ARCHITECTURE
LAMBDA ARCHITECTURE
LAMBDA ARCHITECTURE EXPLAINED
1.  All new data is sent to both the batch layer and the speed layer
2.  Batch layer
    •  Holds the master data set (immutable, append-only)
    •  Answers batch queries
3.  Serving layer
    •  Updates batch views so they can be queried ad hoc
4.  Speed layer
    •  Handles new data
    •  Facilitates fast / real-time queries
5.  Query layer
    •  Answers queries using batch & real-time views
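The query layer's job of combining batch and real-time views can be sketched in a few lines of Python. The view contents, the `sensor-…` keys, and the function name `average_temperature` are all hypothetical; the sketch only assumes that both views store partial aggregates (a count and a sum) that can be added together.

```python
# Batch view: recomputed periodically from the immutable master data set.
batch_view = {"sensor-123": {"count": 1440, "sum": 100800.0}}
# Real-time view: covers only events that arrived after the last batch run.
realtime_view = {"sensor-123": {"count": 3, "sum": 213.0},
                 "sensor-456": {"count": 1, "sum": 69.5}}


def average_temperature(sensor_id):
    """Answer a query by combining the batch and real-time views."""
    count, total = 0, 0.0
    for view in (batch_view, realtime_view):
        part = view.get(sensor_id)
        if part:
            count += part["count"]
            total += part["sum"]
    return total / count if count else None


print(average_temperature("sensor-123"))   # uses both views
print(average_temperature("sensor-456"))   # only in the real-time view so far
print(average_temperature("sensor-999"))   # unknown sensor
```

Storing sums and counts instead of finished averages is what makes the two views mergeable; once the next batch run absorbs the recent events, the real-time view can be reset.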
INCORPORATING LAMBDA ARCHITECTURE
OUR ARCHITECTURE
•  Each component is scalable
•  Each component is fault tolerant
•  Incorporates best practices
•  All open source !
… AND ONE MORE THING…
•  Security !
Source : businessinsider.com
HOWEVER…
At scale nothing works as advertised!
GOOD NEWS !
•  We are building an open source reference data platform for IoT / connected devices!
•  Yes, open source! :)
•  ElephantScale is a strong believer in open source
    •  “Hadoop Illuminated”: open source Hadoop book
    •  github.com/elephantscale
•  Best practices
    •  Bringing together lots of expertise in Big Data systems
•  Register your interest: http://elephantscale.com/iotx/
GOALS FOR IOTX
http://elephantscale.com/iotx/
•  Use proven open source components
•  Capture:
    •  Kafka
    •  Kinesis (AWS)
•  Processing: Spark Streaming
•  Batch storage: Hadoop / HDFS
•  Real time store: support multiple data stores
    •  Cassandra
    •  HBase
    •  Accumulo
    •  ???
GOALS FOR IOTX…
http://elephantscale.com/iotx/
•  Query templates using
    •  Spark
    •  Hadoop MapReduce (Pig / Hive)
•  Incorporate third party libraries for
    •  Outlier detection (temperature is outside norms)
    •  Trend detection (stock price is trending up)
    •  Alerts (fire!)
•  Monitoring & metrics (key!!)
    •  What’s going on in the system?
    •  Host / system level (CPU / network, etc.): easier
    •  Application level (e.g. find slow queries): harder
    •  Incorporate third party libraries
THANKS AND QUESTIONS?
“A Reference Architecture for Internet of Things (IoT)”
Sujee Maniyam
Founder / Principal @ ElephantScale
Expert Consulting + Training in Big Data technologies
sujee@elephantscale.com
Elephantscale.com
Project sign up page : http://elephantscale.com/iotx/
IMAGE CREDITS
•  www.engadget.com
•  Xfinity.com
•  Tesla.com

Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

IoT Reference Architecture Using Kafka, Spark & HBase

  • 1. A REFERENCE ARCHITECTURE FOR INTERNET OF THINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANTSCALE SUJEE@ELEPHANTSCALE.COM (c) Elephant Scale 2015
  • 2. HI, I’M SUJEE MANIYAM
      •  Founder / Principal @ ElephantScale
      •  Consulting & Training in Big Data
      •  Training in Spark / Hadoop / NoSQL / Data Science
      •  Author
          •  “Hadoop Illuminated” open source book
          •  “HBase Design Patterns”
      •  Open Source contributor: github.com/sujee
      •  sujee@elephantscale.com
      •  http://sujee.net
      •  Spark training available!
  • 3. INTERNET OF THINGS – A REALITY
  • 5. DATA VOLUME? A NAPKIN CALCULATION
      Say we have:
      •  1 million sensors
      •  Each sensor reports every minute
      •  Event size: 1 KB
      This will result in:
      •  1.44 billion events / day!
      •  1.44 TB / day!!
  • 6. SENSOR DATA WORKSHEET
      IoT Temperature Sensor Projection

      variable                     description     value
      sensors                      1M              1.00E+06 (1 million)
      signal frequency             every min       60 secs
      event size                   1 KB            1000 bytes
      events per day per sensor                    1440
      total events per day                         1.44E+09 (1,440 million = 1.44 billion)
      total events / sec                           1.67E+04 (16,666.67)
      total data size per day                      1.44E+12 bytes (1,440 GB = 1.44 TB)
  • 7. SENSOR DATA: TEXAS UTILITIES SMART METER DATA
      Texas Smart Meter Projections

      variable                     description             value
      sensors                      10 million customers    1.00E+07 (10 million)
      signal frequency             every 15 mins           900 secs
      event size                   1.4 KB                  1400 bytes
      events per day per sensor                            96
      total events per day                                 9.60E+08 (960 million = 0.96 billion)
      total events / sec                                   1.11E+04 (11,111.11)
      total data size per day                              1.34E+12 bytes (1,344 GB = 1.344 TB)
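The two worksheets above follow the same arithmetic, so they can be reproduced with a few lines of Python. This is only a sketch of the napkin calculation; the function and field names (`project_load`, `LoadProjection`) are made up for illustration, and 1 KB is taken as 1000 bytes, as in the worksheet.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class LoadProjection:
    events_per_day: float
    events_per_sec: float
    bytes_per_day: float


def project_load(sensors: int, interval_secs: int, event_bytes: int) -> LoadProjection:
    """Project daily event count, average event rate and daily data volume."""
    events_per_sensor_per_day = SECONDS_PER_DAY / interval_secs
    events_per_day = sensors * events_per_sensor_per_day
    return LoadProjection(
        events_per_day=events_per_day,
        events_per_sec=events_per_day / SECONDS_PER_DAY,
        bytes_per_day=events_per_day * event_bytes,
    )


# Inputs taken from the two worksheets.
temperature = project_load(sensors=1_000_000, interval_secs=60, event_bytes=1000)
smart_meters = project_load(sensors=10_000_000, interval_secs=900, event_bytes=1400)

for name, p in [("temperature sensors", temperature), ("smart meters", smart_meters)]:
    print(f"{name}: {p.events_per_day:.3g} events/day, "
          f"{p.events_per_sec:,.2f} events/sec, {p.bytes_per_day / 1e12:.3f} TB/day")
```

Note that these are averages over a day; the projection says nothing about bursts, which the capture layer also has to absorb.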
  • 9. DATA VELOCITY
      Say we have:
      •  1 million sensors
      •  Each sensor reports every minute
      •  Event size: 1 KB
      →
      •  1 million events / minute
      •  ~17,000 events / sec
  • 10. DATA PROCESSING SPEED
      •  Need (near) real-time processing most of the time
      •  E.g. need to alert if temperature suddenly spikes
      [Chart: temperature over time, with an alert marked]
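One simple way to read "alert if temperature suddenly spikes" is to compare each reading with the previous reading from the same sensor. The sketch below assumes that interpretation; `SpikeDetector`, `max_jump` and the alert string format are hypothetical names chosen for illustration, not something the deck specifies.

```python
from typing import Dict, Optional


class SpikeDetector:
    """Flags a reading that rises more than `max_jump` degrees above the
    previous reading from the same sensor."""

    def __init__(self, max_jump: float) -> None:
        self.max_jump = max_jump
        self._last: Dict[str, float] = {}  # sensor id -> last temperature seen

    def observe(self, sensor_id: str, temp: float) -> Optional[str]:
        previous = self._last.get(sensor_id)
        self._last[sensor_id] = temp
        if previous is not None and temp - previous > self.max_jump:
            return f"ALERT {sensor_id}: {previous} -> {temp}"
        return None  # first reading for this sensor, or no spike


detector = SpikeDetector(max_jump=5.0)
for sensor_id, temp in [("s1", 70.0), ("s1", 71.0), ("s1", 80.0)]:
    alert = detector.observe(sensor_id, temp)
    if alert:
        print(alert)
```

The detector keeps only one value per sensor, so its memory grows with the number of sensors rather than with the number of events.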
  • 11. CHALLENGE = BIG DATA + REAL TIME
      •  Don’t lose events!
          •  Any event could be important
          •  Most events are mundane (e.g. temperature stays between 68°F and 72°F)
      •  Process them in near real time
      •  Store the events for a long time
          •  Audit
          •  Diagnose
      •  Support various queries
          •  Real time (what is the latest temperature for sensor id 123?)
          •  Aggregate (what is the avg. temp in zipcode 12345?)
  • 12. HIGH LEVEL ARCHITECTURE
  • 13. NEXT : (1) CAPTURE
  • 14. (1) CAPTURE: REQUIREMENTS
      •  Capture events coming in at high speed
          •  Tens of thousands of events / sec (sometimes millions / sec)
      •  Don’t lose events
          •  Tolerate hardware / software failure
          •  Tolerate intermittent connectivity issues
      •  Scale ‘easily’
  • 15. (1) CAPTURE: CHOICES
      •  MQ (RabbitMQ, etc.)
          •  Good adoption in enterprises / durable
      •  Fluentd
          •  Data collector for various sources
      •  Flume
          •  Part of the Hadoop ecosystem
          •  Good for collecting logs from many sources
      •  AWS Kinesis
          •  Queue system in the Amazon cloud
      •  Kafka
          •  Distributed queue
  • 16. (1) CAPTURE: MEET KAFKA
      •  Apache Kafka is a distributed messaging system
      •  Came out of LinkedIn; open sourced in 2011
      •  Built to tolerate hardware / software / network failures
      •  Built for high throughput and scale
          •  LinkedIn: 220 billion messages / day
          •  At peak: 3+ million messages / sec
  • 17. (1) CAPTURE: KAFKA ARCHITECTURE
      •  Publisher-subscriber / producer-consumer model
  • 18. (1) CAPTURE: KAFKA ARCHITECTURE
      •  Producers write data to brokers
      •  Consumers read data from brokers
      •  All of this is distributed / parallel
      •  Failure tolerant
      •  Data is stored as topics
          •  “sensor_data”
          •  “alerts”
          •  “emails”
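The producer / broker / consumer model above can be pictured with a toy, single-process model: each topic is an append-only log, and each consumer remembers its own read position (offset). This is not the Kafka client API; `ToyBroker` and `ToyConsumer` are invented for illustration, and the sketch ignores partitions, replication and persistence entirely.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[Any]] = defaultdict(list)

    def publish(self, topic: str, message: Any) -> int:
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1  # offset of the message just written

    def fetch(self, topic: str, offset: int, max_messages: int = 100) -> List[Tuple[int, Any]]:
        log = self._logs[topic]
        end = min(len(log), offset + max_messages)
        return [(i, log[i]) for i in range(offset, end)]


class ToyConsumer:
    """Reads one topic and tracks its own offset, independently of other consumers."""

    def __init__(self, broker: ToyBroker, topic: str) -> None:
        self.broker = broker
        self.topic = topic
        self.offset = 0  # next offset to read

    def poll(self, max_messages: int = 100) -> List[Any]:
        batch = self.broker.fetch(self.topic, self.offset, max_messages)
        if batch:
            self.offset = batch[-1][0] + 1
        return [message for _, message in batch]


broker = ToyBroker()
broker.publish("sensor_data", {"sensor_id": 123, "temp": 70})
broker.publish("alerts", "temperature spike on sensor 123")

dashboard = ToyConsumer(broker, "sensor_data")
archiver = ToyConsumer(broker, "sensor_data")
print(dashboard.poll())  # both consumers see the same message,
print(archiver.poll())   # because reading does not remove it from the log
```

Because messages stay in the log after being read, a consumer that falls behind or restarts can resume from its last offset instead of losing events.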
  • 19. (1) CAPTURE: KAFKA USERS
      •  LinkedIn
          •  Track user activities
          •  Sending emails
      •  Netflix
          •  Real-time monitoring
      •  Spotify
          •  Ship logs to Hadoop
      •  Uber, Airbnb, ...
  • 20. (1) CAPTURE: ARCHITECTURE WITH KAFKA
  • 21. NEXT : (2) PROCESSING
  • 22. (2) PROCESSING: REQUIREMENTS
      •  Process events in real time or near real time
      •  High velocity
          •  Tens of thousands → millions of events / sec
      •  Guaranteed processing
          •  Process an event at least once
          •  Exactly once (harder to achieve)
      •  Failure tolerant
      •  Scale ‘easily’
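With at-least-once processing, the same event may be delivered more than once (for example after a retry), so a naive counter would over-count. A common way to get an exactly-once *effect* on top of at-least-once delivery is to make the update idempotent by remembering which event ids were already applied. The sketch below assumes every event carries a unique id, which the deck does not state; the names are illustrative.

```python
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, str]  # (event_id, sensor_id): hypothetical event layout


def apply_events(events: Iterable[Event], seen_ids: Set[str], counts: Dict[str, int]) -> int:
    """Count events per sensor, skipping ids that were already applied.

    Returns the number of events applied in this call."""
    applied = 0
    for event_id, sensor_id in events:
        if event_id in seen_ids:
            continue  # redelivered event: already counted
        counts[sensor_id] = counts.get(sensor_id, 0) + 1
        seen_ids.add(event_id)
        applied += 1
    return applied


seen: Set[str] = set()
counts: Dict[str, int] = {}
apply_events([("e1", "s1"), ("e2", "s1")], seen, counts)
# A failure upstream causes e2 to be delivered again together with e3.
apply_events([("e2", "s1"), ("e3", "s2")], seen, counts)
print(counts)
```

In a real deployment the set of seen ids and the counts would have to be stored durably and updated together, which is part of why exactly-once is the harder guarantee.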
  • 23. (2) PROCESSING: CHOICES
      •  Storm
          •  Process streams
          •  Event based
          •  Came out of Twitter
      •  Apache Samza
          •  Stream processing framework based on Kafka + Hadoop YARN
      •  Apache NiFi
          •  Data flow
          •  New project / incubating
      •  Spark Streaming
  • 24. (2) PROCESSING: MEET SPARK
      •  Spark is the new darling of the ‘Big Data’ world
          •  Lots of activity and interest
      •  Fast and expressive cluster compute engine
      •  “First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com
      [Figure labels: Hadoop, Spark]
  • 25. (2) PROCESSING: SPARK ECO-SYSTEM
      •  Spark Core
      •  Spark SQL: schema / SQL
      •  Spark Streaming: real time
      •  MLlib: machine learning
      •  GraphX: graph processing
      •  Cluster managers: standalone, YARN, Mesos
  • 26. (2) PROCESSING: SPARK DATA SOURCE ABSTRACTION
      [Diagram: Spark (compute engine) on top of the RDD abstraction (Hadoop RDD, Cassandra RDD), over HDFS, Amazon S3, Cassandra, ???]
  • 27. (2) PROCESSING: SPARK, A ‘UNIFIED’ STACK
      Spark supports multiple programming models:
      •  MapReduce-style batch processing
      •  Streaming / real-time processing
      •  Querying via SQL
      •  Machine learning
      All modules are tightly integrated
      •  Facilitates rich applications
      Spark can be the only stack you need!
      Image: buymeposters.com
  • 29. (2) PROCESSING: SPARK STREAMING
      •  Provides ‘high level’ operations in time windows
          •  E.g. ‘calculate X for the past 10 seconds’
      •  Good adoption
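To make "calculate X for the past 10 seconds" concrete, here is a plain-Python sliding-window average. It only illustrates the windowing idea and is not the Spark Streaming API; the class name is hypothetical, and the sketch assumes events arrive in timestamp order on a single machine.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowAverage:
    """Average of the values whose timestamp falls within the last `window_secs`."""

    def __init__(self, window_secs: float) -> None:
        self.window_secs = window_secs
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> None:
        self._events.append((timestamp, value))
        self._total += value
        # Drop events that have fallen out of the window ending at `timestamp`.
        while self._events[0][0] <= timestamp - self.window_secs:
            _, old_value = self._events.popleft()
            self._total -= old_value

    def average(self) -> float:
        if not self._events:
            raise ValueError("no events in the window")
        return self._total / len(self._events)


window = SlidingWindowAverage(window_secs=10)
for ts, temp in [(0, 70.0), (5, 72.0), (12, 80.0)]:
    window.add(ts, temp)
    print(ts, window.average())
```

A streaming engine adds what this sketch leaves out: distributing the window state across machines and recovering it after a failure.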
  • 30. (2) PROCESSING: ARCHITECTURE WITH SPARK STREAMING
  • 31. NEXT : (3) STORAGE
  • 32. (3) STORAGE: REQUIREMENTS
      •  Handle ‘Big Data’ (1 TB / day!)
      •  Traditional storage systems are not effective (or are too expensive)
      •  Need two types of storage
          1.  ‘Forever’ storage
              •  Store multiple terabytes of data for long periods
              •  Support batch queries
          2.  ‘Fast / real-time lookup’ storage
              •  Query in real time (milliseconds): “what is the latest reading for sensor-123?”
              •  Store latest / new data (e.g. last 3 months)
              •  Flexible schema for semi-structured data
      •  Both need to scale
  • 34. (3) STORAGE: CHOICES
      •  ‘Forever’ storage
          •  Scalable distributed file systems
          •  Hadoop! (HDFS actually)
      •  ‘Real-time’ store
          •  Traditional RDBMSs won’t work
              •  Don’t scale well (or are too expensive)
              •  Rigid schema layout
          •  NoSQL!
  • 35. (3) STORAGE: HDFS (IN 20 SECS)
      •  Distributed file system
      •  Runs on commodity servers → high ROI
      •  Can keep ticking even when nodes go down → fault tolerant
      •  Replicates data to prevent data loss in case of node failures → built-in backup :)
      •  Scales to petabytes (horizontal scalability)
      •  Proven in the field
  • 36. (3) STORAGE: HDFS ARCHITECTURE
  • 37. (3) STORAGE: COST OF BIG DATA
      Source: Hortonworks
  • 38. (3) STORAGE: HDFS
      •  Can handle big data
          •  Scales easily
          •  Cost effective
      •  “Source of truth”
          •  Files are immutable within HDFS (new data is ‘appended’)
          •  Audit friendly
  • 39. (3) STORAGE: CAPACITY PLANNING (HADOOP)
      Variables (tweak these)

      variable                    description                 value     units
      average daily ingest                                    1000      GB
      raw data node storage       e.g. 12 disks x 3 TB        36        TB
      replication                 default 3                   3
      space allocated for HDFS    HDFS 75% + MapReduce 25%    75.00%
      growth per month            (not calculated)            0

      Calculation

      effective data storage per node: 27 TB

      growth            1 month    6 months    1 yr     2 yr
      data size (TB)    90         540         1,080    2,160
      # data nodes      3.33       20          40       80
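The capacity-planning table above can be reproduced with a short calculation. The sketch below uses the slide's inputs as defaults and, like the slide, treats a month as 30 days and ignores growth in the ingest rate; the function names are made up for illustration.

```python
import math


def stored_tb(daily_ingest_gb: float, days: int, replication: int = 3) -> float:
    """Total HDFS footprint in TB after `days` of ingest, including replicas."""
    return daily_ingest_gb / 1000 * replication * days


def data_nodes_needed(
    daily_ingest_gb: float,
    days: int,
    replication: int = 3,
    raw_tb_per_node: float = 36,   # e.g. 12 disks x 3 TB
    hdfs_fraction: float = 0.75,   # share of each node's disk given to HDFS
) -> float:
    """Fractional number of data nodes needed to hold the replicated data."""
    usable_tb_per_node = raw_tb_per_node * hdfs_fraction  # 27 TB with the defaults
    return stored_tb(daily_ingest_gb, days, replication) / usable_tb_per_node


for label, days in [("1 month", 30), ("6 months", 180), ("1 yr", 360), ("2 yr", 720)]:
    nodes = data_nodes_needed(1000, days)
    print(f"{label}: {stored_tb(1000, days):,.0f} TB, "
          f"{nodes:.2f} nodes (round up to {math.ceil(nodes)})")
```

Rounding up matters in practice: the 3.33 nodes in the first column means four machines, before allowing any headroom for temporary data or failed disks.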
  • 40. (3) STORAGE (REAL TIME): CHOICES FOR NOSQL
      •  Too many! :)
      •  HBase
          •  Part of the Hadoop ecosystem
          •  Uses HDFS for storage
          •  Provides a consistent view of data
      •  Cassandra
          •  Popular NoSQL store
          •  No single point of failure (SPOF): ring architecture
          •  No dependency on Hadoop
      •  Accumulo
          •  Came out of the NSA!
          •  Uses HDFS for storage
          •  Provides very good security (naturally!)
  • 41. (3) STORAGE: CAP THEOREM
  • 42. (3) STORAGE: ARCHITECTURE SO FAR
  • 43. NEXT : (4) QUERY
  • 44. (4) QUERY: REQUIREMENTS
      •  Real-time queries
          •  “What is the latest reading for sensor id = 123?”
          •  Useful for building applications / dashboards
          •  Latency: milliseconds
      •  Batch / aggregate queries
          •  “What is the average temperature in zip code = 12345?”
          •  May need to go through a large number of data points
          •  Latency: ‘batch’ (minutes / hours)
  • 45. (4) QUERY: SOLUTIONS
      •  Batch queries
          •  Query data in HDFS (and / or NoSQL)
          •  Hadoop MapReduce (Pig / Hive)
          •  Spark batch analytics
      •  Real-time queries
          •  Queries go to the NoSQL store
      [Diagram: HDFS and NoSQL stores, serving batch queries and real-time queries]
  • 49. LAMBDA ARCHITECTURE EXPLAINED
      1.  All new data is sent to both the batch layer and the speed layer
      2.  Batch layer
          •  Holds the master data set (immutable, append-only)
          •  Answers batch queries
      3.  Serving layer
          •  Updates batch views so they can be queried ad hoc
      4.  Speed layer
          •  Handles new data
          •  Facilitates fast / real-time queries
      5.  Query layer
          •  Answers queries using batch & real-time views
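The query layer in the list above answers a question by combining a precomputed batch view with a real-time view that covers data the batch job has not processed yet. The sketch below shows that merge for the deck's "average temperature in a zip code" query. The data layout, the names, and the choice of storing (sum, count) per zip code are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Reading = Tuple[str, float]           # (zipcode, temperature)
View = Dict[str, Tuple[float, int]]   # zipcode -> (sum of temperatures, count)


def build_view(readings: Iterable[Reading]) -> View:
    """Aggregate readings into per-zipcode (sum, count) pairs.

    The batch layer would run this over the master data set; the speed
    layer would maintain the same shape incrementally for recent data."""
    view: View = {}
    for zipcode, temp in readings:
        total, count = view.get(zipcode, (0.0, 0))
        view[zipcode] = (total + temp, count + 1)
    return view


def average_temp(zipcode: str, batch_view: View, realtime_view: View) -> float:
    """Query layer: merge both views so recent data is reflected in the answer."""
    batch_total, batch_count = batch_view.get(zipcode, (0.0, 0))
    recent_total, recent_count = realtime_view.get(zipcode, (0.0, 0))
    count = batch_count + recent_count
    if count == 0:
        raise KeyError(zipcode)
    return (batch_total + recent_total) / count


master_data = [("12345", 70.0), ("12345", 72.0), ("99999", 60.0)]
recent_data = [("12345", 74.0)]  # arrived after the last batch run

batch_view = build_view(master_data)
realtime_view = build_view(recent_data)
print(average_temp("12345", batch_view, realtime_view))
```

Storing sums and counts rather than averages is what makes the two views mergeable; two averages alone could not be combined correctly.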
  • 51. OUR ARCHITECTURE
      •  Each component is scalable
      •  Each component is fault tolerant
      •  Incorporates best practices
      •  All open source!
  • 52. … AND ONE MORE THING…
      •  Security!
      Source: businessinsider.com
  • 53. HOWEVER…
      At scale, nothing works as advertised!
  • 54. GOOD NEWS!
      •  We are building an open source reference data platform for IoT / connected devices!
      •  Yes, open source! :)
          •  ElephantScale is a strong believer in open source
          •  “Hadoop Illuminated”: open source Hadoop book
          •  github.com/elephantscale
      •  Best practices
          •  Bringing together lots of expertise in Big Data systems
      •  Register your interest: http://elephantscale.com/iotx/
  • 55. GOALS FOR IOTX
      http://elephantscale.com/iotx/
      •  Use proven open-source components
      •  Capture:
          •  Kafka
          •  Kinesis (AWS)
      •  Processing: Spark Streaming
      •  Batch storage: Hadoop / HDFS
      •  Real-time store: support multiple data stores
          •  Cassandra
          •  HBase
          •  Accumulo
          •  ???
  • 56. GOALS FOR IOTX…
      http://elephantscale.com/iotx/
      •  Query templates using
          •  Spark
          •  Hadoop MapReduce (Pig / Hive)
      •  Incorporate third-party libraries for
          •  Outlier detection (temperature is outside norms)
          •  Trend detection (stock price is trending up)
          •  Alerts (fire!)
      •  Monitoring & metrics (key!!)
          •  What’s going on in the system?
          •  Host / system level (CPU / network, etc.): easier
          •  Application level (e.g. find slow queries): harder
          •  Incorporate third-party libraries
  • 57. THANKS AND QUESTIONS?
      “A Reference Architecture for Internet of Things (IoT)”
      Sujee Maniyam, Founder / Principal @ ElephantScale
      Expert consulting + training in Big Data technologies
      sujee@elephantscale.com
      elephantscale.com
      Project sign-up page: http://elephantscale.com/iotx/
  • 58. IMAGE CREDITS
      •  www.engadget.com
      •  Xfinity.com
      •  Tesla.com