Pramod Immaneni <pramod@datatorrent.com>
PPMC Member, Architect @DataTorrent Inc
Apr 6th, 2016
A native Hadoop platform
IoT Ingestion & Analytics using Apache Apex
© 2015 DataTorrent
IoT (Internet of Things)
• Network of objects in physical world with electronic sensors
• Collecting and sending data
• Consumer
• Smart home – Smart bulbs, thermostats, locks
• Connected vehicles
• Wearables
• Industrial
• Manufacturing
• Retail
• Energy

Challenges
• Velocity of data
• Machines generate data at a much higher pace than humans
• They generate data continuously, without pause, which poses processing challenges
• Certain aspects of data may become stale if not processed in time
• Volume
• Large volumes of data for processing and storage
• Volume keeps increasing because of the exponential growth in devices
• Research estimates zettabytes by 2020
• Variety
• Types of data – sensor, location, KPI
• Formats – CSV, JSON
• Delivery mechanisms – files, message queues

What should a solution have
• Ingest and perform analytics on data in motion
• Handle the incoming throughput while maintaining SLAs
• Scale with the volume of data
• Fault tolerant so that no data is lost
• Flexible to handle data variety and allow custom business logic
• Easy to use tools for the user

What is Apex
• Platform and runtime engine to enable scalable and fault-tolerant distributed applications
• Hadoop native
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Build any custom business logic in your application

Applications
• Distributed processing
• Application logic broken into components called operators that run in a distributed fashion across your cluster
• Scalable
• Operators can be partitioned into multiple instances to handle large volumes
• Fault tolerant
• Automatically recover from node outages without having to reprocess from the beginning
• State is preserved
• Long running applications
• Operators
• Use library to build applications quickly
• Write your own in Java using the API
• Operational insight – DataTorrent RTS
• See how each operator is performing and even record data

Apex Platform Overview

Apache Malhar Library

Application Development Model
• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations, and emits one or more output streams
ᵒ Each Operator is your custom business logic in Java, or a built-in operator from our open source library
ᵒ An Operator can have many instances that run in parallel; each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Figure: Directed Acyclic Graph (DAG) of operators connected by streams of tuples, ending in an output stream]
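The development model can be illustrated with a minimal Python sketch. This is not the Apex API (Apex operators are written in Java); the `Dag` class, its `add_operator`, `add_stream`, and `run` methods, and the three example operators are hypothetical names chosen for illustration. Each operator is modeled as a function that takes one tuple and emits zero or more tuples, and streams connect operators into a directed acyclic graph.

```python
from collections import defaultdict, deque


class Dag:
    """Toy logical DAG: named operators connected by streams (hypothetical API)."""

    def __init__(self):
        self.operators = {}               # name -> function(tuple) -> list of tuples
        self.streams = defaultdict(list)  # upstream name -> downstream names

    def add_operator(self, name, fn):
        self.operators[name] = fn

    def add_stream(self, src, dst):
        self.streams[src].append(dst)

    def run(self, source, tuples):
        """Push each input tuple through the graph; collect what leaf operators emit."""
        output = []
        queue = deque((source, t) for t in tuples)
        while queue:
            name, item = queue.popleft()
            for emitted in self.operators[name](item):
                downstream = self.streams[name]
                if not downstream:          # no outgoing stream: this is an output
                    output.append(emitted)
                for dst in downstream:
                    queue.append((dst, emitted))
        return output


dag = Dag()
dag.add_operator("parse", lambda line: [float(line)])
dag.add_operator("filter", lambda v: [v] if v >= 20.0 else [])
dag.add_operator("to_fahrenheit", lambda c: [c * 9 / 5 + 32])
dag.add_stream("parse", "filter")
dag.add_stream("filter", "to_fahrenheit")

result = dag.run("parse", ["18.5", "20.0", "25.0"])
```

Here `result` holds the Fahrenheit values of the readings at or above 20 degrees Celsius. A real engine would run operator instances in parallel across a cluster rather than in a single loop.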

Native Hadoop Integration
• YARN is the resource manager
• HDFS used for storing any persistent state

Advanced Windowing Support
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
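The difference between tumbling and sliding windows can be shown with a small Python sketch. It assumes events are `(timestamp, value)` pairs with integer timestamps; the function names and the `size` and `slide` parameters are illustrative and say nothing about how Apex implements its windows internally.

```python
def tumbling_windows(events, size):
    """Group (timestamp, value) events into back-to-back, non-overlapping windows."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size
        windows.setdefault(start, []).append(value)
    return dict(sorted(windows.items()))


def sliding_windows(events, size, slide):
    """Group events into overlapping windows of length `size` starting every `slide` units."""
    windows = {}
    for ts, value in events:
        # Smallest multiple of `slide` that is strictly greater than ts - size,
        # i.e. the earliest window start whose span [start, start + size) covers ts.
        first = max(0, ((ts - size) // slide + 1) * slide)
        for start in range(first, ts + 1, slide):
            windows.setdefault(start, []).append(value)
    return dict(sorted(windows.items()))


events = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
tumbling = tumbling_windows(events, size=2)
sliding = sliding_windows(events, size=2, slide=1)
```

With these inputs each event lands in exactly one tumbling window, while most events land in two sliding windows because consecutive windows overlap by one time unit.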

Scalability & Partitioning
[Figure: NxM Partitions and Unifier. Panels: Logical DAG (operators 0, 1, 2, 3); Logical Diagram (operators 0, 1, 2); Physical Diagram with operator 1 with 3 partitions and a Unifier; Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck; Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier]
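The role of partitions and a unifier can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Apex code: `partition_for`, `count_partition`, and `unify` are hypothetical names, and the routing rule (a simple byte-sum hash of the device id) is chosen only so that runs are repeatable.

```python
from collections import Counter


def partition_for(key, num_partitions):
    """Route a key to a partition; a stable toy hash so runs are repeatable."""
    return sum(key.encode("utf-8")) % num_partitions


def count_partition(tuples):
    """One partition's work: count readings per device for the tuples it received."""
    return Counter(device for device, _ in tuples)


def unify(partials):
    """Unifier: merge the partial results emitted by every partition."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


readings = [("dev-a", 1), ("dev-b", 2), ("dev-a", 3),
            ("dev-c", 4), ("dev-b", 5), ("dev-a", 6)]
num_partitions = 3
buckets = [[] for _ in range(num_partitions)]
for reading in readings:
    buckets[partition_for(reading[0], num_partitions)].append(reading)

merged = unify(count_partition(bucket) for bucket in buckets)
```

Each partition sees only its share of the stream, and the unifier restores a single combined result for the downstream operator. A single unifier that merges many partitions can itself become the bottleneck, which is the case the last panel of the figure describes.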

Advanced Partitioning
[Figure: Parallel Partition. Panels: Logical DAG (operators 0, 1, 2, 3, 4); Physical DAG (0, 1a, 1b, Unifier, 2, 3, 4); Physical DAG with Parallel Partition (0, then 1a, 2a, 3a and 1b, 2b, 3b, then Unifier, 4)]
[Figure: Cascading Unifiers. Panels: Logical Plan (uopr, dopr); Execution Plan, for N = 4; M = 1; Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers. Diagram labels: Container, NIC, uopr1 to uopr4, unifier, dopr]
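The cascading-unifier plan can be approximated with a short Python sketch. It assumes that K in the figure is the maximum number of inputs a single unifier merges; that reading of K, and the names `cascade_levels`, `cascade_merge`, and `fan_in`, are assumptions made for illustration.

```python
def cascade_levels(num_partitions, fan_in):
    """Return unifier counts per level when each unifier merges at most `fan_in` inputs."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    levels = []
    inputs = num_partitions
    while inputs > 1:
        unifiers = -(-inputs // fan_in)  # ceiling division
        levels.append(unifiers)
        inputs = unifiers
    return levels


def cascade_merge(partials, fan_in, merge=sum):
    """Merge partial results level by level, `fan_in` inputs per unifier."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    while len(partials) > 1:
        partials = [merge(partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0]


levels = cascade_levels(num_partitions=4, fan_in=2)
total = cascade_merge([10, 20, 30, 40], fan_in=2)
```

For N = 4 upstream partitions and a fan-in of 2, the sketch yields two first-level unifiers feeding one final unifier, so no single unifier has to receive all four streams.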

Dynamic Partitioning
• Partitioning can change while the application is running
ᵒ Change the number of partitions at runtime based on stats
ᵒ Determine the initial number of partitions dynamically
• Kafka operators scale according to the number of Kafka partitions
ᵒ Supports redistribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner
[Figure: physical plans before and after dynamic repartitioning, with operator partitions 1a, 1b; 2a to 2d; 3, 3a, 3b. Unifiers not shown]
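A stats-based scaler and the redistribution of state it triggers might look like the following Python sketch. Everything here is hypothetical: `choose_partition_count`, `repartition`, the capacity figures, and the byte-sum routing rule are illustrative assumptions, not the Apex partitioner API.

```python
def partition_for(key, num_partitions):
    """Route a key to a partition; a stable toy hash so runs are repeatable."""
    return sum(key.encode("utf-8")) % num_partitions


def choose_partition_count(tuples_per_sec, capacity_per_partition, max_partitions):
    """Hypothetical stats-based scaler: enough partitions for the observed load."""
    needed = -(-tuples_per_sec // capacity_per_partition)  # ceiling division
    return max(1, min(max_partitions, needed))


def repartition(states, new_count):
    """Redistribute per-key state from the old partitions across `new_count` partitions."""
    new_states = [{} for _ in range(new_count)]
    for state in states:
        for key, value in state.items():
            new_states[partition_for(key, new_count)][key] = value
    return new_states


old_states = [{"dev-a": 3, "dev-c": 1}, {"dev-b": 2}]
new_count = choose_partition_count(tuples_per_sec=2500,
                                   capacity_per_partition=1000,
                                   max_partitions=8)
new_states = repartition(old_states, new_count)
```

The point of the sketch is that no state is lost when the partition count changes: every key ends up in exactly one new partition, determined by the same routing rule that incoming tuples will follow.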

Fault Tolerance
• Operator state is checkpointed to a persistent store
ᵒ Automatically performed by engine, no additional work needed by operator
ᵒ In case of failure operators are restarted from checkpoint state
ᵒ Frequency configurable per operator
ᵒ Asynchronous and distributed by default
ᵒ Default store is HDFS
• Automatic detection and recovery of failed operators
ᵒ Heartbeat mechanism
• Buffering mechanism to ensure replay of data from recovered point so that there is no loss of data
• Application master state checkpointed
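Checkpointing combined with buffered replay can be sketched as follows. This is a toy model, assuming a single operator, an in-memory buffer, and synchronous checkpoints; the `Engine` and `CountingOperator` classes and the `checkpoint_every` parameter are hypothetical. It is not how the Apex engine is implemented, which the slides describe as asynchronous, distributed, and backed by HDFS by default.

```python
import copy


class CountingOperator:
    """Toy stateful operator: keeps a running sum of the values it has processed."""

    def __init__(self):
        self.total = 0

    def process(self, value):
        self.total += value


class Engine:
    """Hypothetical engine: checkpoints operator state and buffers tuples for replay."""

    def __init__(self, operator, checkpoint_every):
        self.operator = operator
        self.checkpoint_every = checkpoint_every
        self.checkpoint = (copy.deepcopy(operator), 0)  # (saved state, next offset)
        self.buffer = []                                # upstream tuples kept for replay

    def feed(self, value):
        self.buffer.append(value)
        self.operator.process(value)
        if len(self.buffer) % self.checkpoint_every == 0:
            self.checkpoint = (copy.deepcopy(self.operator), len(self.buffer))

    def recover(self):
        """Restart from the last checkpoint, then replay the tuples received after it."""
        saved, offset = self.checkpoint
        self.operator = copy.deepcopy(saved)
        for value in self.buffer[offset:]:
            self.operator.process(value)


engine = Engine(CountingOperator(), checkpoint_every=3)
for v in [1, 2, 3, 4, 5]:
    engine.feed(v)

engine.operator = None  # simulate a failure that destroys the in-memory state
engine.recover()
```

After recovery the operator's total matches what it would have been without the failure, because the state saved after the third tuple is restored and the fourth and fifth tuples are replayed from the buffer.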

Processing Guarantees
At least once
• On recovery data will be replayed from a previous checkpoint
ᵒ Messages will not be lost
ᵒ Default mechanism; suitable for most applications
• Can be used in conjunction with the following mechanisms to achieve exactly-once behavior in fault recovery scenarios
ᵒ Transactions with meta information, rewinding output, feedback from an external entity, idempotent operations
At most once
• On recovery the latest data is made available to the operator
ᵒ Useful in use cases where some data loss is acceptable and the latest data is sufficient
Exactly once
• At least once + state recovery + operator logic to achieve end-to-end exactly once
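One of the mechanisms listed, idempotent operations, can be illustrated with a short Python sketch. It assumes every tuple carries a unique id; the two store classes and the `deliver` helper are hypothetical and stand in for an external system that an output operator writes to.

```python
class IdempotentStore:
    """Toy external store: writes are keyed by tuple id, so repeats overwrite, not add."""

    def __init__(self):
        self.rows = {}

    def upsert(self, tuple_id, value):
        self.rows[tuple_id] = value


class AppendOnlyStore:
    """Contrast: a store that appends blindly records replayed tuples twice."""

    def __init__(self):
        self.rows = []

    def append(self, tuple_id, value):
        self.rows.append((tuple_id, value))


stream = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
idempotent, append_only = IdempotentStore(), AppendOnlyStore()


def deliver(tuples):
    """Write each tuple to both stores."""
    for tuple_id, value in tuples:
        idempotent.upsert(tuple_id, value)
        append_only.append(tuple_id, value)


deliver(stream)      # first attempt; assume a failure occurs after tuple 4 is written
deliver(stream[2:])  # at-least-once recovery replays tuples 3 and 4 from the checkpoint
```

The idempotent store ends with four rows while the append-only store ends with six, which is why at-least-once delivery needs help from the output logic to look like exactly-once from the outside.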

Ingestion Operators
• NFS/FTP/S3/HDFS Readers and Writers
• Kafka readers and writers
• Cassandra, HBase, Hive, JDBC
• CSV, JSON, Avro, XML parsers
• Filtering and enrichment
• Dedup
• Dimension computation
• Stream Joins
• …
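How several of these building blocks combine into an ingestion pipeline can be sketched with the Python standard library. The sketch parses CSV and JSON input, removes duplicates, filters, and enriches records; the function names, field names, sample data, and the `locations` lookup table are all made up for illustration and do not correspond to Malhar operator classes.

```python
import csv
import io
import json


def parse_csv(text):
    """Parse CSV text with a header row into dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json_lines(text):
    """Parse one JSON object per line into dict records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def dedup(records, key):
    """Drop records whose key value was already seen, keeping the first occurrence."""
    seen, unique = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def enrich(records, lookup, key, field):
    """Attach a looked-up attribute (here: device location) to each record."""
    return [{**r, field: lookup.get(r[key], "unknown")} for r in records]


csv_input = "id,device,temp\n1,dev-a,21.5\n2,dev-b,19.0\n"
json_input = ('{"id": "2", "device": "dev-b", "temp": "19.0"}\n'
              '{"id": "3", "device": "dev-c", "temp": "30.2"}\n')
locations = {"dev-a": "plant-1", "dev-b": "plant-2"}

records = parse_csv(csv_input) + parse_json_lines(json_input)
records = dedup(records, key="id")
records = [r for r in records if float(r["temp"]) >= 20.0]  # filtering
records = enrich(records, locations, key="device", field="location")
```

In a streaming application each of these steps would be a separate operator connected by streams, so that parsing, deduplication, and enrichment can be partitioned and scaled independently.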

Pipeline builder

Monitoring Console
Logical View

Monitoring Console
Physical View

Real-Time Dashboards
Real Time Visualization

IoT solutions

Resources
• Ingest application – http://www.datatorrent.com/download
• Solutions – http://www.datatorrent.com/solutions
• Subscribe - http://apex.incubator.apache.org/community.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
• Cloud Trial - http://datatorrent.com/cloudtrial.html

We Are Hiring
• jobs@datatorrent.com
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders

Upcoming Events
• April 14th 6pm PDT – IOT Big Data Ingestion and Processing in Hadoop by Silver Spring Networks
• May 4th 6pm PDT – Fault Tolerant HDFS r/w with Apache Apex and Apex (native Hadoop) benchmarks
End