1
Tale of two stream processing frameworks
Apache Storm & Apache Flink
Karthik Deivasigamani
@WalmartLabs
2
Streaming
• Stream
– Continuous flow
• Streaming Data
– Streaming data is data that is continuously
generated by different sources.
– Unbounded data
• Stream Processing
– processing of data in motion, or in other
words, computing on data directly as it is
produced or received
– data processing engine that is designed with
infinite data sets in mind
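As a minimal sketch of what "computing on data as it is produced" can look like, the plain-Python generator below emits an updated result after every event instead of waiting for the input to end. The `price_updates` source and its values are hypothetical and only stand in for an unbounded stream.

```python
import itertools
from typing import Iterable, Iterator


def price_updates() -> Iterator[float]:
    """Hypothetical unbounded source: yields prices forever."""
    for i in itertools.count():
        yield 10.0 + (i % 5)


def running_average(stream: Iterable[float]) -> Iterator[float]:
    """Emit an updated average after every event, never waiting for the end."""
    total = 0.0
    for n, value in enumerate(stream, start=1):
        total += value
        yield total / n


# The source never ends, so only the first few results are taken.
first_five = list(itertools.islice(running_average(price_updates()), 5))
```

A batch job would need the whole data set before producing an answer; here every incoming event immediately refines the result.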
3
Retail Data
• Catalog Data
• Pricing Data
• Clickstream logs
• Payments
• Order Data
• Inventory
• Delivery Logistics
4
Not so long ago..
• Data submitted as feeds
• Periodic Data Collection
• Data Processed In Batches
• Runs offline
• Delay between actual time &
processing time
• Failures
5
Need For Speed – Fast Data
• Catalog Updates
• Price Updates
• Fraud Detection
• Out of stock
• Delivery alerts
• Personalization
6
7
Catalog Use Case
8
Catalog Functions
• Normalization
• Classification
• Product Matching
• Shelving
• Attribute Extraction
• Grouping
• Image
9
Characteristics of ingestion pipeline
• Zero message loss
• Fault Tolerance
• Source based priority queue
• Scale to millions of product updates/hour
• Near Real Time Updates
• Checkpoint at various stages
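One way to picture a source-based priority queue is sketched below in plain Python. The source names and their priorities are assumptions made for illustration; the slides do not say which sources exist or how they are ranked.

```python
import heapq
import itertools

# Hypothetical priorities: a lower number is served first.
SOURCE_PRIORITY = {"site-ops": 0, "seller": 1, "bulk-feed": 2}


class SourcePriorityQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a source

    def put(self, source, update):
        heapq.heappush(self._heap, (SOURCE_PRIORITY[source], next(self._seq), update))

    def get(self):
        _, _, update = heapq.heappop(self._heap)
        return update

    def __len__(self):
        return len(self._heap)


q = SourcePriorityQueue()
q.put("bulk-feed", "item-1")
q.put("seller", "item-2")
q.put("site-ops", "item-3")
q.put("seller", "item-4")
drained = [q.get() for _ in range(len(q))]
```

Updates from the higher-priority source jump ahead of a large bulk feed, while updates from the same source keep their arrival order.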
10
Apache Storm
• Created by Nathan Marz
• Stream Abstraction
• Spouts, Bolts, Topology
• Trident
• Kafka Integration
• Message processing
guarantees
11
Storm Cluster
• Nimbus
– distributing code
– assigning tasks to machines
– monitoring for failures
• Supervisor
– communicates with Nimbus
through Zookeeper
– starts and stops workers
according to signals from Nimbus
• Zookeeper
– Coordinates the storm cluster
12
Key Concepts
• Tuples
– Named list of values where each
value can be any type.
• Stream
– unbounded sequence of tuples
• Spout
– sources of streams in a
computation
• Bolts
– process input streams and
produce output streams
• Topology
– DAG - network of spouts and
bolts
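The concepts above can be modelled in a few lines of plain Python. This is not the Storm API: generators stand in for spouts and bolts, dicts stand in for tuples, and the input is finite only so that the example terminates.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

StormTuple = Dict[str, object]  # a named list of values


def sentence_spout() -> Iterator[StormTuple]:
    """Spout: the source of a stream."""
    for line in ["storm and flink", "flink and kafka"]:
        yield {"sentence": line}


def split_bolt(stream: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Bolt: consumes one stream and produces another."""
    for tup in stream:
        for word in str(tup["sentence"]).split():
            yield {"word": word}


def count_bolt(stream: Iterable[StormTuple]) -> Counter:
    """Terminal bolt: keeps a count per word."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
    return counts


# Topology: spout -> split -> count, a three-node DAG.
word_counts = count_bolt(split_bolt(sentence_spout()))
```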
13
Stream Grouping
• Shuffle Grouping
• Fields Grouping
• All grouping
• Global Grouping
• Local or Shuffle grouping
• Direct Grouping
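A grouping decides which task(s) of the downstream bolt receive a tuple. The sketch below models four of the groupings as plain Python functions that return target task indices. It is a simplification, not Storm's implementation: the CRC32 hash is an arbitrary choice, and shuffle is approximated with a random pick. Local-or-shuffle and direct grouping are left out.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick a random task so that load spreads across all tasks."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, fields):
    """Tuples with the same values in `fields` always reach the same task."""
    key = "|".join(str(tup[f]) for f in fields).encode()
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Send the entire stream to a single task (the one with the lowest id)."""
    return [0]
```

Fields grouping is the one to reach for when a bolt keeps per-key state, since every update for a given key then lands on the same task.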
14
Parallelism of a Storm Topology
• Worker processes
– Executes a subset of a topology
• Executors (Threads)
– A thread spawned by a worker process.
– It may run one or more tasks for
the same component (spout or
bolt).
• Tasks
– A task performs the actual data processing; each spout or bolt that
you implement in your code runs as many tasks across the cluster
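To make the relationship between tasks and executors concrete, the hypothetical helper below splits a component's task ids into near-equal groups, one group per executor thread. It only illustrates the arithmetic and is not Storm's scheduler.

```python
def assign_tasks(num_tasks, num_executors):
    """Split task ids into contiguous, near-equal ranges, one per executor."""
    base, extra = divmod(num_tasks, num_executors)
    assignment, start = [], 0
    for executor in range(num_executors):
        size = base + (1 if executor < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment
```

With 4 tasks and 2 executors each thread runs 2 tasks; with 5 tasks and 2 executors one thread runs 3 and the other 2.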
15
Guaranteeing Message Processing
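Storm tracks the tree of tuples that grows from each spout tuple by XOR-ing tuple ids into a single value: every id is XOR-ed in once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when the whole tree has been processed. The plain-Python model below sketches that idea. The class and method names are made up for illustration, and the small fixed ids replace the random 64-bit ids Storm uses.

```python
class Acker:
    """Tracks one tuple tree per spout message with a single XOR-ed value."""

    def __init__(self):
        self._pending = {}  # root id -> running XOR of tuple ids

    def _xor(self, root_id, tuple_id):
        value = self._pending.get(root_id, 0) ^ tuple_id
        if value == 0:
            self._pending.pop(root_id, None)  # every emitted tuple was acked
        else:
            self._pending[root_id] = value

    def emit(self, root_id, tuple_id):
        self._xor(root_id, tuple_id)

    def ack(self, root_id, tuple_id):
        self._xor(root_id, tuple_id)

    def is_complete(self, root_id):
        return root_id not in self._pending


acker = Acker()
root, a, b, c = "msg-1", 0x1A, 0x2B, 0x3C
acker.emit(root, a)      # the spout emits tuple a
acker.emit(root, b)      # a bolt emits b and c, anchored on a ...
acker.emit(root, c)
acker.ack(root, a)       # ... and then acks a
pending_after_first_ack = not acker.is_complete(root)
acker.ack(root, b)
acker.ack(root, c)
complete = acker.is_complete(root)
```

The memory cost per spout message stays constant no matter how large the tuple tree grows, which is why the scheme scales.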
16
Micro Service vs Bolt
• Choice of language
• Teams operate independently
• Platform with pluggable services
Bolt
17
Catalog Pipeline
18
Challenges
• Validations at various stages
• Async IO using RxJava, Hystrix
• Hystrix Circuit Breaker
• Failing Tuples
• Fetch-size, increase workers,
increase bolt parallelism
• Data Errors
• Services taking longer
• Service outage
• Fatal Errors
• Spike in traffic
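The circuit breaker mentioned above can be sketched as follows. This is a plain-Python model of the pattern, not the Hystrix API; the class name, thresholds and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()       # open: fail fast, do not touch the service
            self._opened_at = None      # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0              # a success closes the circuit again
        return result
```

Inside a bolt, `func` would be the call to the slow or failing service and `fallback` could fail the tuple or route it to an error stream, so that an outage does not tie up every executor thread.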
19
Lessons Learnt
• Things will fail
• Monitor everything
• Automation
• Scale is not a feature
• Logs don’t lie
20
21
Pricing Use Case
• Competitive pricing (EDLP)
• Seller price updates
• Handle spike during holidays
• Promotions
• Anomaly Detection
• Accuracy
22
Characteristics of ingestion pipeline
• Exactly Once
• Order Guarantee
• Stateful
• Handle tens of millions of
updates/hour
• NRT price update on website
• Traceability
23
Apache Flink
• Started as the Stratosphere research project at
universities around Berlin
• data Artisans founded in 2014
• Process Unbounded and
Bounded Data
• Exactly Once
• Stateful & Flexible API
• Alibaba was using it at scale
24
Apache Flink - Overview
• Data source: Incoming data that Flink processes
• Transformations: The processing step, when Flink modifies incoming data
• Data sink: Where Flink sends data after processing
25
Apache Flink - Runtime
26
Stateful Stream Processing
• "state" is shared between events.
• Past events can influence the way current
events are processed.
• Embedded database (RocksDB) for state.
• Local state needs to be protected against
failures to avoid data loss.
• Checkpointing to guarantee persistence of
state.
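A small example of past events influencing how the current one is handled: the sketch below keeps the last price per item and flags a new price that moves too far from it. It is plain Python, a dict stands in for the state backend, and the 50% threshold is an arbitrary assumption.

```python
class PriceAnomalyDetector:
    """Keeps the last price per item; a dict stands in for the state backend."""

    def __init__(self, max_change=0.5):
        self.max_change = max_change
        self.last_price = {}  # keyed state: item id -> last seen price

    def process(self, item_id, price):
        """Return True when the price moved more than max_change vs the last one."""
        previous = self.last_price.get(item_id)
        self.last_price[item_id] = price
        if previous is None or previous == 0:
            return False
        return abs(price - previous) / previous > self.max_change


detector = PriceAnomalyDetector()
events = [("tv", 100.0), ("tv", 110.0), ("tv", 300.0), ("sofa", 300.0)]
flags = [detector.process(item, price) for item, price in events]
```

The same price of 300.0 is an anomaly for "tv" but not for "sofa": the outcome depends on the state built up from earlier events, which is also why that state has to survive failures.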
27
Flink Checkpointing (Chandy-Lamport Algorithm)
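Flink's checkpointing is a variant of the Chandy-Lamport algorithm in which barriers flow through the stream together with the records. The sketch below models barrier alignment for one operator with several input channels: once a barrier has arrived on a channel, later records from that channel are held back until the barrier has arrived on every channel, and only then is the state snapshotted. It is a simplified plain-Python model with made-up names, not Flink code.

```python
BARRIER = object()  # marker the sources inject into every channel


class SummingOperator:
    def __init__(self, num_channels):
        self.total = 0          # operator state
        self.snapshots = []     # one entry per completed checkpoint
        self._num_channels = num_channels
        self._blocked = set()   # channels whose barrier has already arrived
        self._buffered = []     # items held back during alignment

    def receive(self, channel, item):
        if channel in self._blocked:
            self._buffered.append((channel, item))  # belongs to the next checkpoint
        elif item is BARRIER:
            self._blocked.add(channel)
            if len(self._blocked) == self._num_channels:
                self._complete_checkpoint()
        else:
            self.total += item

    def _complete_checkpoint(self):
        self.snapshots.append(self.total)           # consistent snapshot
        self._blocked.clear()
        pending, self._buffered = self._buffered, []
        for channel, item in pending:
            self.receive(channel, item)


op = SummingOperator(num_channels=2)
arrivals = [(0, 1), (0, BARRIER), (0, 10), (1, 2), (1, 3), (1, BARRIER), (1, 20)]
for channel, item in arrivals:
    op.receive(channel, item)
```

The snapshot contains exactly the records that arrived before the barrier on each channel (1 + 2 + 3) and none that arrived after it, even though the record 10 reached the operator earlier than 2 and 3.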
28
Exactly Once - Explained
• The label “exactly-once” is misleading in
describing what is done exactly once.
• No stream processor can guarantee
exactly-once event processing.
• Flink guarantees exactly-once state
updates.
• Flink uses the Chandy-Lamport algorithm
to draw consistent snapshots of the current
state, which form a checkpoint.
• Flink restarts an application using the most
recently completed checkpoint as a starting
point.
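The difference between exactly-once processing and exactly-once state updates can be seen in a small simulation. The plain-Python job below counts keys from a replayable log, takes a checkpoint, "crashes", and restarts from the checkpoint. All names are hypothetical, and `side_effects` stands in for anything outside the checkpointed state, such as a call to an external service.

```python
import copy


class CountingJob:
    def __init__(self):
        self.counts = {}          # operator state, covered by checkpoints
        self.offset = 0           # read position in the replayable source
        self.side_effects = []    # external effects, not rolled back on restore
        self._checkpoint = ({}, 0)

    def checkpoint(self):
        self._checkpoint = (copy.deepcopy(self.counts), self.offset)

    def restore(self):
        counts, offset = self._checkpoint
        self.counts, self.offset = copy.deepcopy(counts), offset

    def run(self, log, until):
        while self.offset < until:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.side_effects.append(key)
            self.offset += 1


log = ["a", "b", "a", "c"]
job = CountingJob()
job.run(log, until=2)
job.checkpoint()          # state {"a": 1, "b": 1} at offset 2
job.run(log, until=3)     # processes the second "a", then the job fails
job.restore()             # restart from the last completed checkpoint
job.run(log, until=4)     # the second "a" is processed again
```

The second "a" is processed twice, and the external side effect shows it, but the restored state counts it only once.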
29
Duplicate Events
30
Pricing Pipeline
31
Challenges
• HTTP/DB lookup calls
• Huge payload choking network
• Isolation
• Buffer bloat
• Async I/O Operator
• Operator Chaining
• Mesos / YARN
• taskmanager.memory.segment-size
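The idea behind an async I/O operator is to keep several lookups in flight instead of blocking on each one, while capping how many may be outstanding. The sketch below shows that idea with Python's `asyncio`. It is not Flink's Async I/O API: `lookup_price` is a hypothetical stand-in for an HTTP/DB call and `capacity` is an assumed parameter name.

```python
import asyncio


async def lookup_price(item_id):
    """Hypothetical stand-in for an HTTP/DB lookup."""
    await asyncio.sleep(0.01)
    return item_id, len(item_id) * 1.5


async def enrich(item_ids, capacity=2):
    """Run lookups concurrently with at most `capacity` in flight.

    Results are returned in input order.
    """
    limit = asyncio.Semaphore(capacity)

    async def bounded(item_id):
        async with limit:
            return await lookup_price(item_id)

    return await asyncio.gather(*(bounded(i) for i in item_ids))


results = asyncio.run(enrich(["tv", "sofa", "lamp"]))
```

The cap matters for the same reason buffer sizes do: unbounded in-flight requests just move the backlog from the network buffers to the external service.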
32
What we learnt
• Flink is fast, APIs are super easy to use.
• Avoid network shuffle and use forward / operator
chaining.
• Use accumulators to monitor the progress of your
application.
• Checkpoint failures indicate that your application is
running slow.
• Monitor everything – lag, checkpoints, latency etc
• For applications that are inherently slow, configure your
buffers to account for buffer bloat, so that
checkpoints don’t fail.
• Join the Flink user mailing list and ask questions!
33
Apache Storm vs Apache Flink
Feature comparison (Storm, Flink, winner)
• True streaming
– Storm: Yes
– Flink: Yes
– Winner: Tie
• Speed
– Storm: Fast
– Flink: Amazingly fast
• Overall maturity
– Storm: Very stable, haven’t really encountered Storm bugs that hit us in production.
– Flink: Little behind – ran into lots of Flink bugs, some of which are addressed now.
• API
– Storm: Used to be very primitive until 1.0
– Flink: Rich API, and you can achieve a lot by writing very few lines of code.
• Windowing, Join
– Storm: Support was added in 1.2
– Flink: Excellent out of the box support for windowing and join.
– Winner: Tie
• Monitoring / Deployment
– Storm: Better isolation of jobs with the process model
– Flink: You need YARN/Mesos to get better isolation.
– Winner: Tie (assumes you are running Flink on YARN)
• Stateful stream processing
– Storm: WIP (Apache Storm 2.0)
– Flink: Supported with RocksDB. You can also query the state from outside your stream processing system.
• Message processing guarantee
– Storm: Supports at least once, at most once, exactly once (needs Trident)
– Flink: Supports at least once, at most once, exactly once (state is touched exactly once)
– Winner: Tie
• Backpressure
– Storm: Max spout pending can be used to adjust
– Flink: Handled automatically
• Async IO support
– Storm: No native support
– Flink: Out of the box
• Streaming SQL
– Storm: WIP (Apache Storm 2.0)
– Flink: Very early stage
– Winner: -
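The "max spout pending" setting in the backpressure comparison caps how many tuples a spout may have emitted but not yet had acked. The plain-Python model below sketches the effect. The class is hypothetical and is not Storm's spout interface.

```python
class ThrottledSpout:
    def __init__(self, messages, max_pending):
        self._messages = iter(messages)
        self._max_pending = max_pending
        self.pending = set()  # emitted but not yet acked

    def next_tuple(self):
        """Emit the next message, or None while too many tuples are un-acked."""
        if len(self.pending) >= self._max_pending:
            return None
        msg = next(self._messages, None)
        if msg is not None:
            self.pending.add(msg)
        return msg

    def ack(self, msg):
        self.pending.discard(msg)
```

When downstream bolts slow down, acks arrive later, the pending set stays full, and the spout stops pulling from its source. The cap has to be tuned by hand, which is the contrast the comparison draws with Flink handling backpressure automatically.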
34
What should I pick
35
Future of streaming - Cloud
Amazon Kinesis Streams
Functions as stream processors
Cloud Flow
Confluent Cloud
Event Hub – Kafka Compatible
36
Thank You!
Yes, we are hiring!
https://indiacareers.walmartlabs.com/

Designing IA for AI - Information Architecture Conference 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Tale of two streaming frameworks (Karthik D - Walmart)

  • 1. 1 Tale of two stream processing frameworks Apache Storm & Apache Flink Karthik Deivasigamani @WalmartLabs
  • 2. 2 Streaming • Stream – Continuous flow • Streaming Data – Streaming data is data that is continuously generated by different sources. – Unbounded data • Stream Processing – processing of data in motion, or in other words, computing on data directly as it is produced or received – data processing engine that is designed with infinite data sets in mind
  • 3. 3 Retail Data • Catalog Data • Pricing Data • Clickstream logs • Payments • Order Data • Inventory • Delivery Logistics
  • 4. 4 Not so long ago.. • Data submitted as feeds • Periodic Data Collection • Data Processed In Batches • Runs offline • Delay between actual time & processing time • Failures
  • 5. 5 Need For Speed – Fast Data • Catalog Updates • Price Updates • Fraud Detection • Out of stock • Delivery alerts • Personalization
  • 6. 6
  • 8. 8 Catalog Functions • Normalization • Classification • Product Matching • Shelving • Attribute Extraction • Grouping • Image
  • 9. 9 Characteristics of ingestion pipeline • Zero message loss • Fault Tolerance • Source based priority queue • Scale to millions of product updates/hour • Near Real Time Updates • Checkpoint at various stages
  • 10. 10 Apache Storm • Created by Nathan Marz • Stream Abstraction • Spouts, Bolts, Topology • Trident • Kafka Integration • Message processing guarantees
  • 11. 11 Storm Cluster • Nimbus – distributing code – assigning tasks to machines – monitoring for failures • Supervisor – communicates with Nimbus through Zookeeper – starts and stops workers according to signals from Nimbus • Zookeeper – Coordinates the storm cluster
  • 12. 12 Key Concepts • Tuples – Named list of values where each value can be any type. • Stream – unbounded sequence of tuples • Spout – sources of streams in a computation • Bolts – process input streams and produce output streams • Topology – DAG - network of spouts and bolts
  • 13. 13 Stream Grouping • Shuffle Grouping • Fields Grouping • All grouping • Global Grouping • Local or Shuffle grouping • Direct Grouping
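To make a few of these groupings concrete, here is a minimal sketch in plain Python (not the Storm API). It assumes task ids run from 0 to num_tasks - 1; the function names, the `item_id` field, and the use of CRC32 as the hash are illustrative choices only. Storm's real shuffle grouping also balances tuple counts across tasks, which simple random choice does not guarantee.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng):
    """Pick a target task at random, ignoring the tuple's contents."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, field, num_tasks):
    """Pick a target task from a stable hash of one field, so tuples with
    the same value for that field always reach the same task."""
    key = str(tup[field]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


def all_grouping(num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping():
    """Send the entire stream to a single task (the lowest task id)."""
    return 0
```

With fields grouping on `item_id`, two updates for the same product land on the same task, which is what lets a bolt keep per-product state locally.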
  • 14. 14 Parallelism of a Storm Topology • Worker processes – Each executes a subset of a topology • Executors (Threads) – A thread spawned by a worker process. – It may run one or more tasks for the same component (spout or bolt). • Tasks – Perform the actual data processing; each spout or bolt that you implement in your code runs as one or more tasks across the cluster
  • 16. 16 Micro Service vs Bolt • Choice of language • Teams operate independently • Platform with pluggable services Bolt
  • 18. 18 Challenges • Validations at various stages • Async IO using RxJava, Hystrix • Hystrix Circuit Breaker • Failing Tuples • Fetch-size, increase workers, increase bolt parallelism • Data Errors • Services taking longer • Service outage • Fatal Errors • Spike in traffic
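The circuit breaker idea mentioned on this slide can be sketched as follows. This is a simplified illustration of the pattern in plain Python, not Hystrix itself (a Java library); the class name, thresholds, and injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated consecutive failures,
    then rejects calls until a cool-down period has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, skip the service
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit again
        return result
```

A bolt wrapping a slow or failing downstream service this way stops hammering it during an outage and returns the fallback immediately, instead of letting tuples time out.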
  • 19. 19 Lessons Learnt • Things will fail • Monitor everything • Automation • Scale is not a feature • Logs don’t lie
  • 20. 20
  • 21. 21 Pricing Use Case • Competitive pricing (EDLP) • Seller price updates • Handle spike during holidays • Promotions • Anomaly Detection • Accuracy
  • 22. 22 Characteristics of ingestion pipeline • Exactly Once • Order Guarantee • Stateful • Handle tens of millions of updates/hour • NRT price update on website • Traceability
  • 23. 23 Apache Flink • Project Stratosphere in Universities around Berlin • data Artisans founded in 2014 • Process Unbounded and Bounded Data • Exactly Once • Stateful & Flexible API • Alibaba was using it at scale
  • 24. 24 Apache Flink - Overview • Data source: Incoming data that Flink processes • Transformations: The processing step, when Flink modifies incoming data • Data sink: Where Flink sends data after processing
  • 25. 25 Apache Flink - Runtime
  • 26. 26 Stateful Stream Processing • "state" is shared between events. • Past events can influence the way current events are processed. • Embedded database (Rocks DB) for state. • Local state needs to be protected against failures to avoid data loss. • Checkpointing to guarantee persistence of state.
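As a small illustration of "past events influence the way current events are processed", the sketch below keeps the last price seen per item and flags large jumps. It is plain Python with a dict standing in for the embedded state store; the function name, the `(item_id, price)` event shape, and the 50% threshold are hypothetical.

```python
def detect_price_anomalies(events, max_change=0.5):
    """Yield (item_id, old_price, new_price) when a price moves by more
    than max_change (as a fraction) relative to the last price seen for
    that item."""
    last_price = {}  # keyed state: item_id -> most recent price
    for item_id, price in events:
        previous = last_price.get(item_id)
        if previous is not None and previous > 0:
            if abs(price - previous) / previous > max_change:
                yield (item_id, previous, price)
        last_price[item_id] = price
```

If the process holding `last_price` crashes, that state is gone, which is why the slide pairs local state with checkpointing.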
  • 28. 28 Exactly Once - Explained • The label “exactly-once” is misleading in describing what is done exactly once. • No stream processing system can guarantee exactly-once event processing. • Flink guarantees exactly-once state updates. • Flink uses a variant of the Chandy-Lamport algorithm to draw consistent snapshots of the current state and create a checkpoint. • Flink restarts an application using the most recently completed checkpoint as a starting point.
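The difference between exactly-once processing and exactly-once state updates can be shown with a toy single-operator model. This is a sketch under simplifying assumptions, not Flink's distributed snapshot mechanism: the source is a replayable list indexed by offset, the checkpoint is an in-memory copy, and the function name and `fail_at` parameter are made up for the example.

```python
import copy


def run_with_checkpoints(events, checkpoint_every, fail_at=None):
    """Count events per key, snapshotting (offset, state) together.

    If fail_at is set, simulate one crash just before processing that
    offset, then restore the latest checkpoint and replay from there.
    Returns (final_state, times_processed), where times_processed[i] is
    how many times the event at offset i was handled.
    """
    state = {}
    checkpoint = (0, {})  # (next offset to read, copy of state)
    times_processed = [0] * len(events)
    offset = 0
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            offset, saved = checkpoint
            state = copy.deepcopy(saved)  # roll state back with the offset
            continue
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        times_processed[offset] += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(state))
    return state, times_processed
```

Running it on `["a", "b", "a", "c", "a"]` with a checkpoint every 2 events and a crash at offset 3 handles the event at offset 2 twice, yet the final counts match a failure-free run, because the state and the read position are rolled back together.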
  • 31. 31 Challenges • HTTP/DB lookup calls • Huge payload choking network • Isolation • Buffer bloat • Async I/O Operator • Operator Chaining • Mesos / YARN • taskmanager.memory.segment-size
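The motivation for an async I/O operator is that blocking HTTP/DB lookups stall the pipeline one record at a time. A rough sketch of the idea using Python's asyncio (not Flink's API) follows; `enrich_all`, `fake_lookup`, and the `capacity` parameter are hypothetical names, with `capacity` playing the role of a limit on in-flight requests and results returned in input order.

```python
import asyncio


async def enrich_all(records, lookup, capacity=10):
    """Run lookup(record) for every record with at most `capacity`
    requests in flight, returning results in input order."""
    semaphore = asyncio.Semaphore(capacity)

    async def guarded(record):
        async with semaphore:
            return await lookup(record)

    return await asyncio.gather(*(guarded(r) for r in records))


async def fake_lookup(item_id):
    """Stand-in for an HTTP or database call."""
    await asyncio.sleep(0.01)
    return (item_id, "enriched")
```

Calling `asyncio.run(enrich_all(["a", "b", "c"], fake_lookup, capacity=2))` overlaps the waits instead of paying for each lookup sequentially, while the capacity bound keeps a burst of records from overwhelming the external service.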
  • 32. 32 What we learnt • Flink is fast, and its APIs are super easy to use. • Avoid network shuffle and use forward / operator chaining. • Use accumulators to monitor the progress of your application. • Checkpoint failures indicate that your application is running slowly. • Monitor everything: lag, checkpoints, latency, etc. • For applications that are inherently slow, configure your buffers to accommodate buffer bloat so that checkpoints don’t fail. • Join the Flink users mailing list and ask questions!
  • 33. 33 Apache Storm vs Apache Flink
      – True streaming. Storm: Yes. Flink: Yes. Winner: Tie.
      – Speed. Storm: Fast. Flink: Amazingly fast.
      – Overall maturity. Storm: Very stable, haven’t really encountered Storm bugs that hit us in production. Flink: A little behind; ran into lots of Flink bugs, some of which are addressed now.
      – API. Storm: Used to be very primitive until 1.0. Flink: Rich API, and you can achieve a lot by writing very few lines of code.
      – Windowing, Join. Storm: Support was added in 1.2. Flink: Excellent out-of-the-box support for windowing and join. Winner: Tie.
      – Monitoring / Deployment. Storm: Better isolation of jobs with the process model. Flink: You need YARN/Mesos to get better isolation. Winner: Tie (assumes you are running Flink on YARN).
      – Stateful stream processing. Storm: WIP (Apache Storm 2.0). Flink: Supported with RocksDB. You can also query the state from outside your stream processing system.
      – Message processing guarantee. Storm: Supports at least once, at most once, exactly once (needs Trident). Flink: Supports at least once, at most once, exactly once (state is touched exactly once). Winner: Tie.
      – Backpressure. Storm: Max spout pending can be used to adjust. Flink: Handled automatically.
      – Async IO support. Storm: No native support. Flink: Out of the box.
      – Streaming SQL. Storm: WIP (Apache Storm 2.0). Flink: Very early stage. Winner: -
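The "max spout pending" approach to backpressure named in the comparison can be pictured as a cap on emitted-but-unacknowledged tuples. The sketch below is plain Python, not Storm code. In Storm the framework enforces the limit by not asking the spout for more tuples, whereas here the check sits inside a hypothetical `ThrottledSpout` class to keep the example short.

```python
from collections import deque


class ThrottledSpout:
    """Emit tuples only while fewer than max_pending are unacknowledged."""

    def __init__(self, source, max_pending):
        self.source = deque(source)   # (msg_id, value) pairs to emit
        self.max_pending = max_pending
        self.pending = set()          # msg_ids emitted but not yet acked

    def next_tuple(self):
        """Return the next (msg_id, value), or None if throttled or empty."""
        if len(self.pending) >= self.max_pending or not self.source:
            return None
        msg_id, value = self.source.popleft()
        self.pending.add(msg_id)
        return (msg_id, value)

    def ack(self, msg_id):
        """Mark a tuple as fully processed, freeing a slot."""
        self.pending.discard(msg_id)
```

When downstream bolts slow down, acks arrive later, the pending set stays full, and the spout stops pulling from the source. The limit has to be tuned by hand, which is the contrast the comparison draws with Flink's automatic handling.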
  • 35. 35 Future of streaming - Cloud • Amazon Kinesis Streams • Functions as stream processors • Cloud Flow • Confluent Cloud • Event Hub – Kafka Compatible
  • 36. 36 Thank You! Yes, we are hiring! https://indiacareers.walmartlabs.com/