Introduction to Real-time data processing
Yogi Devendra
(yogidevendra@apache.org)
Agenda
● What is big data?
● Data at rest Vs Data in motion
● Batch processing Vs Real-time data processing (streaming)
● Examples
● When to use: Batch? Real-time?
● Current trends
Image ref [4]
Big data
Definition : big data
Big data is high-volume, high-velocity and/or
high-variety information assets that demand
cost-effective, innovative forms of information
processing that enable enhanced insight,
decision making, and process automation. [1]
Exploding sizes of datasets
● Google
○ >100 PB of data every day [3]
● Large Hadron Collider :
○ 150M sensors, each sampled 40M times per sec; about 600M collisions per sec
○ roughly 500 exabytes per day if all sensor data were recorded [2]
○ only about 0.0001% of the data is actually analysed
Questions
Image ref [16]
Data at rest Vs Data in motion
● At rest :
○ Dataset is fixed
○ a.k.a. bounded [15]
● In motion :
○ continuously incoming data
○ a.k.a. unbounded
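The bounded/unbounded distinction can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are made up for the example, and a random-number generator stands in for a real event source.

```python
import itertools
import random
from typing import Iterator, List


def bounded_dataset() -> List[int]:
    """Data at rest: a fixed collection whose size is known up front."""
    return [12, 7, 30, 5]  # hypothetical sample values


def unbounded_stream(seed: int = 0) -> Iterator[int]:
    """Data in motion: a generator that never ends on its own."""
    rng = random.Random(seed)
    while True:
        yield rng.randint(1, 100)


# A bounded dataset can be traversed completely, so a total is well defined.
data = bounded_dataset()
total_at_rest = sum(data)

# An unbounded stream has no end; we can only ever look at a finite prefix.
first_five = list(itertools.islice(unbounded_stream(), 5))
```

Calling `sum()` directly on `unbounded_stream()` would never return, which is one way to see why streams need a different processing style.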
Data at rest Vs Data in motion (continued)
● Generally, big data has velocity
○ data keeps arriving continuously
● The difference lies in when you analyze your data [5]
○ after the event occurs ⇒ at rest
○ as the event occurs ⇒ in motion
Examples
● Data at rest
○ Finding stats about a group of people in a closed room
○ Analyzing last month's sales data to make strategic decisions
● Data in motion
○ Finding stats about a group of runners during a marathon
○ e-commerce order processing
Questions
Image ref [16]
Batch processing
● Problem statement :
○ Process the entire dataset
○ Give the answer for X at the end
Batch processing : Use-cases
● Sales summary for the previous month [5]
● Model training for spam email detection
Batch processing : Characteristics
● Access to the entire dataset
● Data splits are decided at launch time
● Capable of complex analysis (e.g. model training) [6]
● Optimized for throughput (data processed per sec)
● Example frameworks : MapReduce, Apache Spark [6]
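A minimal sketch of the batch pattern, using the sales-summary use-case from the earlier slide. The `(region, amount)` record layout and the function name are assumptions made for illustration; a real job would run on a framework such as MapReduce or Spark rather than a single Python loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sale = Tuple[str, float]  # (region, amount); hypothetical record layout


def monthly_sales_summary(sales: Iterable[Sale]) -> Dict[str, float]:
    """Batch job: read every record first, then return one answer at the end."""
    totals: Dict[str, float] = defaultdict(float)
    for region, amount in sales:
        totals[region] += amount
    return dict(totals)


# The whole dataset exists before the job starts.
last_month = [("north", 120.0), ("south", 80.0), ("north", 30.5)]
summary = monthly_sales_summary(last_month)
```

No result is visible until the loop has seen the last record, which is the trade-off batch makes in exchange for throughput.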
Questions
Image ref [16]
Real-time data processing
● a.k.a. Stream processing
● Problem statement :
○ Process the incoming stream of data
○ Give the answer for X at this moment
Stream processing : Use-cases
● e-commerce order processing
● Credit card fraud detection
● Label a given email as spam or non-spam
Image ref [7]
Stream processing : Characteristics
● Results for X are based on the current data
● Computes a function on one record or a small window [6]
● Optimized for latency (avg. time taken per record)
Stream processing : Characteristics (continued)
● Computation must complete in near real-time
● Computes something relatively simple, e.g. using a pre-defined model to label a record
● Example frameworks : Apache Apex, Apache Storm
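A rough sketch of these characteristics, loosely modelled on the fraud-detection use-case. The threshold rule, the window size, and the function name are all assumptions chosen for illustration; they stand in for a real pre-defined model and are not how Apex or Storm expose their APIs.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def flag_unusual_amounts(
    amounts: Iterable[float], window_size: int = 3, factor: float = 3.0
) -> Iterator[Tuple[float, bool]]:
    """Emit (amount, is_suspicious) for each record as soon as it arrives.

    A record is flagged when it exceeds `factor` times the average of the
    previous `window_size` records. The rule is a stand-in for a real model.
    """
    window: Deque[float] = deque(maxlen=window_size)
    for amount in amounts:
        if len(window) == window_size:
            average = sum(window) / window_size
            suspicious = amount > factor * average
        else:
            suspicious = False  # not enough history yet
        yield amount, suspicious
        window.append(amount)  # oldest value drops out automatically


results = list(flag_unusual_amounts([10.0, 12.0, 11.0, 95.0, 13.0]))
```

Each record gets its answer immediately and the state kept between records is a small window, never the whole dataset.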
Questions
Image ref [16]
Batch Vs Streaming
pani puri ⇒ Streaming
image ref [9]
wada ⇒ batch
image ref [8]
Questions
Image ref [16]
Micro-batch
● Create batches of small size
● Process each micro-batch separately
● Example frameworks : Spark Streaming
pani puri ⇒ micro-batch
image ref [10]
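The micro-batch idea can be sketched with the standard library alone. The helper name `micro_batches` and the batch size are assumptions for illustration; Spark Streaming groups records by time interval rather than by count, so this shows only the general shape.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut an incoming sequence into small batches of at most `batch_size`."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


# Each micro-batch is processed separately, here by summing it.
batch_sums = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
```

Here the seven records are handled as three small batches, so results appear sooner than with one large batch but later than with per-record streaming.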
When to use : Batch Vs Streaming?
● Depends on use-case
○ Some are suitable for batch
○ Some are suitable for streaming
○ Some can be solved by either one
○ Some might need a combination of the two
When to use : Batch Vs Real-time? (continued)
● Answers for the current snapshot ⇒ Real-time
○ Answers at the end ⇒ Open (either can work)
● Complex calculations, multiple iterations over the entire data ⇒ Batch
○ Simple computations ⇒ Open
● Low latency requirements (< 1s) ⇒ Real-time
When to use : Batch Vs Real-time? (continued)
● Each record can be processed independently ⇒ Open
○ Independent processing not possible ⇒ Batch
● Depends on use-case
○ Some use-cases can be solved by either one
○ Others might need a combination of the two
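The rules of thumb from these slides can be written down as a small helper. This is one possible reading, not a definitive decision procedure: the function name, the parameter names, and the choice to return "combination" when the rules pull in both directions are assumptions made for the sketch.

```python
def suggest_mode(
    needs_current_snapshot: bool,
    needs_multiple_passes: bool,
    max_latency_seconds: float,
    records_independent: bool,
) -> str:
    """Return 'real-time', 'batch', 'combination', or 'open' (either works)."""
    # Current-snapshot answers or sub-second latency point to real-time.
    wants_streaming = needs_current_snapshot or max_latency_seconds < 1.0
    # Iterating over the entire data, or records that cannot be handled
    # independently, point to batch.
    wants_batch = needs_multiple_passes or not records_independent

    if wants_streaming and wants_batch:
        return "combination"
    if wants_streaming:
        return "real-time"
    if wants_batch:
        return "batch"
    return "open"


fraud_check = suggest_mode(True, False, 0.5, True)
model_training = suggest_mode(False, True, 3600.0, False)
simple_report = suggest_mode(False, False, 86400.0, True)
```

The three sample calls correspond to credit card fraud detection, spam model training, and a simple end-of-day report.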
Questions
Image ref [16]
Can one replace the other?
● Batch processing is designed for ‘data at rest’. ‘Data in motion’ becomes stale if it is processed in batch mode.
● Real-time processing is designed for ‘data in motion’, but it can also be used for ‘data at rest’ (in many cases).
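To illustrate the second point, a stored dataset can simply be replayed through a record-at-a-time function. The sketch below assumes a hypothetical `running_counts` helper and a tiny list of labels; it shows that the last streaming result matches the batch answer for this simple counting case.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def running_counts(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Streaming style: emit the up-to-date counts after every record."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        yield dict(counts)


stored_events = ["spam", "ham", "spam"]  # data at rest, hypothetical values

# Batch style: one answer computed over the entire dataset.
batch_answer = dict(Counter(stored_events))

# Replay the same data at rest through the streaming function.
*_, final_stream_answer = running_counts(stored_events)
```

The reverse does not hold as easily: a batch job over data in motion has to wait for a cut-off point, by which time earlier records may already be stale.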
Questions
Image ref [16]
Quiz : is this Batch or Real-time?
● Queue for a roller coaster ride (image ref [11])
● Queue at the petrol pump (image ref [12])
Quiz : is this Batch or Real-time?
● Selecting a relevant ad to show for the requested page (image ref [13])
● Courier dispatch from city A to B (image ref [14])
Questions
Image ref [16]
Current trends
● Difficulty in splitting problems into MapReduce steps is driving alternative paradigms for expressing user intent.
● More and more use-cases demand faster insight into data (near real-time).
● ‘Data in motion’ is common.
● ‘Real-time data processing’ is gaining traction.
Questions
Image ref [16]
References
1. Big Data | Gartner IT Glossary http://www.gartner.com/it-glossary/big-data/
2. Big Data | Wikipedia https://en.wikipedia.org/wiki/Big_data
3. Data size estimates | Follow the data https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
4. Data Never Sleeps 2.0 | Domo https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
5. Data in motion vs. data at rest | Internap http://www.internap.com/2013/06/20/data-in-motion-vs-data-at-rest/
6. Difference between batch processing and stream processing | Quora https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems/answer/Sean-Owen?srid=O9ht
7. How FAST is Credit Card Fraud Detection | FICO http://www.fico.com/en/latest-thinking/infographic/how-fast-is-credit-card-fraud-detection
8. CULINARY TERMS | panjakhada http://panjakhada.com/the-basics/
9. Crispy Chaat ... | grabhouse http://grabhouse.com/urbancocktail/11-crispy-chaat-joints-food-lovers-hyderabad/
10. Paani puri stall | cityshor http://www.cityshor.com/pune/food/street-food/camp/murali-paani-puri-stall/
11. Great Inventions: The Roller Coaster | findingdulcinea http://www.findingdulcinea.com/features/science/innovations/great-inventions/the-roller-coaster.html
12. RIL petrol pump network | economictimes http://articles.economictimes.indiatimes.com/2015-05-24/news/62583419_1_petrol-and-diesel-fuel-retailing-ril
13. Publishers | Propellerads https://propellerads.com/publishers/
14. Michael Bishop Couriers | Google plus https://plus.google.com/110684176517668223067
15. The world beyond batch: Streaming 101 http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
16. How to Answer the Question http://www.clipartpanda.com/clipart_images/how-to-answer-the-question-46954146
17. Thank You http://www.planwallpaper.com/thank-you
37

Más contenido relacionado

La actualidad más candente

Online analytical processing
Online analytical processingOnline analytical processing
Online analytical processingSamraiz Tejani
 
Cloud computing and Cloud Enabling Technologies
Cloud computing and Cloud Enabling TechnologiesCloud computing and Cloud Enabling Technologies
Cloud computing and Cloud Enabling TechnologiesAbdelkhalik Mosa
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream ProcessingZbigniew Jerzak
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Data Warehouse Design Considerations
Data Warehouse Design ConsiderationsData Warehouse Design Considerations
Data Warehouse Design ConsiderationsRam Kedem
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streamshktripathy
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
data warehouse , data mart, etl
data warehouse , data mart, etldata warehouse , data mart, etl
data warehouse , data mart, etlAashish Rathod
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeDatabricks
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data StreamsSujaAldrin
 

La actualidad más candente (20)

Online analytical processing
Online analytical processingOnline analytical processing
Online analytical processing
 
Cloud computing and Cloud Enabling Technologies
Cloud computing and Cloud Enabling TechnologiesCloud computing and Cloud Enabling Technologies
Cloud computing and Cloud Enabling Technologies
 
Temporal databases
Temporal databasesTemporal databases
Temporal databases
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Activity diagram
Activity diagramActivity diagram
Activity diagram
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Data Warehouse Design Considerations
Data Warehouse Design ConsiderationsData Warehouse Design Considerations
Data Warehouse Design Considerations
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
OLAP technology
OLAP technologyOLAP technology
OLAP technology
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
data warehouse , data mart, etl
data warehouse , data mart, etldata warehouse , data mart, etl
data warehouse , data mart, etl
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
 
Google BigTable
Google BigTableGoogle BigTable
Google BigTable
 
Tableau Presentation
Tableau PresentationTableau Presentation
Tableau Presentation
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
 

Similar a Introduction to Real-time data processing

Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingApache Apex
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleItai Yaffe
 
Approximate Queries and Graph Streams on Apache Flink - Theodore Vasiloudis -...
Approximate Queries and Graph Streams on Apache Flink - Theodore Vasiloudis -...Approximate Queries and Graph Streams on Apache Flink - Theodore Vasiloudis -...
Approximate Queries and Graph Streams on Apache Flink - Theodore Vasiloudis -...Seattle Apache Flink Meetup
 
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...Bowen Li
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
Big Query - Women Techmarkers (Ukraine - March 2014)
Big Query - Women Techmarkers (Ukraine - March 2014)Big Query - Women Techmarkers (Ukraine - March 2014)
Big Query - Women Techmarkers (Ukraine - March 2014)Ido Green
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Stavros Kontopoulos
 
E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres Regunath B
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBencht_ivanov
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...DataBench
 
The Connected Data Imperative: Why Graphs? at Neo4j GraphDay New York City
The Connected Data Imperative: Why Graphs? at Neo4j GraphDay New York CityThe Connected Data Imperative: Why Graphs? at Neo4j GraphDay New York City
The Connected Data Imperative: Why Graphs? at Neo4j GraphDay New York CityNeo4j
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 

Similar a Introduction to Real-time data processing (20)

Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data Processing
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
Druid
DruidDruid
Druid
 
Approximate Queries and Graph Streams on Apache Flink - Theodore Vasiloudis -...
Approximate Queries and Graph Streams on Apache Flink - Theodore Vasiloudis -...Approximate Queries and Graph Streams on Apache Flink - Theodore Vasiloudis -...
Approximate Queries and Graph Streams on Apache Flink - Theodore Vasiloudis -...
 
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Big Query - Women Techmarkers (Ukraine - March 2014)
Big Query - Women Techmarkers (Ukraine - March 2014)Big Query - Women Techmarkers (Ukraine - March 2014)
Big Query - Women Techmarkers (Ukraine - March 2014)
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBench
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
 
The Connected Data Imperative: Why Graphs? at Neo4j GraphDay New York City
The Connected Data Imperative: Why Graphs? at Neo4j GraphDay New York CityThe Connected Data Imperative: Why Graphs? at Neo4j GraphDay New York City
The Connected Data Imperative: Why Graphs? at Neo4j GraphDay New York City
 
Workflow Engines + Luigi
Workflow Engines + LuigiWorkflow Engines + Luigi
Workflow Engines + Luigi
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 

Último

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Último (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Introduction to Real-time data processing

  • 1. Introduction to Real-time data processing Yogi Devendra (yogidevendra@apache.org)
  • 2. Agenda ● What is big data? ● Data at rest Vs Data in motion ● Batch processing Vs Real - time data processing (streaming) ● Examples ● When to use: Batch? Real-time? ● Current trends 2
  • 4. Definition : big data Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. [1] 4
  • 5. Exploding sizes of datasets 5 ● Google ○ >100PB data everyday [3] ● Large Hydron collidor : ○ 150M sensors x 40M sample per sec x 600 M collisions per sec ○ >500 exabytes per day [2] ○ 0.0001% of data is actually analysed
  • 7. Data at rest Vs Data in motion ● At rest : ○ Dataset is fixed ○ a.k.a bounded [15] ● In motion : ○ continuously incoming data ○ a.k.a unbounded 7
  • 8. Data at rest Vs Data in motion (continued) ● Generally Big data has velocity ○ continuous data ● Difference lies in when are you analyzing your data? [5] ○ after the event occurs ⇒ at rest ○ as the event occurs ⇒ in motion 8
  • 9. Examples ● Data at rest ○ Finding stats about group in a closed room ○ Analyzing sales data for last month to make strategic decisions ● Data in motion ○ Finding stats about group in a marathon ○ e-commerce order processing 9
  • 11. Batch processing ● Problem statement : ○ Process this entire data ○ give answer for X at the end. 11
  • 12. Batch processing : Use-cases 12 ● Sales summary for the previous month[5] ● Model training for Spam emails
  • 13. Batch processing : Characteristics 13 ● Access to entire data ● Split decided at the launch time. ● Capable of doing complex analysis (e.g. Model training) [6] ● Optimize for Throughput (data processed per sec) ● Example frameworks : Map Reduce, Apache Spark [6]
  • 15. Real time data processing ● a.k.a. Stream processing ● Problem statement : ○ Process incoming stream of data ○ to give answer for X at this moment. 15
  • 16. Stream processing : Use-cases ● e-commerce order processing ● Credit card fraud detection ● Label given email as : spam vs non- spam 16
  • 18. Stream processing : Characteristics ● Results for X are based on the current data ● Computes function on one record or smaller window. [6] ● Optimizations for latency (avg. time taken for a record) 18
  • 19. Stream processing : Characteristics (continued) ● Needs to complete computation in near real-time ● Computes something relatively simple, e.g. using a pre-defined model to label a record. ● Example frameworks: Apache Apex, Apache Storm
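As a rough sketch of these stream-processing characteristics, the Python generator below labels each record as it arrives, using only a small sliding window of recent records. The rule itself (flag an amount more than `factor` times the recent average) is a hypothetical stand-in for a pre-defined model, and the names `label_stream`, `window_size`, and `factor` are assumptions for illustration.

```python
from collections import deque

def label_stream(amounts, window_size=3, factor=3.0):
    """Yield (amount, label) for each record as it arrives."""
    window = deque(maxlen=window_size)       # only the most recent records are kept
    for amount in amounts:                   # works for unbounded iterators too
        baseline = sum(window) / len(window) if window else amount
        label = "suspicious" if amount > factor * baseline else "ok"
        yield amount, label                  # the answer is available at this moment
        window.append(amount)

# Hypothetical card transactions arriving one at a time.
events = iter([10.0, 12.0, 11.0, 100.0, 9.0])
labels = [label for _, label in label_stream(events)]
print(labels)
```

Each record costs a small, bounded amount of work, which is what keeps per-record latency low; the trade-off is that the function never sees the full dataset.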
  • 22. Batch Vs Streaming ● pani puri ⇒ streaming (image ref [9]) ● wada ⇒ batch (image ref [8])
  • 24. Micro-batch ● Create batches of small size ● Process each micro-batch separately ● Example frameworks: Spark Streaming ● pani puri ⇒ micro-batch (image ref [10])
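The micro-batch idea can be sketched in plain Python as follows. This is an illustration only: it groups records by count, whereas Spark Streaming forms its micro-batches by time interval. The names `micro_batches` and `process` are hypothetical.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Cut an incoming stream into small batches of at most batch_size records."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:            # stream exhausted (never happens if unbounded)
            return
        yield batch

def process(batch):
    """Per-batch computation; a sum stands in for real work."""
    return sum(batch)

# Each micro-batch is processed separately, as soon as it is complete.
results = [process(b) for b in micro_batches(range(1, 8), 3)]
print(results)
```

Latency here is bounded below by the time needed to fill a batch, which is why micro-batching sits between pure batch and record-at-a-time streaming.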
  • 25. When to use : Batch Vs Streaming? ● Depends on use-case ○ Some are suitable for batch ○ Some are suitable for streaming ○ Some can be solved by either one ○ Some might need a combination of the two.
  • 26. When to use : Batch Vs Real time? (continued) ● Answers for current snapshot ⇒ Real-time ○ Answers at the end ⇒ Open ● Complex calculations, multiple iterations over entire data ⇒ Batch ○ Simple computations ⇒ Open ● Low latency requirements (< 1s) ⇒ Real-time
  • 27. When to use : Batch Vs Real time? (continued) ● Each record can be processed independently ⇒ Open ○ Independent processing not possible ⇒ Batch ● Depends on use-case ○ Some use-cases can be solved by either one ○ Others might need a combination of the two.
  • 29. Can one replace the other? ● Batch processing is designed for ‘data at rest’. ‘Data in motion’ becomes stale if processed in batch mode. ● Real-time processing is designed for ‘data in motion’, but it can be used for ‘data at rest’ as well (in many cases).
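One way to see why a streaming computation can often be reused on data at rest is the small Python sketch below. The streaming operator `running_mean` (a hypothetical name) emits the current answer after every record; when fed a bounded dataset, its last emitted value matches what a batch computation would return. The sample values are made up for illustration.

```python
def running_mean(records):
    """Streaming operator: emit the mean of everything seen so far, per record."""
    count, total = 0, 0.0
    for value in records:
        count += 1
        total += value
        yield total / count

# Data at rest: a fixed, bounded dataset.
data_at_rest = [4.0, 8.0, 6.0]

batch_answer = sum(data_at_rest) / len(data_at_rest)   # answer at the end
stream_answers = list(running_mean(data_at_rest))      # answer at every moment

print(batch_answer, stream_answers)
```

The reverse does not hold as neatly: the batch expression needs `len(data_at_rest)`, which an unbounded stream never has.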
  • 31. Quiz : is this Batch or Real-time? ● Queue for roller coaster ride (image ref [11]) ● Queue at the petrol pump (image ref [12])
  • 32. Quiz : is this Batch or Real-time? ● Selecting relevant ad to show for requested page (image ref [13]) ● Courier dispatch from city A to B (image ref [14])
  • 34. Current trends ● Difficulty in expressing problems as MapReduce : alternative paradigms for expressing user intent. ● More and more use-cases demand faster insight into data (near real-time) ● ‘Data in motion’ is common. ● ‘Real-time data processing’ is gaining traction.
  • 37. References
    1. Big Data | Gartner IT Glossary http://www.gartner.com/it-glossary/big-data/
    2. Big Data | Wikipedia https://en.wikipedia.org/wiki/Big_data
    3. Data size estimates | Follow the data https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
    4. Data Never Sleeps 2.0 | Domo https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
    5. Data in motion vs. data at rest | Internap http://www.internap.com/2013/06/20/data-in-motion-vs-data-at-rest/
    6. Difference between batch processing and stream processing | Quora https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems/answer/Sean-Owen?srid=O9ht
    7. How FAST is Credit Card Fraud Detection | FICO http://www.fico.com/en/latest-thinking/infographic/how-fast-is-credit-card-fraud-detection
    8. CULINARY TERMS | panjakhada http://panjakhada.com/the-basics/
    9. Crispy Chaat ... | grabhouse http://grabhouse.com/urbancocktail/11-crispy-chaat-joints-food-lovers-hyderabad/
    10. Paani puri stall | cityshor http://www.cityshor.com/pune/food/street-food/camp/murali-paani-puri-stall/
    11. Great Inventions: The Roller Coaster | findingdulcinea http://www.findingdulcinea.com/features/science/innovations/great-inventions/the-roller-coaster.html
    12. RIL petrol pump network | economictimes http://articles.economictimes.indiatimes.com/2015-05-24/news/62583419_1_petrol-and-diesel-fuel-retailing-ril
    13. Publishers | Propellerads https://propellerads.com/publishers/
    14. Michael Bishop Couriers | Google plus https://plus.google.com/110684176517668223067
    15. The world beyond batch: Streaming 101 http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
    16. How to Answer the Question http://www.clipartpanda.com/clipart_images/how-to-answer-the-question-46954146
    17. Thank You http://www.planwallpaper.com/thank-you