When ordering matters
and you have 100K events/sec
Niels Basjes | Corporate Inventor
• Shopping
• Conversation
• Measuring
• Cause and effect
• End-to-end pipeline
• State machines
Agenda
ir. drs. Niels Basjes
TU-Delft Computer Science (MSc)
Nyenrode Business School (MSc)
Software Developer, Researcher,
Consultant, Infra/Web/IT Architect
Corporate Inventor
Contributor for Apache Hadoop,
Pig, HBase, Flink, Beam, Parquet, …
Apache Avro Committer & PMC
Speaker & Teacher at
Universities and Conferences
@nielsbasjes
https://github.com/nielsbasjes
• Listen
• Understand
• Process
• Respond
Communicate
Communication
requires
“Remembering”
[Diagram: Observe, Understand, Process, Respond]
I want something to play with
Going online …
22.085.386
~ 22 million
products for sale
> 10 million
active customers
> 8000 million
pageviews/year
Season 2017
~16.000.000 presents
To have an online conversation
we need
• Accurate observations
• Don’t miss any response
• Scalable low latency ‘loop’
• Quick enough
• 100K events/second
• State per session
• Remember per visitor
• Ordering guarantees
• Give the right response
[Diagram: Observe, Understand, Process, Respond]
Observe
Understand
Process
Respond
Online conversation M2
Measuring 2.0
U2
Understanding 2.0
Creating
content
Streaming
DataScience
Why M2?
We already have
Omniture/GA…
Measuring interaction
• What are we showing (and why)?
• How are our visitors responding?
• Pages
• Products/Offers
• Add to cart
• Purchase
• Advertising
• Inspiration
Use cases
• Personalization
• Site optimization
• Fraud prevention
• Dashboards
• Attribution modelling
• Data Science
• …
~ 3K-4K pages/sec
~ 30 events/page
~ 100K events/sec
Q3 2019: Per day
• 1 500 000 000
measurements
• 3TiB data
‘All’ details
Why
Where
What
x
JavaScript data is …
• Broken … fundamentally broken
• Measuring a side effect
• Missing & Duplicate orders
• Blockable
• Intelligent Tracking Protection
• Ad blockers
• Spiders
• Hackers
• Fragile
• HTTP 414: URI Too long
• Locked in
• “Boxed” SaaS solutions.
Measurements are…
too old.
• Available once every 24 hours.
• So personalization is a ‘day behind’.
Useless inspiration:
I was interested in this YESTERDAY
Data relevance decay
[Chart: value of the data against age of the data (minutes, days, weeks)]
Time is not always important
Crowd pattern analysis of website usage
Building a better website for future visitors: (Micro)Batch
Individual pattern analysis of website usage
Supporting and advising the current visitor: Realtime
Batch processing
Stream processing
Measure
accurately
Where to measure?
• Measure where “it” happens.
• “In” the responsible “frontend” service!
• Webshop
• App API
• Basket Service
• Order Service
• …
• Usually NOT in the browser
Measure pages
• Server side
• What is in the page
• Client side (JavaScript)
• What part was viewed
• Screen resolution
Measure orders
• Listen to Order events!
• with website/app sessionid.
• The “Order confirmation” page.
• Just a viewing of the order.
Record everything at the start
• Measure what “really” happens.
• Keep all relevant details
• Product: ProductId, Product type, …
• Offer: OfferId, ProductId, Price, Condition, SellerId, …
• Later joining on productid/offerid is “impossible”.
• Webshop caching
• Data volume / Extra latency
Ordering
• All single event entities
• No correlations
• No logical ‘cause and effect’
• No conversation
This is easy
Our use cases
• Banner optimization
• Look / Search → Next page better banner
• A/B testing
• Show feature → Use → Buy product
• Search Suggestions
• Search → Find → Choose → Buy product
• Attribution modeling
• Show ad → Click → Buy product
• …
Behavioral analytics
• Cause and effect
• Action: We show something
• Reaction: To click or not to click
Event ordering matters
• Click banner, Buy product
• Banner WAS (possibly) part of reason to buy.
• Buy product, Click banner
• Banner WAS NOT part of reason to buy.
“WAS” or “WAS NOT” is based on
the ordering of the events
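The "WAS" or "WAS NOT" decision can be sketched in a few lines of Java. This is only an illustration: the event names `CLICK_BANNER` and `BUY_PRODUCT` and the class `BannerAttribution` are hypothetical and not taken from the talk. The method assumes the list is already in the order in which the events happened.

```java
import java.util.List;

final class BannerAttribution {
    /**
     * Returns true if a banner click was seen before the first purchase,
     * i.e. the banner WAS (possibly) part of the reason to buy.
     * Assumption: the list is in the order in which the events happened.
     */
    static boolean bannerPossiblyCausedPurchase(List<ShopEvent> orderedEvents) {
        boolean bannerClicked = false;
        for (ShopEvent event : orderedEvents) {
            if (event == ShopEvent.CLICK_BANNER) {
                bannerClicked = true;
            } else if (event == ShopEvent.BUY_PRODUCT) {
                return bannerClicked; // decided at the moment of purchase
            }
        }
        return false; // no purchase at all
    }

    public static void main(String[] args) {
        System.out.println(bannerPossiblyCausedPurchase(
                List.of(ShopEvent.CLICK_BANNER, ShopEvent.BUY_PRODUCT)));
        System.out.println(bannerPossiblyCausedPurchase(
                List.of(ShopEvent.BUY_PRODUCT, ShopEvent.CLICK_BANNER)));
    }
}

/** Hypothetical event types, for illustration only. */
enum ShopEvent { CLICK_BANNER, BUY_PRODUCT }
```

The same two events give opposite answers depending only on their order, which is why a reordering anywhere in the pipeline changes the conclusion.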
Finite State machine
• Simple, low latency, pattern detection
• Ordered events
Pushdown automaton
• State machine
• with a memory stack
• Simple, low latency,
pattern detection
• Ordered events
Event ordering matters
• A fast temperature change is dangerous
• should alert IMMEDIATELY
• Delta stays in bounds
• Expect “Ordered”
prev = null;
while ((curr = newEvent()) != null) {
    // Alert when the jump from the previous event is too big.
    if (prev != null && tooBig(curr, prev))
        sendAlert();
    prev = curr;
}
[Chart: Temperature and Delta for events arriving in order, T1 to T9]
This is a simple
pushdown automaton
Event ordering matters
• A fast temperature change is dangerous
• should alert IMMEDIATELY
Ordering problems
• Many false positives !
• Many false negatives !
[Chart: Temperature and Delta for events arriving as T1 T5 T7 T4 T2 T8 T3 T6 T9, with five points marked "!"]
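A small sketch of how the same delta check misbehaves once events arrive out of order. The temperature values and thresholds are made up for illustration, and `DeltaAlerter` is a hypothetical name.

```java
final class DeltaAlerter {
    /** Counts alerts: one alert whenever two consecutive values differ by more than maxDelta. */
    static int countAlerts(double[] valuesInArrivalOrder, double maxDelta) {
        int alerts = 0;
        for (int i = 1; i < valuesInArrivalOrder.length; i++) {
            double delta = Math.abs(valuesInArrivalOrder[i] - valuesInArrivalOrder[i - 1]);
            if (delta > maxDelta) {
                alerts++;
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        // A slow, safe rise: no alert when in order, alerts when shuffled (false positives).
        System.out.println(countAlerts(new double[] {20, 22, 24, 26, 28}, 5));
        System.out.println(countAlerts(new double[] {20, 28, 22, 26, 24}, 5));
        // A real jump of 20 degrees: alert when in order, hidden when shuffled (false negative).
        System.out.println(countAlerts(new double[] {20, 40, 30}, 15));
        System.out.println(countAlerts(new double[] {20, 30, 40}, 15));
    }
}
```

The detector itself is unchanged between the runs; only the arrival order differs.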
Repairing event ordering
• Is hard
• Needless complexity
• Takes time
• Buffer for the maximum ‘out-of-orderness’ period.
• Several minutes
• We want really low latency
[Figure: sliding time-based sort buffer over events 1 to 9]
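A minimal sketch of such a sliding time-based sort buffer, to make the latency cost concrete. The names `SortBuffer` and `TimedEvent` and the millisecond timestamps are assumptions for this example. An event is only released once a newer event has been seen that is at least `maxOutOfOrdernessMillis` later, so every event is delayed by roughly that period.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class SortBuffer {
    private final long maxOutOfOrdernessMillis;
    private final PriorityQueue<TimedEvent> buffer =
            new PriorityQueue<>(Comparator.comparingLong((TimedEvent e) -> e.timestampMillis));
    private long highestTimestampSeen = Long.MIN_VALUE;

    SortBuffer(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    /** Buffers the event and returns, in timestamp order, all events that are now old enough. */
    List<TimedEvent> add(TimedEvent event) {
        buffer.add(event);
        highestTimestampSeen = Math.max(highestTimestampSeen, event.timestampMillis);
        long releaseUpTo = highestTimestampSeen - maxOutOfOrdernessMillis;
        List<TimedEvent> released = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestampMillis <= releaseUpTo) {
            released.add(buffer.poll());
        }
        return released;
    }

    public static void main(String[] args) {
        SortBuffer sorter = new SortBuffer(10);
        sorter.add(new TimedEvent(100, "a"));
        sorter.add(new TimedEvent(105, "c"));
        sorter.add(new TimedEvent(102, "b"));
        for (TimedEvent e : sorter.add(new TimedEvent(120, "d"))) {
            System.out.println(e.timestampMillis + " " + e.payload);
        }
    }
}

/** Hypothetical event type: a timestamp plus an opaque payload. */
final class TimedEvent {
    final long timestampMillis;
    final String payload;

    TimedEvent(long timestampMillis, String payload) {
        this.timestampMillis = timestampMillis;
        this.payload = payload;
    }
}
```

An event that arrives later than the configured bound is still emitted out of order, so the bound has to cover the worst case, which is exactly what makes this approach slow.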
Exactly once please
• At least once
• Need data deduplication 
• Is hard
• Large memory buffer
• Idempotent output
• Takes time
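To illustrate why deduplication needs a memory buffer, here is a sketch that remembers only the most recent event ids. It assumes every event carries a unique id, which the slides do not state, and `RecentIdDeduplicator` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RecentIdDeduplicator {
    private final Map<String, Boolean> recentIds;

    RecentIdDeduplicator(final int capacity) {
        // Insertion-ordered map that forgets the oldest id once capacity is exceeded.
        this.recentIds = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time an id is seen within the remembered window, false for a duplicate. */
    boolean firstTimeSeen(String eventId) {
        return recentIds.put(eventId, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        RecentIdDeduplicator dedup = new RecentIdDeduplicator(2);
        System.out.println(dedup.firstTimeSeen("e1"));
        System.out.println(dedup.firstTimeSeen("e1"));
    }
}
```

A duplicate that arrives after its id has been evicted is not detected, so the buffer must be sized for the longest possible redelivery delay.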
So we try to
never damage
the ordering
Achieving end-to-end ordering...
1. The measuring point
2. Measurement transport
3. Measurement processing
The measuring point
• Single entity
• → Single measuring instance
• Multiple instances
• Multiple output buffers
• Race conditions
• Ordering problems
The measuring point
In IoT:
• One temperature sensor
• one recording device
At bol.com
• One visitor
• Single webshop instance
• Session routing is a MUST have!!
• Not perfect!
• Impact negligible
• “View” measurements
• Orders
Transport
Message transport
• We need ordering per session: FIFO
• “Queue” or “Partitioned Queue”
• Session pinned to a specific partition
https://en.wikipedia.org/wiki/Queue_(abstract_data_type)
Partitioned Queue
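The idea of pinning a session to a partition can be sketched with an in-memory stand-in. `PartitionedQueueSketch` is a hypothetical class, not a real client API, and it uses `String.hashCode()` for simplicity; real systems such as Kafka use their own hash function on the message key.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class PartitionedQueueSketch {
    private final List<Deque<String>> partitions = new ArrayList<>();

    PartitionedQueueSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayDeque<>());
        }
    }

    /** The same session id always maps to the same partition. */
    int partitionFor(String sessionId) {
        return Math.floorMod(sessionId.hashCode(), partitions.size());
    }

    void publish(String sessionId, String event) {
        partitions.get(partitionFor(sessionId)).addLast(event);
    }

    /** FIFO read from one partition; returns null when the partition is empty. */
    String poll(int partition) {
        return partitions.get(partition).pollFirst();
    }

    public static void main(String[] args) {
        PartitionedQueueSketch queue = new PartitionedQueueSketch(4);
        queue.publish("session-42", "search");
        queue.publish("session-42", "add-to-cart");
        int partition = queue.partitionFor("session-42");
        System.out.println(queue.poll(partition));
        System.out.println(queue.poll(partition));
    }
}
```

Ordering is only guaranteed within one partition, which is enough here because all events of a session live in the same partition.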
Many “Queues” are not a Queue!
https://en.wikipedia.org/wiki/Java_Message_Service
https://stackoverflow.com/questions/16300353/activemq-lifo-ordering
Many “Queues” are not a Queue!
https://cloud.google.com/pubsub/docs/ordering
High volume partitioned queues
• Apache Kafka
• https://kafka.apache.org/
• Production ready
• Apache Pulsar
• https://pulsar.apache.org/
• Connector for Flink very new.
• Pravega
• http://pravega.io/
• Being beta tested
• MapR Event Store
• https://mapr.com/products/mapr-eventstore/
• Production ready
• Amazon Kinesis
• Sorry, wrong cloud
• Microsoft Azure Event Hubs
• Sorry, wrong cloud
Use
Google PubSub
If ordering
does not matter
at-least-once
is Ok.
High IO
‘Distributed Set’
If ordering
matters
and/or
exactly-once
is needed
High IO
‘Partitioned Queue’
Use
Apache Kafka
Processing
Measurement processing
Requirements
• Low latency
• Exactly once
• Ordering guarantees
• A pushdown automaton per
session
• Keyed Stateful processing
• Where the ‘key’ is the ‘session id’
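Keyed stateful processing can be sketched in plain Java as a map from session id to that session's state. This is not Flink code; `KeyedStateProcessor` is a hypothetical class that only illustrates the concept, which a stream processor provides in a distributed and fault-tolerant form.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

final class KeyedStateProcessor<S, E> {
    private final Map<String, S> stateByKey = new HashMap<>();
    private final S initialState;
    private final BiFunction<S, E, S> transition;

    KeyedStateProcessor(S initialState, BiFunction<S, E, S> transition) {
        this.initialState = initialState;
        this.transition = transition;
    }

    /** Applies the event to the state of this key only and returns the new state. */
    S process(String key, E event) {
        S current = stateByKey.getOrDefault(key, initialState);
        S next = transition.apply(current, event);
        stateByKey.put(key, next);
        return next;
    }

    public static void main(String[] args) {
        // State per session: the number of events seen so far.
        KeyedStateProcessor<Integer, String> counter =
                new KeyedStateProcessor<>(0, (count, event) -> count + 1);
        System.out.println(counter.process("session-A", "view"));
        System.out.println(counter.process("session-B", "view"));
        System.out.println(counter.process("session-A", "click"));
    }
}
```

Interleaved events from different sessions do not disturb each other, because each session id has its own state.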
[Diagram: Observe, Understand, Process, Respond]
Choosing a Processing toolkit
Apache Beam
• Low latency … except
• Exactly once by deduplication
• NO ordering guarantees
• NO natural keyed stateful processing
• “Dynamic” scaling
• Abstract Java API
• Runs on
• DataFlow
• Flink
Choosing a Processing toolkit
Apache Flink
• Low latency
• Exactly once
• Ordering guarantees (Chandy–Lamport)
• Keyed Stateful processing
• “Fixed” scaling
• Easy Java API
• Runs on
• Hadoop
• Kubernetes
Changes
happen
Applications change!
• New business
• New insights
• New wishes
• New scope
• New …
The records will
• → get new fields
• → have obsolete fields
[Diagram: several data producers write to a streaming interface that is read by several data consumers]
The real payload is
“byte array”
Multiple Applications
Rolling upgrades
Canary releases
Kafka persists messages
• A message is retained until its TTL has expired.
• So a topic will contain several message versions!
• With different fields
[Figure: a topic containing messages of versions V1, V2, V3 and V4]
So we need something to
• Serialize records into bytes
• Data types
• Nested records
• Bidirectional Schema evolution
Apache Avro
is what we use !
Apache Avro (IDL Schema)
Code generation
Avro Message format
• Single record into bytes encoding
• Designed for evolving streaming applications
• Need schema database:
• Key = 64-bit long
• Value = String (the JSON version of the schema)
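A sketch of the schema database idea, assuming the slide refers to Avro's single-object encoding: a two-byte marker (`C3 01`), the 8-byte little-endian schema fingerprint, then the encoded record. `SchemaStoreSketch` is a hypothetical class; the fingerprint is passed in as a plain number here rather than computed from the schema, and the record bytes are treated as opaque.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;

final class SchemaStoreSketch {
    private static final byte[] MARKER = {(byte) 0xC3, (byte) 0x01};
    private final Map<Long, String> schemaJsonByFingerprint = new HashMap<>();

    void register(long fingerprint, String schemaJson) {
        schemaJsonByFingerprint.put(fingerprint, schemaJson);
    }

    /** Prefixes the encoded record with the marker and the writer schema fingerprint. */
    static byte[] frame(long fingerprint, byte[] encodedRecord) {
        return ByteBuffer.allocate(MARKER.length + Long.BYTES + encodedRecord.length)
                .order(ByteOrder.LITTLE_ENDIAN)
                .put(MARKER)
                .putLong(fingerprint)
                .put(encodedRecord)
                .array();
    }

    /** Looks up the schema JSON the message was written with. */
    String writerSchemaFor(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.remaining() < MARKER.length + Long.BYTES
                || buf.get() != MARKER[0] || buf.get() != MARKER[1]) {
            throw new IllegalArgumentException("Not a framed message");
        }
        String schemaJson = schemaJsonByFingerprint.get(buf.getLong());
        if (schemaJson == null) {
            throw new IllegalStateException("Unknown schema fingerprint");
        }
        return schemaJson;
    }

    public static void main(String[] args) {
        SchemaStoreSketch store = new SchemaStoreSketch();
        store.register(42L, "{\"type\":\"string\"}");
        byte[] message = frame(42L, new byte[] {1, 2, 3});
        System.out.println(store.writerSchemaFor(message));
    }
}
```

Because every message names the schema it was written with, a consumer can read V1 to V4 records from the same topic as long as it can look up each writer schema.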
Producing from Flink into Kafka
Consume from Kafka into Flink
Consume from Kafka into Flink
Using
the data
Search Suggestions
The search funnel
Search
Find
(Product page)
Choose
(Add to cart)
Buy
Deeper = more relevant
(Search)
Find
Choose
Buy
High level
M2
Stateful
analysis
Relevant events
Suggestions
Event to
scored
suggestions
Suggestion Delta
Is this a valuable event?
Pushdown Automaton
1. Input:
“Harry Potter” Product page
2. Score:
“Harry Potter” 25
3. Suggest:
“H” → “Harry Potter”: 25
“Ha” → “Harry Potter”: 25
“Har” → “Harry Potter”: 25
Ordering is important
Ordering is NOT important
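The step from a scored event to per-prefix suggestions can be sketched as follows. It is an assumption of this example that score deltas are simply added up; addition is commutative, which would explain why ordering no longer matters at this stage. `SuggestionIndex` and `maxPrefixLength` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class SuggestionIndex {
    private final Map<String, Map<String, Integer>> scoresByPrefix = new HashMap<>();

    /** Adds scoreDelta to the suggestion under each of its prefixes up to maxPrefixLength. */
    void applyDelta(String suggestion, int scoreDelta, int maxPrefixLength) {
        int limit = Math.min(suggestion.length(), maxPrefixLength);
        for (int length = 1; length <= limit; length++) {
            String prefix = suggestion.substring(0, length);
            scoresByPrefix
                    .computeIfAbsent(prefix, p -> new HashMap<>())
                    .merge(suggestion, scoreDelta, Integer::sum);
        }
    }

    /** Current score of a suggestion for a typed prefix, 0 if unknown. */
    int score(String prefix, String suggestion) {
        Map<String, Integer> scores = scoresByPrefix.get(prefix);
        return scores == null ? 0 : scores.getOrDefault(suggestion, 0);
    }

    public static void main(String[] args) {
        SuggestionIndex index = new SuggestionIndex();
        index.applyDelta("Harry Potter", 25, 3);
        System.out.println(index.score("H", "Harry Potter"));
        System.out.println(index.score("Har", "Harry Potter"));
    }
}
```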
Search Suggestions
Default
no search
Searched
Found
Chosen
Initial
Search
for “X”
To PDP
Add to Cart
Within PDP
Anything else
The PDP is really a set of
pages about the product,
offers, reviews, …
= Send out event that something useful happened
Searched
for “X”
PDP for “X”
ATC for “X”
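One possible reading of the flattened state diagram above, as a per-session state machine in Java. The transition table is an interpretation, not confirmed by the slides: a search leads to Searched, going to the PDP leads to Found, staying within the PDP keeps Found, add to cart leads to Chosen, and anything else resets to Default. The names `SearchFunnel`, `FunnelState` and `FunnelAction` are hypothetical.

```java
final class SearchFunnel {
    private FunnelState state = FunnelState.DEFAULT;
    private String query; // the remembered search term "X"

    /** Feeds one action in order; returns the emitted event text, or null if nothing useful happened. */
    String onAction(FunnelAction action, String searchTerm) {
        switch (action) {
            case SEARCH:
                state = FunnelState.SEARCHED;
                query = searchTerm;
                return "Searched for " + query;
            case TO_PDP:
                if (state == FunnelState.SEARCHED) {
                    state = FunnelState.FOUND;
                    return "PDP for " + query;
                }
                break;
            case WITHIN_PDP:
                if (state == FunnelState.FOUND) {
                    return null; // still browsing the pages about the product
                }
                break;
            case ADD_TO_CART:
                if (state == FunnelState.FOUND) {
                    state = FunnelState.CHOSEN;
                    return "ATC for " + query;
                }
                break;
            default:
                break;
        }
        // Anything else: forget the search and start over.
        state = FunnelState.DEFAULT;
        query = null;
        return null;
    }

    FunnelState state() {
        return state;
    }

    public static void main(String[] args) {
        SearchFunnel funnel = new SearchFunnel();
        System.out.println(funnel.onAction(FunnelAction.SEARCH, "harry potter"));
        System.out.println(funnel.onAction(FunnelAction.TO_PDP, null));
        System.out.println(funnel.onAction(FunnelAction.ADD_TO_CART, null));
    }
}

enum FunnelState { DEFAULT, SEARCHED, FOUND, CHOSEN }

enum FunnelAction { SEARCH, TO_PDP, WITHIN_PDP, ADD_TO_CART, OTHER }
```

If the add to cart arrives before the search because of a reordering upstream, the machine emits nothing for it, so the suggestion is never scored.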
Recommended reading
Join us
100
https://careers.bol.com
@nielsbasjes
https://www.linkedin.com/in/nielsbasjes/
https://niels.basj.es/
https://github.com/nielsbasjes
ir. drs. Niels
Basjes
Thanks for
listening


When ordering Matters - Flink Forward EU - Berlin - 2019


Editor's notes

  1. Just storing the globalid/offerid and later joining is impossible due to the size of the datasets, the required speed and caching.
  2. https://blog.scottlogic.com/2018/04/17/comparing-big-data-messaging.html
  3. Choice: NO product suggestions