SlideShare una empresa de Scribd logo
1 de 68
Descargar para leer sin conexión
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
John Bennett
Senior Software Engineer, Netflix
The Visible Network
How Netflix Uses Kinesis Streams to Monitor
Applications and Analyze Billions of Traffic Flows
Allan MacInnis
Senior Solutions Architect, Amazon
July 27, 2017
What is your decision latency?
OR
The value of data
Recent data is valuable
If you act on it in real time
Capture all of the value
from your data
Amazon Kinesis
Streams
Custom real time
processing
Amazon Kinesis
Firehose
Load and transform your
data
Amazon Kinesis
Analytics
Easily analyze data
streams using standard
SQL queries
Amazon Kinesis: Streaming data made easy
Capture all of the value from your data with Amazon Kinesis
Amazon
S3
Amazon
Kinesis
Analytics
Amazon Kinesis–
enabled appAmazon Kinesis
Streams
Ingest Process React Persist
AWS
Lambda
0ms 200ms 1-2 s
Amazon
QuickSight
Amazon Kinesis
Firehose
Amazon
Redshift
Amazon Kinesis customer base diversity
1 billion events/wk from
connected devices | IoT
17 PB of game data per
season | Entertainment
80 billion ad impressions/day,
30 ms response time | Ad
Tech
100 GB/day click streams
from 250+ sites | Enterprise
50 billion ad impressions/day
sub-50 ms responses | Ad
Tech
10 million events/day
| Retail
Amazon Kinesis as databus
Migrated from Kafka to Kinesis |
Enterprise
Funnel all production
events through
Amazon Kinesis |
High Tech
Why are these customers choosing Amazon Kinesis?
Lower costs
Performant without
heavy lifting
Scales elastically
Increased agility
Secure and visible
Plug and play
Amazon Kinesis customer base diversity
Netflix Uses Kinesis Streams
to Analyze Billions of Network
Traffic Flows in Real-Time
What is Netflix’s decision latency?
● Motivation
● Data processing patterns
● Using Kinesis
● Results
Agenda
● 104 million customers
● Over 190 countries
● 37% of Internet traffic
● 125 million hours of video
Netflix is big.
● Dozens of accounts
● Multiple regions
● 100s of microservices
● 1,000s of deployments
● > 100,000 instances
And complex.
Hint: It’s not the network
What's wrong with the network?
Hint: Faraway places are far away
Why is the network so slow?
Hint: Distributed systems are hard
My service can’t connect
to its dependencies.
● No access to the underlying network
● Large traffic volume
● Billions of flows per day
● Gigabytes per second
● Dynamic environment
● Logs are limited, ex. IP-to-IP
● IP addresses are randomly assigned
● IP metadata varies over time, unpredictable
Challenges
● Good: Wide coverage of network traffic
● Good: Consolidated
● Good: Core info (source and destination IP)
● Bad: 10-minute capture window
● Ugly: Stateless
AWS VPC Flow Logs
At time t
Source
IP
172.31.16.139
Destination
IP
10.13.67.49
At time t
Source
metadata
Service
A
Account
1234567890
Zone
us-east-1e
Destination
metadata
Service
B
Account
0987654321
Zone
eu-west-1b
● Develop a new data source for network analytics
● Multiple dimensions (Netflix- and AWS-centric)
● Fast aggregations
● Enable ad-hoc OLAP-style queries
● Rollup, drill down, slicing and dicing
● Add visibility to network
● Fill gap not addressed by existing tools and metrics
Goal
Dredge
Enriched and aggregated
multi-dimensional
network traffic data
Patterns
● Delay: 24 hours (daily interval)
● Bounded, fixed-size input
● Measured by throughput (time to process input)
● Limitations
● Remote DB: Round-trip time, parallel queries could overload
● Local cache: Depends on distribution of data, how to handle invalidation
● Local DB: More effective, less contention and no network RTT
Batch
Patterns: Batch + DB
Patterns: Batch + DB
● Delay: 7 minutes, average case (capture window)
● Unbounded input as events happen
● Measured by how far consumer is behind
● Limitations (similar to batch)
● Remote DB: Round-trip time, parallel queries could overload
● Local cache: Depends on distribution of data, how to handle invalidation
● Local DB: More effective, less contention and no network RTT
Stream
Patterns: Stream + DB
Patterns: Stream + DB
Patterns: Stream + DB + Cache
Patterns: Stream + DB + Cache
● Ex: Database indexes, caches, materialized views
● Transformed from source of truth
● Optimized for read queries, improve performance
● Built from a changelog of events
Derived data
● Log-based message broker to send change events
● Expose changelog stream as 1st class citizen
● Consume and join streams instead of querying DB
● Alternative view to query efficiently
● Update when data changes
● Removes network round-trip time, resource contention
● Pre-computed cache
Change data capture
Patterns: Stream + table join
Patterns: Stream + table join
Kinesis
● Integration with AWS services
● VPC Flow Logs
● Amazon S3
● Elasticsearch
● Handles scale
● Kinesis Client Library (KCL)
● Total Cost of Ownership (TCO)
Why Kinesis?
CloudWatch Logs sharing
● Load streaming data differently
● Batch with Kinesis Firehose
● Store in Amazon S3, process with Lambda
● Elasticsearch as an intermediate store
● Stream with Kinesis Streams
Enables experimentation
VPC Flow Logs IncomingBytes per hour
Example account and region over 1 week
Elastic throughput
VPC Flow Logs IncomingBytes per minute
Example account and region over 3 hours
Elastic throughput
● Worker per EC2 instance
○ Multiple record processors per worker
○ Record processor per shard
● Load balancing between workers
● Checkpointing (with DynamoDB)
● Stream- and shard-level metrics
Kinesis Client Library
● Very little operational overhead
○ Monitor stream metrics and DynamoDB table
○ Run and manage auto-scaling util
TCO
● Per-shard limits
○ Increase shard count or fan out to other streams
● No log compaction
○ Up to 7-day max retention
○ Manual snapshots, increased complexity
○ Not ideal for changelog joins
Limitations
Results
7 million network flows
Enriched per second
5 minutes
Average delay from network flow occurrence
1 Kinesis stream
With 100s of shards
By the numbers
What's wrong with the network?
Dredge reduces
mean-time-to-innocence.
Fault Domain 2 Account 0987654321, Zone eu-west-1a
Fault Domain 1 Account 1234567890, Zone us-east-1e
Fault Domain 2 Account 0987654321, Zone eu-west-1a
Fault Domain 1 Account 1234567890, Zone us-east-1e
Bad code
push?
Fault Domain 2 Account 0987654321, Zone eu-west-1a
Fault Domain 1 Account 1234567890, Zone us-east-1e
Network
outage?
Why is the network so slow?
Dredge identifies
high-latency network flows.
Region us-east-1
Zone Affinity
<1ms
Zone us-east-1d
Zone us-east-1e
Region us-east-1
Cross-zone
< 2ms
Zone us-east-1d
Zone us-east-1e
Region us-east-1
Zone us-east-1d
Zone us-east-1e
Region us-west-2
Zone us-west-2a
Zone us-west-2b
Cross-region
30-300ms
Region us-east-1
Zone us-east-1d
Region us-west-2
Zone us-west-2a
Zone us-west-2b
Cross-region fan-out
30-300ms
● Estimate 23% of total traffic is cross-zone
● About 14% of total traffic is cross-region
● Some intentional cross-zone, cross-region traffic
Initial findings
My service can’t connect
to its dependencies.
Dredge classifies a service’s
inbound and outbound dependencies.
Existing tools
● Distributed tracing via Salp
● Similar to Google‘s Dapper
● Naive sampling
● JVM-centric
● Incomplete coverage
● Need to be a part of the main request path
● Difficult to capture startup dependencies
● Lacks support protocols other than TCP IPv4
Outbound dependencies using tracing
Outbound Dependencies using Tracing
Outbound dependencies using traffic logs
Initial findings
● Significant discrepancy between Dredge and Salp
● Sample of 100 services
● Dependencies from tracing are a subset
● Tracing is implemented inconsistently
● Higher coverage
● Connections to AWS services prove helpful
Security use cases
● Use network dependencies to audit security groups
● Reduce blast radius
● Only source of logs for security group rejected flows
● Reports communication with public Internet
● Threat detection, port scanning, etc.
● AWS resources (instances, load balancers) with increased exposure
● Risk profiles
Enriched and aggregated traffic data
is a powerful source of information
that adds visibility to the network.
John Bennett
Cloud Network Engineering
bennett@netflix.com
@yo_bennett
Thank you!

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)
AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)
AWS re:Invent 2016: Another Day in the Life of a Netflix Engineer (DEV209)
 
Building Your First Big Data Application on AWS
Building Your First Big Data Application on AWSBuilding Your First Big Data Application on AWS
Building Your First Big Data Application on AWS
 
Monitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECSMonitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECS
 
Real-time Data Processing using AWS Lambda
Real-time Data Processing using AWS LambdaReal-time Data Processing using AWS Lambda
Real-time Data Processing using AWS Lambda
 
Getting Started with Amazon Kinesis
Getting Started with Amazon KinesisGetting Started with Amazon Kinesis
Getting Started with Amazon Kinesis
 
WKS407 Wild Rydes Takes Off – The Dawn of a New Unicorn
WKS407 Wild Rydes Takes Off – The Dawn of a New UnicornWKS407 Wild Rydes Takes Off – The Dawn of a New Unicorn
WKS407 Wild Rydes Takes Off – The Dawn of a New Unicorn
 
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
 
ENT316 Keeping Pace With The Cloud: Managing and Optimizing as You Scale
ENT316 Keeping Pace With The Cloud: Managing and Optimizing as You ScaleENT316 Keeping Pace With The Cloud: Managing and Optimizing as You Scale
ENT316 Keeping Pace With The Cloud: Managing and Optimizing as You Scale
 
SRV409 Deep Dive on Microservices and Docker
SRV409 Deep Dive on Microservices and DockerSRV409 Deep Dive on Microservices and Docker
SRV409 Deep Dive on Microservices and Docker
 
SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon GlacierSRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
 
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
 
Migrating Large Scale Data Sets to the Cloud
Migrating Large Scale Data Sets to the CloudMigrating Large Scale Data Sets to the Cloud
Migrating Large Scale Data Sets to the Cloud
 
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthSRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
 
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
AWS re:Invent 2016: Best Practices for Data Warehousing with Amazon Redshift ...
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Real-Time Event Processing
Real-Time Event ProcessingReal-Time Event Processing
Real-Time Event Processing
 
Introduction to Block and File storage on AWS
Introduction to Block and File storage on AWSIntroduction to Block and File storage on AWS
Introduction to Block and File storage on AWS
 
SRV420 Analyzing Streaming Data in Real-time with Amazon Kinesis
SRV420 Analyzing Streaming Data in Real-time with Amazon KinesisSRV420 Analyzing Streaming Data in Real-time with Amazon Kinesis
SRV420 Analyzing Streaming Data in Real-time with Amazon Kinesis
 
Compare Clouds: Aws vs Azure vs Google vs SoftLayer
Compare Clouds: Aws vs Azure vs Google vs SoftLayerCompare Clouds: Aws vs Azure vs Google vs SoftLayer
Compare Clouds: Aws vs Azure vs Google vs SoftLayer
 
Self-Service Supercomputing
Self-Service SupercomputingSelf-Service Supercomputing
Self-Service Supercomputing
 

Similar a BDA403 The Visible Network: How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows

Barga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteBarga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 Keynote
Roger Barga
 

Similar a BDA403 The Visible Network: How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows (20)

How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
BDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
BDA403 How Netflix Monitors Applications in Real-time with Amazon KinesisBDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
BDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
 
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with KinesisAWS APAC Webinar Week - Real Time Data Processing with Kinesis
AWS APAC Webinar Week - Real Time Data Processing with Kinesis
 
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere...
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
 
#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter
 
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisDay 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
 
Future Grid Overview 2018
Future Grid Overview 2018Future Grid Overview 2018
Future Grid Overview 2018
 
What's New with Amazon DynamoDB - SRV311 - Atlanta AWS Summit
What's New with Amazon DynamoDB - SRV311 - Atlanta AWS SummitWhat's New with Amazon DynamoDB - SRV311 - Atlanta AWS Summit
What's New with Amazon DynamoDB - SRV311 - Atlanta AWS Summit
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Barga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteBarga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 Keynote
 
Tapping the cloud for real time data analytics
 Tapping the cloud for real time data analytics Tapping the cloud for real time data analytics
Tapping the cloud for real time data analytics
 
High Volume Streaming Data: How Amazon Web Services is Changing Our Approach
High Volume Streaming Data: How Amazon Web Services is Changing Our ApproachHigh Volume Streaming Data: How Amazon Web Services is Changing Our Approach
High Volume Streaming Data: How Amazon Web Services is Changing Our Approach
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
 

Más de Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

Último (20)

Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

BDA403 The Visible Network: How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. John Bennett Senior Software Engineer, Netflix The Visible Network How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows Allan MacInnis Senior Solutions Architect, Amazon July 27, 2017
  • 2. What is your decision latency? OR
  • 3. The value of data Recent data is valuable If you act on it in real time Capture all of the value from your data
  • 4. Amazon Kinesis Streams Custom real time processing Amazon Kinesis Firehose Load and transform your data Amazon Kinesis Analytics Easily analyze data streams using standard SQL queries Amazon Kinesis: Streaming data made easy
  • 5. Capture all of the value from your data with Amazon Kinesis Amazon S3 Amazon Kinesis Analytics Amazon Kinesis– enabled appAmazon Kinesis Streams Ingest Process React Persist AWS Lambda 0ms 200ms 1-2 s Amazon QuickSight Amazon Kinesis Firehose Amazon Redshift
  • 6. Amazon Kinesis customer base diversity 1 billion events/wk from connected devices | IoT 17 PB of game data per season | Entertainment 80 billion ad impressions/day, 30 ms response time | Ad Tech 100 GB/day click streams from 250+ sites | Enterprise 50 billion ad impressions/day sub-50 ms responses | Ad Tech 10 million events/day | Retail Amazon Kinesis as databus Migrated from Kafka to Kinesis | Enterprise Funnel all production events through Amazon Kinesis | High Tech
  • 7. Why are these customers choosing Amazon Kinesis? Lower costs Performant without heavy lifting Scales elastically Increased agility Secure and visible Plug and play
  • 8. Amazon Kinesis customer base diversity Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real-Time
  • 9. What is Netflix’s decision latency?
  • 10. ● Motivation ● Data processing patterns ● Using Kinesis ● Results Agenda
  • 11. ● 104 million customers ● Over 190 countries ● 37% of Internet traffic ● 125 million hours of video Netflix is big.
  • 12. ● Dozens of accounts ● Multiple regions ● 100s of microservices ● 1,000s of deployments ● > 100,000 instances And complex.
  • 13. Hint: It’s not the network What's wrong with the network?
  • 14. Hint: Faraway places are far away Why is the network so slow?
  • 15. Hint: Distributed systems are hard My service can’t connect to its dependencies.
  • 16. ● No access to the underlying network ● Large traffic volume ● Billions of flows per day ● Gigabytes per second ● Dynamic environment ● Logs are limited, ex. IP-to-IP ● IP addresses are randomly assigned ● IP metadata varies over time, unpredictable Challenges
  • 17. ● Good: Wide coverage of network traffic ● Good: Consolidated ● Good: Core info (source and destination IP) ● Bad: 10-minute capture window ● Ugly: Stateless AWS VPC Flow Logs
  • 20. ● Develop a new data source for network analytics ● Multiple dimensions (Netflix- and AWS-centric) ● Fast aggregations ● Enable ad-hoc OLAP-style queries ● Rollup, drill down, slicing and dicing ● Add visibility to network ● Fill gap not addressed by existing tools and metrics Goal
  • 23.
  • 24. ● Delay: 24 hours (daily interval) ● Bounded, fixed-size input ● Measured by throughput (time to process input) ● Limitations ● Remote DB: Round-trip time, parallel queries could overload ● Local cache: Depends on distribution of data, how to handle invalidation ● Local DB: More effective, less contention and no network RTT Batch
  • 27. ● Delay: 7 minutes, average case (capture window) ● Unbounded input as events happen ● Measured by how far consumer is behind ● Limitations (similar to batch) ● Remote DB: Round-trip time, parallel queries could overload ● Local cache: Depends on distribution of data, how to handle invalidation ● Local DB: More effective, less contention and no network RTT Stream
  • 30. Patterns: Stream + DB + Cache
  • 31. Patterns: Stream + DB + Cache
  • 32. ● Ex: Database indexes, caches, materialized views ● Transformed from source of truth ● Optimized for read queries, improve performance ● Built from a changelog of events Derived data
  • 33. ● Log-based message broker to send change events ● Expose changelog stream as 1st class citizen ● Consume and join streams instead of querying DB ● Alternative view to query efficiently ● Update when data changes ● Removes network round-trip time, resource contention ● Pre-computed cache Change data capture
  • 34.
  • 35.
  • 36.
  • 37. Patterns: Stream + table join
  • 38. Patterns: Stream + table join
  • 40. ● Integration with AWS services ● VPC Flow Logs ● Amazon S3 ● Elasticsearch ● Handles scale ● Kinesis Client Library (KCL) ● Total Cost of Ownership (TCO) Why Kinesis?
  • 42. ● Load streaming data differently ● Batch with Kinesis Firehose ● Store in Amazon S3, process with Lambda ● Elasticsearch as an intermediate store ● Stream with Kinesis Streams Enables experimentation
  • 43. VPC Flow Logs IncomingBytes per hour Example account and region over 1 week Elastic throughput
  • 44. VPC Flow Logs IncomingBytes per minute Example account and region over 3 hours Elastic throughput
  • 45. ● Worker per EC2 instance ○ Multiple record processors per worker ○ Record processor per shard ● Load balancing between workers ● Checkpointing (with DynamoDB) ● Stream- and shard-level metrics Kinesis Client Library
  • 46. ● Very little operational overhead ○ Monitor stream metrics and DynamoDB table ○ Run and manage auto-scaling util TCO
  • 47. ● Per-shard limits ○ Increase shard count or fan out to other streams ● No log compaction ○ Up to 7-day max retention ○ Manual snapshots, increased complexity ○ Not ideal for changelog joins Limitations
  • 49. 7 million network flows Enriched per second 5 minutes Average delay from network flow occurrence 1 Kinesis stream With 100s of shards By the numbers
  • 50. What's wrong with the network? Dredge reduces mean-time-to-innocence.
  • 51. Fault Domain 2 Account 0987654321, Zone eu-west-1a Fault Domain 1 Account 1234567890, Zone us-east-1e
  • 52. Fault Domain 2 Account 0987654321, Zone eu-west-1a Fault Domain 1 Account 1234567890, Zone us-east-1e Bad code push?
  • 53. Fault Domain 2 Account 0987654321, Zone eu-west-1a Fault Domain 1 Account 1234567890, Zone us-east-1e Network outage?
  • 54. Why is the network so slow? Dredge identifies high-latency network flows.
  • 55. Region us-east-1 Zone Affinity <1ms Zone us-east-1d Zone us-east-1e
  • 56. Region us-east-1 Cross-zone < 2ms Zone us-east-1d Zone us-east-1e
  • 57. Region us-east-1 Zone us-east-1d Zone us-east-1e Region us-west-2 Zone us-west-2a Zone us-west-2b Cross-region 30-300ms
  • 58. Region us-east-1 Zone us-east-1d Region us-west-2 Zone us-west-2a Zone us-west-2b Cross-region fan-out 30-300ms
  • 59. ● Estimate 23% of total traffic is cross-zone ● About 14% of total traffic is cross-region ● Some intentional cross-zone, cross-region traffic Initial findings
  • 60. My service can’t connect to its dependencies. Dredge classifies a service’s inbound and outbound dependencies.
  • 61. Existing tools ● Distributed tracing via Salp ● Similar to Google‘s Dapper ● Naive sampling ● JVM-centric ● Incomplete coverage ● Need to be a part of the main request path ● Difficult to capture startup dependencies ● Lacks support protocols other than TCP IPv4
  • 63. Outbound Dependencies using Tracing Outbound dependencies using traffic logs
  • 64. Initial findings ● Significant discrepancy between Dredge and Salp ● Sample of 100 services ● Dependencies from tracing are a subset ● Tracing is implemented inconsistently ● Higher coverage ● Connections to AWS services prove helpful
  • 65. Security use cases ● Use network dependencies to audit security groups ● Reduce blast radius ● Only source of logs for security group rejected flows ● Reports communication with public Internet ● Threat detection, port scanning, etc. ● AWS resources (instances, load balancers) with increased exposure ● Risk profiles
  • 66. Enriched and aggregated traffic data is a powerful source of information that adds visibility to the network.
  • 67. John Bennett Cloud Network Engineering bennett@netflix.com @yo_bennett