Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this talk, we’ll first discuss why Netflix chose Amazon Kinesis Streams over other data streaming solutions like Kafka to address these challenges at scale. We’ll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you’ll take away techniques and processes that you can apply to your large-scale networks and derive real-time, actionable insights.
3. The value of data
● Recent data is valuable if you act on it in real time
● Capture all of the value from your data
4. Amazon Kinesis: Streaming data made easy
● Amazon Kinesis Streams: custom real-time processing
● Amazon Kinesis Firehose: load and transform your data
● Amazon Kinesis Analytics: easily analyze data streams using standard SQL queries
5. Capture all of the value from your data with Amazon Kinesis
[Architecture diagram: Ingest → Process → React → Persist pipeline (0 ms → 200 ms → 1–2 s) spanning Amazon Kinesis Streams, Amazon Kinesis Analytics, AWS Lambda, an Amazon Kinesis–enabled app, Amazon Kinesis Firehose, Amazon S3, Amazon Redshift, and Amazon QuickSight]
6. Amazon Kinesis customer base diversity
● 1 billion events/wk from connected devices | IoT
● 17 PB of game data per season | Entertainment
● 80 billion ad impressions/day, 30 ms response time | Ad Tech
● 100 GB/day click streams from 250+ sites | Enterprise
● 50 billion ad impressions/day, sub-50 ms responses | Ad Tech
● 10 million events/day | Retail
● Amazon Kinesis as a databus; migrated from Kafka to Kinesis | Enterprise
● Funnel all production events through Amazon Kinesis | High Tech
7. Why are these customers choosing Amazon Kinesis?
● Lower costs
● Performant without heavy lifting
● Scales elastically
● Increased agility
● Secure and visible
● Plug and play
8. Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real Time
16. Challenges
● No access to the underlying network
● Large traffic volume
  ○ Billions of flows per day
  ○ Gigabytes per second
● Dynamic environment
  ○ Logs are limited, e.g., IP-to-IP
  ○ IP addresses are randomly assigned
  ○ IP metadata varies over time and is unpredictable
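The raw logs mentioned here are bare IP-to-IP records. As a concrete illustration, a default (version 2) VPC Flow Log line can be split into its documented fields with a sketch like this — the sample record itself is made up:

```python
# Field order of a default (version 2) VPC Flow Log record.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_log(line: str) -> dict:
    """Turn a space-separated flow log line into a named record."""
    record = dict(zip(FIELDS, line.split()))
    # Numeric fields arrive as strings; convert the ones used for aggregation.
    for field in ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end"):
        record[field] = int(record[field])
    return record

# Hypothetical sample record for illustration.
sample = "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.7 443 49152 6 10 8400 1418530010 1418530070 ACCEPT OK"
flow = parse_flow_log(sample)
```

Note that the record carries only IPs, ports, and byte counts — everything else (application, zone, owner) must be joined in from other sources, which is the enrichment problem the rest of the talk addresses.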
20. Goal
● Develop a new data source for network analytics
  ○ Multiple dimensions (Netflix- and AWS-centric)
  ○ Fast aggregations
● Enable ad hoc OLAP-style queries
  ○ Roll-up, drill-down, slicing and dicing
● Add visibility to the network
  ○ Fill a gap not addressed by existing tools and metrics
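As a minimal sketch of the roll-up/drill-down style of query this enables — the flow records and dimension names here are hypothetical, not Netflix's actual schema:

```python
from collections import defaultdict

# Hypothetical enriched flows: (src_app, src_zone, dst_app, bytes).
flows = [
    ("api", "us-east-1a", "cache", 500),
    ("api", "us-east-1b", "cache", 300),
    ("web", "us-east-1a", "api", 200),
]

def rollup(flows, key):
    """Aggregate bytes along an arbitrary subset of dimensions."""
    totals = defaultdict(int)
    for src_app, src_zone, dst_app, nbytes in flows:
        dims = {"src_app": src_app, "src_zone": src_zone, "dst_app": dst_app}
        totals[tuple(dims[k] for k in key)] += nbytes
    return dict(totals)

# Drill down by (src_app, dst_app), then roll up to src_app alone.
by_pair = rollup(flows, ["src_app", "dst_app"])
by_app = rollup(flows, ["src_app"])
```

The same aggregation function serves both the detailed and the rolled-up view, which is what makes slicing and dicing over many dimensions cheap once the flows are enriched.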
24. Batch
● Delay: 24 hours (daily interval)
● Bounded, fixed-size input
● Measured by throughput (time to process the input)
● Limitations
  ○ Remote DB: round-trip time; parallel queries could overload it
  ○ Local cache: depends on the distribution of data and how to handle invalidation
  ○ Local DB: more effective; less contention and no network RTT
27. Stream
● Delay: 7 minutes in the average case (capture window)
● Unbounded input as events happen
● Measured by how far the consumer is behind
● Limitations (similar to batch)
  ○ Remote DB: round-trip time; parallel queries could overload it
  ○ Local cache: depends on the distribution of data and how to handle invalidation
  ○ Local DB: more effective; less contention and no network RTT
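A minimal sketch of the "local DB" enrichment approach that both the batch and streaming designs favor — each IP-to-IP flow is joined against metadata held in-process, so no network round trip is paid per record. The metadata table and field names are hypothetical:

```python
# Local, in-process metadata keyed by IP address: lookups cost no network RTT.
ip_metadata = {
    "10.0.1.5": {"app": "api", "zone": "us-east-1a"},
    "10.0.2.7": {"app": "cache", "zone": "us-east-1b"},
}

def enrich(flow: dict) -> dict:
    """Attach source and destination metadata to a raw IP-to-IP flow."""
    unknown = {"app": "unknown", "zone": "unknown"}
    return {
        **flow,
        "src": ip_metadata.get(flow["srcaddr"], unknown),
        "dst": ip_metadata.get(flow["dstaddr"], unknown),
    }

enriched = enrich({"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.7", "bytes": 8400})
```

The hard part, as the slides note, is not the lookup itself but keeping the local table correct as IP assignments churn — which motivates the derived-data and change-data-capture discussion that follows.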
32. Derived data
● Examples: database indexes, caches, materialized views
● Transformed from the source of truth
● Optimized for read queries to improve performance
● Built from a changelog of events
33. Change data capture
● Log-based message broker to send change events
● Expose the changelog stream as a first-class citizen
● Consume and join streams instead of querying the DB
  ○ An alternative view that can be queried efficiently
  ○ Updated when the data changes
  ○ Removes network round-trip time and resource contention
  ○ A pre-computed cache
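The change-data-capture idea above can be sketched as a consumer that applies changelog events to a local materialized view, after which joining the flow stream against it is a plain dictionary lookup. The event shapes here are hypothetical:

```python
# Local materialized view of IP metadata, kept fresh by a changelog stream.
view = {}

def apply_change(event: dict) -> None:
    """Apply one changelog event (upsert or delete) to the local view."""
    if event["op"] == "upsert":
        view[event["key"]] = event["value"]
    elif event["op"] == "delete":
        view.pop(event["key"], None)

# Replaying the changelog in order reconstructs the current state.
changelog = [
    {"op": "upsert", "key": "10.0.1.5", "value": {"app": "api"}},
    {"op": "upsert", "key": "10.0.2.7", "value": {"app": "cache"}},
    {"op": "delete", "key": "10.0.2.7"},
]
for event in changelog:
    apply_change(event)

# Joining a flow against the view is now a local lookup, not a remote query.
src_app = view.get("10.0.1.5", {}).get("app")
```

Because the view is rebuilt purely from ordered change events, it stays consistent with the source of truth without the consumer ever querying the database on the request path.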
42. Enables experimentation
● Load streaming data differently
  ○ Batch with Kinesis Firehose: store in Amazon S3, process with Lambda
  ○ Elasticsearch as an intermediate store
  ○ Stream with Kinesis Streams
43. Elastic throughput
[Chart: VPC Flow Logs IncomingBytes per hour for an example account and region over 1 week]
44. Elastic throughput
[Chart: VPC Flow Logs IncomingBytes per minute for an example account and region over 3 hours]
45. Kinesis Client Library
● Worker per EC2 instance
  ○ Multiple record processors per worker
  ○ Record processor per shard
● Load balancing between workers
● Checkpointing (with DynamoDB)
● Stream- and shard-level metrics
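The one-record-processor-per-shard model rests on Kinesis's deterministic routing: each record's partition key is MD5-hashed into a 128-bit space, and each shard owns a contiguous range of that space. A sketch of the mapping, assuming evenly split shard ranges:

```python
import hashlib

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index the way Kinesis does:
    MD5-hash the key to a 128-bit integer and find the shard whose
    hash-key range contains it (ranges assumed evenly split here)."""
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // shard_count
    return min(key_hash // range_size, shard_count - 1)

# All records with the same partition key land on the same shard,
# which preserves per-key ordering for that shard's record processor.
same_shard = shard_for_key("10.0.1.17", 8) == shard_for_key("10.0.1.17", 8)
```

In production, shard ranges need not be equal (splits and merges produce uneven ranges), so this even-split assumption is an illustration of the hashing scheme rather than a faithful router.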
46. TCO
● Very little operational overhead
  ○ Monitor stream metrics and the DynamoDB table
  ○ Run and manage the auto-scaling utility
47. Limitations
● Per-shard limits
  ○ Increase the shard count or fan out to other streams
● No log compaction
  ○ Up to 7-day maximum retention
  ○ Manual snapshots add complexity
  ○ Not ideal for changelog joins
49. By the numbers
● 7 million network flows enriched per second
● 5 minutes average delay from network flow occurrence
● 1 Kinesis stream with 100s of shards
50. What's wrong with the network?
Dredge reduces mean-time-to-innocence.
51. [Diagram: Fault Domain 1 (Account 1234567890, Zone us-east-1e) and Fault Domain 2 (Account 0987654321, Zone eu-west-1a)]
52. [Same diagram] Bad code push?
53. [Same diagram] Network outage?
54. Why is the network so slow?
Dredge identifies high-latency network flows.
59. Initial findings
● An estimated 23% of total traffic is cross-zone
● About 14% of total traffic is cross-region
● Some cross-zone and cross-region traffic is intentional
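Once both endpoints of a flow carry zone metadata, classifying it as cross-zone or cross-region is a string comparison. A sketch, assuming the usual AWS naming where an Availability Zone name is the region name plus a letter suffix:

```python
def classify(src_zone: str, dst_zone: str) -> str:
    """Classify a flow by comparing endpoint zones.
    Assumes AWS-style zone names, e.g. "us-east-1a" = region "us-east-1" + "a"."""
    src_region, dst_region = src_zone[:-1], dst_zone[:-1]
    if src_region != dst_region:
        return "cross-region"
    if src_zone != dst_zone:
        return "cross-zone"
    return "same-zone"

# Cross-zone and cross-region traffic carry extra cost and latency,
# so tallying these labels over all flows yields percentages like the above.
label = classify("us-east-1a", "eu-west-1a")
```

Summing enriched flow bytes per label is then an ordinary aggregation, which is how estimates like "23% cross-zone" can be produced.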
60. My service can’t connect to its dependencies.
Dredge classifies a service’s inbound and outbound dependencies.
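Deriving inbound and outbound dependency sets from enriched flows reduces to grouping by endpoint application. A sketch with hypothetical flow records:

```python
from collections import defaultdict

# Hypothetical enriched flows as (src_app, dst_app) pairs.
flows = [("web", "api"), ("api", "cache"), ("api", "db"), ("batch", "api")]

# Outbound: who each service talks to. Inbound: who talks to each service.
outbound = defaultdict(set)
inbound = defaultdict(set)
for src_app, dst_app in flows:
    outbound[src_app].add(dst_app)
    inbound[dst_app].add(src_app)
```

Because this is computed from observed network flows rather than instrumented request paths, it catches dependencies that tracing misses — startup calls, non-JVM services, and traffic outside the main request path.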
61. Existing tools
● Distributed tracing via Salp
  ○ Similar to Google's Dapper
● Naive sampling
● JVM-centric
● Incomplete coverage
  ○ Needs to be part of the main request path
  ○ Difficult to capture startup dependencies
● Lacks support for protocols other than TCP over IPv4
64. Initial findings
● Significant discrepancy between Dredge and Salp
  ○ Sample of 100 services
  ○ Dependencies from tracing are a subset
  ○ Tracing is implemented inconsistently
● Higher coverage
  ○ Connections to AWS services prove helpful
65. Security use cases
● Use network dependencies to audit security groups
  ○ Reduce blast radius
● Only source of logs for flows rejected by security groups
● Reports communication with the public Internet
  ○ Threat detection, port scanning, etc.
● AWS resources (instances, load balancers) with increased exposure
  ○ Risk profiles
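One way to audit security groups against observed traffic is to compare each allowed port with the ports actually seen in the flow logs; rules that are never exercised are candidates for removal, shrinking the blast radius. A sketch with hypothetical rules and observations:

```python
# Hypothetical: ports a security group allows vs. ports actually observed
# in enriched flow logs for the instances behind it.
allowed_ports = {22, 443, 8080}
observed_ports = {443, 8080}

# Allowed but never used: candidates for tightening the security group.
unused_rules = allowed_ports - observed_ports

# Observed but not allowed would indicate a gap in coverage or a bug,
# since such flows should appear only as security-group REJECTs.
unexpected = observed_ports - allowed_ports
```

The same set comparison extends naturally to (port, CIDR) pairs, which is closer to how real security group rules are expressed.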
66. Enriched and aggregated traffic data is a powerful source of information that adds visibility to the network.