Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this talk, we’ll first discuss why Netflix chose Amazon Kinesis Streams over other data streaming solutions like Kafka to address these challenges at scale. We’ll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you’ll take away techniques and processes that you can apply to your large-scale networks and derive real-time, actionable insights.
3. The value of data
● Recent data is valuable if you act on it in real time
● Capture all of the value from your data
4. Amazon Kinesis: Streaming data made easy
● Amazon Kinesis Streams: custom real-time processing
● Amazon Kinesis Firehose: load and transform your data
● Amazon Kinesis Analytics: easily analyze data streams using standard SQL queries
5. Capture all of the value from your data with Amazon Kinesis
[Architecture diagram: Ingest → Process → React → Persist pipeline (0 ms → 200 ms → 1–2 s) spanning Amazon Kinesis Streams, Amazon Kinesis Analytics, AWS Lambda, an Amazon Kinesis–enabled app, Amazon Kinesis Firehose, Amazon S3, Amazon Redshift, and Amazon QuickSight]
6. Amazon Kinesis customer base diversity
● 1 billion events/wk from connected devices | IoT
● 17 PB of game data per season | Entertainment
● 80 billion ad impressions/day, 30 ms response time | Ad Tech
● 100 GB/day click streams from 250+ sites | Enterprise
● 50 billion ad impressions/day, sub-50 ms responses | Ad Tech
● 10 million events/day | Retail
● Amazon Kinesis as a databus; migrated from Kafka to Kinesis | Enterprise
● Funnel all production events through Amazon Kinesis | High Tech
7. Why are these customers choosing Amazon Kinesis?
● Lower costs
● Performant without heavy lifting
● Scales elastically
● Increased agility
● Secure and visible
● Plug and play
8. Netflix Uses Kinesis Streams to Analyze Billions of Network Traffic Flows in Real Time
16. Challenges
● No access to the underlying network
● Large traffic volume
  ○ Billions of flows per day
  ○ Gigabytes per second
● Dynamic environment
  ○ Logs are limited, e.g., IP-to-IP
  ○ IP addresses are randomly assigned
  ○ IP metadata varies over time and is unpredictable
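The raw logs mentioned here are bare IP-to-IP records. As a concrete illustration, a default (version 2) VPC Flow Log line can be split into its documented fields with a sketch like this — the sample record itself is made up:

```python
# Field order of a default (version 2) VPC Flow Log record.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_log(line: str) -> dict:
    """Turn a space-separated flow log line into a named record."""
    record = dict(zip(FIELDS, line.split()))
    # Numeric fields arrive as strings; convert the ones used for aggregation.
    for field in ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end"):
        record[field] = int(record[field])
    return record

# Hypothetical sample record for illustration.
sample = "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.7 443 49152 6 10 8400 1418530010 1418530070 ACCEPT OK"
flow = parse_flow_log(sample)
```

Note that the record carries only IPs, ports, and byte counts — everything else (application, zone, owner) must be joined in from other sources, which is the enrichment problem the rest of the talk addresses.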
20. Goal
● Develop a new data source for network analytics
  ○ Multiple dimensions (Netflix- and AWS-centric)
  ○ Fast aggregations
● Enable ad hoc OLAP-style queries
  ○ Roll-up, drill-down, slicing and dicing
● Add visibility to the network
  ○ Fill a gap not addressed by existing tools and metrics
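As a minimal sketch of the roll-up/drill-down style of query this enables — the flow records and dimension names here are hypothetical, not Netflix's actual schema:

```python
from collections import defaultdict

# Hypothetical enriched flows: (src_app, src_zone, dst_app, bytes).
flows = [
    ("api", "us-east-1a", "cache", 500),
    ("api", "us-east-1b", "cache", 300),
    ("web", "us-east-1a", "api", 200),
]

def rollup(flows, key):
    """Aggregate bytes along an arbitrary subset of dimensions."""
    totals = defaultdict(int)
    for src_app, src_zone, dst_app, nbytes in flows:
        dims = {"src_app": src_app, "src_zone": src_zone, "dst_app": dst_app}
        totals[tuple(dims[k] for k in key)] += nbytes
    return dict(totals)

# Drill down by (src_app, dst_app), then roll up to src_app alone.
by_pair = rollup(flows, ["src_app", "dst_app"])
by_app = rollup(flows, ["src_app"])
```

The same aggregation function serves both the detailed and the rolled-up view, which is what makes slicing and dicing over many dimensions cheap once the flows are enriched.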
24. Batch
● Delay: 24 hours (daily interval)
● Bounded, fixed-size input
● Measured by throughput (time to process the input)
● Limitations
  ○ Remote DB: round-trip time; parallel queries could overload it
  ○ Local cache: depends on the distribution of data and how to handle invalidation
  ○ Local DB: more effective; less contention and no network RTT
27. Stream
● Delay: 7 minutes in the average case (capture window)
● Unbounded input as events happen
● Measured by how far the consumer is behind
● Limitations (similar to batch)
  ○ Remote DB: round-trip time; parallel queries could overload it
  ○ Local cache: depends on the distribution of data and how to handle invalidation
  ○ Local DB: more effective; less contention and no network RTT
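A minimal sketch of the "local DB" enrichment approach that both the batch and streaming designs favor — each IP-to-IP flow is joined against metadata held in-process, so no network round trip is paid per record. The metadata table and field names are hypothetical:

```python
# Local, in-process metadata keyed by IP address: lookups cost no network RTT.
ip_metadata = {
    "10.0.1.5": {"app": "api", "zone": "us-east-1a"},
    "10.0.2.7": {"app": "cache", "zone": "us-east-1b"},
}

def enrich(flow: dict) -> dict:
    """Attach source and destination metadata to a raw IP-to-IP flow."""
    unknown = {"app": "unknown", "zone": "unknown"}
    return {
        **flow,
        "src": ip_metadata.get(flow["srcaddr"], unknown),
        "dst": ip_metadata.get(flow["dstaddr"], unknown),
    }

enriched = enrich({"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.7", "bytes": 8400})
```

The hard part, as the slides note, is not the lookup itself but keeping the local table correct as IP assignments churn — which motivates the derived-data and change-data-capture discussion that follows.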
32. Derived data
● Examples: database indexes, caches, materialized views
● Transformed from the source of truth
● Optimized for read queries to improve performance
● Built from a changelog of events
33. Change data capture
● Log-based message broker to send change events
● Expose the changelog stream as a first-class citizen
● Consume and join streams instead of querying the DB
  ○ An alternative view that can be queried efficiently
  ○ Updated when the data changes
  ○ Removes network round-trip time and resource contention
  ○ A pre-computed cache
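The change-data-capture idea above can be sketched as a consumer that applies changelog events to a local materialized view, after which joining the flow stream against it is a plain dictionary lookup. The event shapes here are hypothetical:

```python
# Local materialized view of IP metadata, kept fresh by a changelog stream.
view = {}

def apply_change(event: dict) -> None:
    """Apply one changelog event (upsert or delete) to the local view."""
    if event["op"] == "upsert":
        view[event["key"]] = event["value"]
    elif event["op"] == "delete":
        view.pop(event["key"], None)

# Replaying the changelog in order reconstructs the current state.
changelog = [
    {"op": "upsert", "key": "10.0.1.5", "value": {"app": "api"}},
    {"op": "upsert", "key": "10.0.2.7", "value": {"app": "cache"}},
    {"op": "delete", "key": "10.0.2.7"},
]
for event in changelog:
    apply_change(event)

# Joining a flow against the view is now a local lookup, not a remote query.
src_app = view.get("10.0.1.5", {}).get("app")
```

Because the view is rebuilt purely from ordered change events, it stays consistent with the source of truth without the consumer ever querying the database on the request path.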
42. Enables experimentation
● Load streaming data differently
  ○ Batch with Kinesis Firehose: store in Amazon S3, process with Lambda
  ○ Elasticsearch as an intermediate store
  ○ Stream with Kinesis Streams
43. Elastic throughput
[Chart: VPC Flow Logs IncomingBytes per hour for an example account and region over 1 week]
44. Elastic throughput
[Chart: VPC Flow Logs IncomingBytes per minute for an example account and region over 3 hours]
45. Kinesis Client Library
● Worker per EC2 instance
  ○ Multiple record processors per worker
  ○ Record processor per shard
● Load balancing between workers
● Checkpointing (with DynamoDB)
● Stream- and shard-level metrics
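The one-record-processor-per-shard model rests on Kinesis's deterministic routing: each record's partition key is MD5-hashed into a 128-bit space, and each shard owns a contiguous range of that space. A sketch of the mapping, assuming evenly split shard ranges:

```python
import hashlib

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index the way Kinesis does:
    MD5-hash the key to a 128-bit integer and find the shard whose
    hash-key range contains it (ranges assumed evenly split here)."""
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // shard_count
    return min(key_hash // range_size, shard_count - 1)

# All records with the same partition key land on the same shard,
# which preserves per-key ordering for that shard's record processor.
same_shard = shard_for_key("10.0.1.17", 8) == shard_for_key("10.0.1.17", 8)
```

In production, shard ranges need not be equal (splits and merges produce uneven ranges), so this even-split assumption is an illustration of the hashing scheme rather than a faithful router.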
46. TCO
● Very little operational overhead
  ○ Monitor stream metrics and the DynamoDB table
  ○ Run and manage the auto-scaling utility
47. Limitations
● Per-shard limits
  ○ Increase the shard count or fan out to other streams
● No log compaction
  ○ Up to 7-day maximum retention
  ○ Manual snapshots add complexity
  ○ Not ideal for changelog joins
49. By the numbers
● 7 million network flows enriched per second
● 5 minutes average delay from network flow occurrence
● 1 Kinesis stream with 100s of shards
50. What's wrong with the network?
Dredge reduces mean-time-to-innocence.
51. [Diagram: Fault Domain 1 (Account 1234567890, Zone us-east-1e) and Fault Domain 2 (Account 0987654321, Zone eu-west-1a)]
52. [Same diagram] Bad code push?
53. [Same diagram] Network outage?
54. Why is the network so slow?
Dredge identifies high-latency network flows.
59. Initial findings
● An estimated 23% of total traffic is cross-zone
● About 14% of total traffic is cross-region
● Some cross-zone and cross-region traffic is intentional
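Once both endpoints of a flow carry zone metadata, classifying it as cross-zone or cross-region is a string comparison. A sketch, assuming the usual AWS naming where an Availability Zone name is the region name plus a letter suffix:

```python
def classify(src_zone: str, dst_zone: str) -> str:
    """Classify a flow by comparing endpoint zones.
    Assumes AWS-style zone names, e.g. "us-east-1a" = region "us-east-1" + "a"."""
    src_region, dst_region = src_zone[:-1], dst_zone[:-1]
    if src_region != dst_region:
        return "cross-region"
    if src_zone != dst_zone:
        return "cross-zone"
    return "same-zone"

# Cross-zone and cross-region traffic carry extra cost and latency,
# so tallying these labels over all flows yields percentages like the above.
label = classify("us-east-1a", "eu-west-1a")
```

Summing enriched flow bytes per label is then an ordinary aggregation, which is how estimates like "23% cross-zone" can be produced.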
60. My service can’t connect to its dependencies.
Dredge classifies a service’s inbound and outbound dependencies.
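Deriving inbound and outbound dependency sets from enriched flows reduces to grouping by endpoint application. A sketch with hypothetical flow records:

```python
from collections import defaultdict

# Hypothetical enriched flows as (src_app, dst_app) pairs.
flows = [("web", "api"), ("api", "cache"), ("api", "db"), ("batch", "api")]

# Outbound: who each service talks to. Inbound: who talks to each service.
outbound = defaultdict(set)
inbound = defaultdict(set)
for src_app, dst_app in flows:
    outbound[src_app].add(dst_app)
    inbound[dst_app].add(src_app)
```

Because this is computed from observed network flows rather than instrumented request paths, it catches dependencies that tracing misses — startup calls, non-JVM services, and traffic outside the main request path.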
61. Existing tools
● Distributed tracing via Salp
  ○ Similar to Google's Dapper
● Naive sampling
● JVM-centric
● Incomplete coverage
  ○ Needs to be part of the main request path
  ○ Difficult to capture startup dependencies
● Lacks support for protocols other than TCP over IPv4
64. Initial findings
● Significant discrepancy between Dredge and Salp
  ○ Sample of 100 services
  ○ Dependencies from tracing are a subset
  ○ Tracing is implemented inconsistently
● Higher coverage
  ○ Connections to AWS services prove helpful
65. Security use cases
● Use network dependencies to audit security groups
  ○ Reduce blast radius
● Only source of logs for flows rejected by security groups
● Reports communication with the public Internet
  ○ Threat detection, port scanning, etc.
● AWS resources (instances, load balancers) with increased exposure
  ○ Risk profiles
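One way to audit security groups against observed traffic is to compare each allowed port with the ports actually seen in the flow logs; rules that are never exercised are candidates for removal, shrinking the blast radius. A sketch with hypothetical rules and observations:

```python
# Hypothetical: ports a security group allows vs. ports actually observed
# in enriched flow logs for the instances behind it.
allowed_ports = {22, 443, 8080}
observed_ports = {443, 8080}

# Allowed but never used: candidates for tightening the security group.
unused_rules = allowed_ports - observed_ports

# Observed but not allowed would indicate a gap in coverage or a bug,
# since such flows should appear only as security-group REJECTs.
unexpected = observed_ports - allowed_ports
```

The same set comparison extends naturally to (port, CIDR) pairs, which is closer to how real security group rules are expressed.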
66. Enriched and aggregated traffic data is a powerful source of information that adds visibility to the network.