Comcast's X1 Platform delivers a dramatically new entertainment experience, not just to Comcast subscribers but to customers of several other major cable companies as well. Syndicating the platform requires a massive integration layer between companies, and a big part of that integration is delivering billions of data points per day to syndication partners. Find out how the X1 Platform uses Amazon Kinesis as a data bus, drastically simplifying data integration with partners. Along the way, see how X1 uses AWS Lambda, Spark on Amazon EMR, and Amazon S3, all adding up to a near-serverless big data backbone.
2. What to expect from this session
• Streaming scenarios
• Amazon Kinesis overview
• Comcast X1 Platform
• Challenges with streaming data
• Schema management
3. Streaming data scenarios across segments

Segment             | Accelerated ingest-transform-load           | Continual metrics generation                                         | Responsive data analysis
Ad/marketing tech   | Publisher, bidder data aggregation          | Advertising metrics like coverage, yield, conversion                 | Analytics on user engagement with ads; optimized bid/buy engines
IoT                 | Sensor, device telemetry data ingestion     | IT operational metrics dashboards                                    | Sensor operational intelligence, alerts, and notifications
Gaming              | Online customer engagement data aggregation | Consumer engagement metrics for level success, transition rates, CTR | Clickstream analytics, leaderboard generation, player-skill match engines
Consumer engagement | Online customer engagement data aggregation | Consumer engagement metrics like page views, CTR                     | Clickstream analytics, recommendation engines
4. Amazon Kinesis makes it easy to work with real-time streaming data
• Amazon Kinesis Streams: for technical developers; collect and stream data for ordered, replayable, real-time processing
• Amazon Kinesis Firehose: for all developers and data scientists; easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service
• Amazon Kinesis Analytics: for all developers and data scientists; easily analyze data streams using standard SQL queries
5. Amazon Kinesis Streams
Easy administration: Simply create a new stream and set the desired level of capacity with shards. Scale to match your data throughput rate and volume.
Build real-time applications: Perform continual processing on streaming big data using the Amazon Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, and more.
Low cost: Cost-efficient for workloads of any scale.
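The "build real-time applications" point starts with getting records onto a stream. A minimal sketch of the producer side with boto3 (the AWS SDK for Python); the stream name `x1-telemetry` and the event fields are illustrative, not from the talk:

```python
import json

def build_put_record(stream_name, payload, partition_key):
    """Build the parameter dict for a Kinesis PutRecord call.

    Records that share a partition key land on the same shard,
    which preserves their relative order for consumers.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": partition_key,
    }

params = build_put_record(
    "x1-telemetry", {"event": "tune", "channel": 42}, "stb-0001"
)

# With AWS credentials configured, the record would be sent like this:
#   import boto3
#   boto3.client("kinesis").put_record(**params)
```

Keying by device ID (here a made-up set-top-box ID) is one common choice: it spreads load across shards while keeping each device's events ordered.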
7. Amazon Kinesis Firehose
Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.
Direct-to-data-store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
Seamless elasticity: Seamlessly scale to match data throughput without intervention.
[Diagram: capture and submit streaming data to Firehose; Firehose loads the streaming data continuously into Amazon S3 and Amazon Redshift; analyze the streaming data using your favorite BI tools]
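The "batch" part of the Firehose story also applies on the client side: PutRecordBatch accepts at most 500 records and 4 MiB per call, so a producer chunks its buffer accordingly. A minimal sketch of that chunking (the delivery-stream name in the comment is illustrative):

```python
def firehose_batches(records, max_records=500, max_bytes=4 * 1024 * 1024):
    """Split encoded records into chunks honoring the Firehose
    PutRecordBatch limits: at most 500 records and 4 MiB per call."""
    batch, size = [], 0
    for data in records:
        # Flush the current batch before it would exceed either limit.
        if batch and (len(batch) >= max_records or size + len(data) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(data)
        size += len(data)
    if batch:
        yield batch

events = [f"event-{i}\n".encode() for i in range(1200)]
batches = list(firehose_batches(events))  # 1200 records -> 3 calls

# Each batch would then be sent with:
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="x1-delivery",
#       Records=[{"Data": d} for d in batch])
```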
10. Amazon Kinesis Analytics (new!)
Analyze data streams continuously with standard SQL.
Apply SQL on streams: Easily connect to data streams and apply existing SQL skills.
Build real-time applications: Perform continual processing on streaming big data with sub-second processing latencies.
Scale elastically: Elastically scales to match data throughput without operator intervention.
[Diagram: connect to Kinesis streams or Firehose delivery streams, run standard SQL queries against the data streams, and let Analytics send the processed data to your analytics tools so you can create alerts and respond in real time]
11. Use SQL to build real-time applications
Easily write SQL code to process streaming data: connect to a streaming source, then continuously deliver SQL results.
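As a flavor of what that SQL looks like, here is a hypothetical tumbling-window query in the Kinesis Analytics SQL dialect, held as a string the way it would be passed to the service. The column names (`screen`, `event_type`) and application name are invented for illustration; `SOURCE_SQL_STREAM_001` is the default in-application input stream:

```python
# Hypothetical per-minute error-count query for X1-style screen errors.
APPLICATION_CODE = """
CREATE OR REPLACE STREAM "ERROR_COUNTS" (screen VARCHAR(64), errors INTEGER);

CREATE OR REPLACE PUMP "ERROR_PUMP" AS
  INSERT INTO "ERROR_COUNTS"
  SELECT STREAM "screen", COUNT(*) AS errors
  FROM "SOURCE_SQL_STREAM_001"
  WHERE "event_type" = 'error'
  GROUP BY "screen",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
"""

# The application would be created along these lines:
#   boto3.client("kinesisanalytics").create_application(
#       ApplicationName="x1-error-counts",
#       ApplicationCode=APPLICATION_CODE, ...)
```

A pump is just a continuous INSERT: it keeps moving rows from the source stream into the in-application output stream as data arrives.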
12. Amazon Kinesis at Comcast
Charlie Hammell, Solutions Architect, Comcast
14. The challenge
• Comcast now syndicates the X1 Platform to other video providers
• Syndication includes providing telemetry data (data related to performance and reliability), anonymized and secured, to improve the X1 experience:
  • Stream quality status
  • VOD usage
  • Error rates and status
• Solution: the data bus
15. Delivering X1 telemetry to partners
[Diagram: the Fairmount X1 Platform emits STB telemetry, mobile player actions, IP VOD player actions, and screen errors; the events flow through Service 1, Service 2, and Service 3 out to Partner 1, Partner 2, and Partner 3]
20. Why a data bus?
[Diagram: Producers 1-3 wired point to point to Consumers 1-4. Total connections: 14]
21. Characteristics of a data bus
Remember: syndication includes providing telemetry data, anonymized and secured, to cable partners.
• The bus decouples publishers and subscribers
• The bus has extensible features
• The bus has topics
• The bus is reusable
22. Where we started
[Diagram: X1 services feed an Apache Storm pipeline that delivers (1) X1 reporting and analytics (Tableau, other apps) and (2) partner feeds for Partner 1 and Partner 2]
23. Data bus challenges using Apache Kafka
• Mean Time Between Failures: two weeks
• Mean Time To Recovery: four hours
• Impact: affected syndication subscribers; extensive overtime effort for staff
• Root causes: data re-balancing, infrastructure issues, ZooKeeper problems, overloading by other users
• Weak or missing features:
  • Multi-tenant guardrails
  • Elastic scale
  • Security
  • Geo-distributed high availability
27. The data bus foundation
• Multi-tenancy
• Elastic scale
• Security
• High availability
28. Multi-tenancy
• Read and write limits
• Protects me from others (and others from me)
[Diagram: a producer app publishes through the KPL to a data bus stream (a stream/topic made of shards); a consumer app reads it back through the KCL]
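The "protects me from others" guardrail comes down to per-shard rate limits. Kinesis enforces them server-side (excess writes are rejected with ProvisionedThroughputExceededException), but a well-behaved producer throttles itself first. This is an illustrative token-bucket model of that idea, not the KPL's actual throttling code:

```python
class TokenBucket:
    """Throttle writes to a per-shard rate (e.g. 1,000 records/sec)."""

    def __init__(self, rate, now=0.0):
        self.rate = rate      # tokens replenished per second
        self.tokens = rate    # start full: one second of burst capacity
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at one second's worth.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller should back off and retry later

bucket = TokenBucket(rate=1000)
# Simulate 2 seconds of attempts at 2,000 records/sec: roughly the
# 1,000-token burst plus 1,000/sec steady state gets through.
sent = sum(bucket.allow(now=i / 2000) for i in range(4000))
```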
29. Elastic scale: how Kinesis scales
• Streams are made of shards
• Each shard ingests up to 1 MB/sec and up to 1,000 records/sec
• Each shard emits up to 2 MB/sec
• Scale Kinesis streams by splitting or merging shards
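Splitting and merging work because each shard owns a contiguous range of the 128-bit hash-key space, and Kinesis routes a record by the MD5 hash of its partition key. A sketch of that routing, assuming shards evenly split the space (the default when a stream is created; the set-top-box key is illustrative):

```python
import hashlib

def shard_for_key(partition_key, shard_count):
    """Pick the shard for a record the way Kinesis does: take the
    128-bit MD5 hash of the partition key and find which shard's
    hash-key range contains it (assuming an even split)."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    span = 2**128 // shard_count
    return min(h // span, shard_count - 1)

# The same partition key always maps to the same shard, so one
# device's events stay ordered even as total throughput grows.
shard = shard_for_key("stb-0001", 4)
```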
30. Elastic scale: how batching helps
[Diagram: many small user records (User Record 1 ... User Record ZZ) are aggregated into fewer, larger Kinesis records (Kinesis Record 1 ... Kinesis Record M), which are then collected into a single PutRecords request]
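The diagram's two stages can be sketched in a few lines. This keeps only the batching arithmetic: the real KPL wraps aggregated records in a protobuf envelope that carries per-record partition keys, and the size limits below are illustrative:

```python
def aggregate(user_records, max_bytes=25 * 1024):
    """Stage 1 (aggregation): pack many small user records into one
    larger Kinesis record, so each shard PUT carries more payload."""
    blob, size = [], 0
    for rec in user_records:
        if blob and size + len(rec) > max_bytes:
            yield b"".join(blob)
            blob, size = [], 0
        blob.append(rec)
        size += len(rec)
    if blob:
        yield b"".join(blob)

def collect(kinesis_records, max_records=500):
    """Stage 2 (collection): group Kinesis records into PutRecords
    calls of at most 500 records each."""
    for i in range(0, len(kinesis_records), max_records):
        yield kinesis_records[i : i + max_records]

users = [b"x" * 100 for _ in range(10_000)]  # 10,000 tiny user records
kinesis_records = list(aggregate(users))     # far fewer wire records
requests = list(collect(kinesis_records))    # fits in one PutRecords call
```

The payoff against the shard limits above: 10,000 one-hundred-byte records would blow the 1,000 records/sec cap in a second, but as ~40 aggregated records they consume almost none of it.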
39. Avro containers over streaming
[Diagram: the stream is chunked into 1 sec / 1 MB segments, and each segment is a full Avro container: the schema followed by binary data, so the schema is shipped again in every segment]
40. Data bus schema header
Message layout: magic bytes, then a compact core header (schema_id, reserved, major version, minor version, reserved bytes), then the Avro-encoded body.
60% Reduction!
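A sketch of packing and parsing such a header. The slide names the fields but not their widths, so the byte sizes and magic-byte values below are assumptions for illustration:

```python
import struct

MAGIC = b"\xdb\x01"  # illustrative magic bytes marking a data-bus message

def pack_header(schema_id, major, minor):
    """Prepend the core header to an Avro body. Assumed layout:
    2 magic bytes, 4-byte schema_id, 1-byte major and minor
    versions, 4 reserved (zero) bytes."""
    return MAGIC + struct.pack(">IBB4x", schema_id, major, minor)

def unpack_header(message):
    """Split a data-bus message into header fields and the Avro body."""
    if not message.startswith(MAGIC):
        raise ValueError("not a data-bus message")
    schema_id, major, minor = struct.unpack_from(">IBB", message, len(MAGIC))
    body = message[len(MAGIC) + 10:]  # header is 10 bytes after the magic
    return schema_id, major, minor, body

msg = pack_header(42, 1, 3) + b"<avro-encoded body>"
```

Twelve bytes of header per message versus a full schema per container is where the size reduction comes from.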
41. Avro records over streaming
[Diagram: the same 1 sec / 1 MB segments, but each now carries only the magic-byte header plus binary data; the schema itself no longer travels on the stream]
42. Data bus schema registry
[Diagram: the producer formats the stream to a schema and the consumer validates the stream against that schema; both resolve schemas from a shared schema registry instead of shipping them through Kinesis Streams. No schema on the wire = smaller payload]
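The registry's contract is small: register a schema once, get an ID, resolve the ID on read. A minimal in-memory stand-in (the real registry is a shared service; the Avro schema here is invented for illustration):

```python
class SchemaRegistry:
    """In-memory stand-in for the shared schema registry: producers
    register a schema once and stamp only its ID on each message;
    consumers resolve the ID back to the schema for validation."""

    def __init__(self):
        self._by_id = {}
        self._next = 1

    def register(self, schema):
        schema_id, self._next = self._next, self._next + 1
        self._by_id[schema_id] = schema
        return schema_id

    def lookup(self, schema_id):
        return self._by_id[schema_id]

registry = SchemaRegistry()
sid = registry.register({
    "type": "record", "name": "TuneEvent",
    "fields": [{"name": "channel", "type": "int"}],
})

# Producer sends (sid, avro_body); consumer decodes the body with:
schema = registry.lookup(sid)
```

Versioning the schemas by ID (plus the major/minor bytes in the header) is what lets producers evolve a schema without breaking consumers mid-stream.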
48. Retrospective
• Mean Time Between Failures: so far, ∞
• Mean Time To Recovery: 0
1. Multi-tenant guardrails: clear, and enforced by the platform
2. Elastic scale: OK via API (looking forward to a checkbox)
3. Security: IAM, SAML federation, cross-account trust
4. Multi-data-center high availability: yes
49. How to get started
• Pick a data flow, preferably a new one, and decide how much risk to take on: high impact comes with higher risk, low impact with lower risk
• Get an eager developer who wants the challenge (and the resume perks)
• Pitch it to the end consumer (if it isn't your own team)
• Choose a schema approach: it really matters
• Decide on a real-time processing framework: Spark, Storm, AWS Lambda, or Kinesis Analytics?
• Build a producer proxy to pull in the data; don't ask the producer to bother
• Build a consumer, or send the data to S3 through Firehose
• Evaluate and take next steps
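The producer-proxy step is the one that needs no buy-in from the producing team: the proxy pulls from whatever interface they already expose and forwards to the bus. A minimal sketch with the source and transport injected; `poll_source` and `send` are hypothetical names, and in practice `send` would wrap a Kinesis PutRecords call:

```python
def producer_proxy(poll_source, send, batch_size=100):
    """Minimal producer-proxy loop: pull events from an existing
    source and forward them to the data bus in batches, so the
    producing team doesn't have to change anything."""
    batch = []
    for event in poll_source():
        batch.append(event)
        if len(batch) >= batch_size:
            send(batch)
            batch = []
    if batch:
        send(batch)  # flush the final partial batch

# Exercising the loop with a fake source and an in-memory transport:
sent = []
producer_proxy(lambda: iter(range(250)), sent.append, batch_size=100)
```

Injecting the transport also makes the proxy trivially testable, which matters when it sits between your partners and their telemetry.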