What’s easier than building a data pipeline nowadays? You add a few Apache Kafka clusters and a way to ingest data (probably over HTTP), design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse… wait, this looks like a lot of things, doesn’t it? And you probably want to make it highly scalable and available too. Join this session to learn best practices for building a data pipeline, drawn from my experience at Activision/Demonware. I’ll share the lessons we learned about scaling pipelines, not only in terms of volume, but also in terms of supporting more games and more use cases. You’ll also hear about message schemas and envelopes, Apache Kafka organization, topic naming conventions, routing, reliable and scalable producers and the ingestion layer, as well as stream processing.
Building Scalable and Extendable Data Pipeline for Call of Duty Games (Yaroslav Tkachenko, Activision) Kafka Summit NYC 2019
1. Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Yaroslav Tkachenko
Software Architect at Central Data Platform
Activision
14. Each approach has pros and cons
• Simple
• Low-latency connection
• Number of TCP connections per broker starts to look scary
• Really hard to do maintenance on Kafka clusters
• Flexible
• Easier to manage Kafka clusters
• Possible to do basic enrichment
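Based on the abstract and slide 23, the two approaches being weighed here appear to be producing to Kafka directly versus producing through an HTTP ingestion proxy. Below is a minimal, hypothetical sketch of the proxy approach in Python (Flask plus confluent-kafka); the endpoint path, topic naming scheme and header name are assumptions, not the actual Activision service.

```python
# Minimal sketch of an HTTP ingestion proxy in front of Kafka.
# The endpoint path, topic naming scheme and header name are hypothetical.
from flask import Flask, request, jsonify
from confluent_kafka import Producer

app = Flask(__name__)
producer = Producer({
    "bootstrap.servers": "kafka-ingestion:9092",  # hypothetical cluster address
    "acks": "all",                                # favour durability
    "enable.idempotence": True,
})

@app.route("/ingest/<event_type>", methods=["POST"])
def ingest(event_type):
    # A proxy is one natural place to do basic enrichment before Kafka.
    producer.produce(
        f"events.{event_type}",                    # hypothetical topic scheme
        key=request.headers.get("X-Title-Id", ""),
        value=request.get_data(),
    )
    producer.poll(0)                               # serve delivery callbacks
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=8080)
```

The trade-off the slide describes follows directly: clients keep a single short-lived HTTP connection to the proxy instead of long-lived TCP connections to every broker, which makes Kafka maintenance easier at the cost of an extra hop.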
18. Even if you can add more partitions
• Still can have bottlenecks within a partition (large messages)
• In case of reprocessing, it’s really hard to quickly add A LOT of new partitions AND remove them after
• Also, the number of partitions is not infinite
• … but there are other ways to solve the challenges above ;-)
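To make the per-partition bottleneck concrete, here is a small illustrative sketch. Kafka’s default partitioner hashes the message key with murmur2; `hash()` below is a simplified stand-in, and the keys and sizes are made up, but the point holds: every record with the same key lands on the same partition, so adding partitions does not spread a hot or oversized key.

```python
# Illustrative only: why a hot or oversized key bottlenecks one partition.
# Kafka's default partitioner uses murmur2 on the key; hash() is a simplified
# stand-in, so actual partition numbers will differ.
from collections import Counter

NUM_PARTITIONS = 12
events = [("title-123", 2_000_000), ("title-456", 500)] * 1000  # (key, bytes), hypothetical

bytes_per_partition = Counter()
for key, size in events:
    partition = hash(key) % NUM_PARTITIONS
    bytes_per_partition[partition] += size

# All traffic for "title-123" lands on a single partition no matter how many
# partitions exist, so the consumer of that one partition becomes the bottleneck.
print(bytes_per_partition.most_common(3))
```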
19. It’s not always about tuning. Sometimes we need more than one cluster. Different workloads require different topologies.
23. ● Ingestion (HTTP Proxy): long retention, high SLA
● Lots of consumers: medium retention, ACL
● Stream processing: short retention, more partitions
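As a rough illustration of how these per-cluster characteristics might translate into topic configuration, here is a sketch using confluent-kafka’s AdminClient. The cluster address, topic name, partition count and retention values are assumptions for illustration only.

```python
# Sketch: mapping per-cluster characteristics to topic configuration.
# Cluster address, topic name, counts and retention values are hypothetical.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka-stream-processing:9092"})

topics = [
    # Stream-processing cluster: short retention, more partitions.
    NewTopic("telemetry.matches", num_partitions=64, replication_factor=3,
             config={"retention.ms": str(6 * 60 * 60 * 1000)}),   # 6 hours
    # An ingestion cluster would instead use long retention, e.g. several days.
]

for topic, future in admin.create_topics(topics).items():
    future.result()  # raises if creation failed
    print(f"created {topic}")
```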
25. You can’t be sure about any improvements without load testing. Not only for a cluster, but for producers and consumers too.
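Kafka ships with kafka-producer-perf-test for exactly this, but a producer-side probe is also easy to sketch in Python. The broker address, topic and message size below are hypothetical; the point is to measure what your own producer code achieves, not just the cluster.

```python
# Minimal producer-side load probe (Kafka also ships kafka-producer-perf-test).
# Broker address, topic and message size are hypothetical.
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka-ingestion:9092"})
payload = b"x" * 1024          # 1 KiB synthetic message
num_records = 100_000

start = time.time()
for _ in range(num_records):
    while True:
        try:
            producer.produce("loadtest.events", value=payload)
            break
        except BufferError:
            producer.poll(0.1)  # local queue is full, wait for deliveries
    producer.poll(0)
producer.flush()
elapsed = time.time() - start

print(f"{num_records / elapsed:,.0f} msg/s, "
      f"{num_records * len(payload) / elapsed / 1024 / 1024:.1f} MiB/s")
```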
27. We need to keep the number of topics and partitions low
• More topics means more operational burden
• Number of partitions in a fixed cluster is not infinite
• Scaling Kafka is hard, autoscaling is very hard
32. Each approach has pros and cons
• Metadata simplifies monitoring
• Metadata helps consumers to consume precisely what they want
• These dynamic fields can and will change
• Very efficient utilization of topics and partitions
• It’s very hard to enforce any constraints with a topic name
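One way to keep topic names free of dynamic metadata is to carry that metadata inside a unified message envelope, which the summary later also advocates. This is a hypothetical sketch; the field names and values are assumptions, not the actual Activision envelope.

```python
# Sketch of a unified message envelope: dynamic metadata travels inside the
# message, so topic names only need to express the data type.
# All field names here are hypothetical.
import json
import time
import uuid

def wrap(event_type: str, title_id: str, env: str, schema_version: int, body: dict) -> bytes:
    envelope = {
        "message_id": str(uuid.uuid4()),
        "event_type": event_type,        # e.g. "match_end"
        "title_id": title_id,            # which game produced the event
        "environment": env,              # e.g. "prod" / "cert"
        "schema_version": schema_version,
        "produced_at": int(time.time() * 1000),
        "payload": body,
    }
    return json.dumps(envelope).encode("utf-8")

# Topic stays data-type based ("telemetry.match_end"), not per-game or per-env.
record = wrap("match_end", "title-123", "prod", 3, {"duration_s": 612})
```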
41. Schema Registry in Activision
• Custom Python application with Cassandra as a data store
• Single source of truth for all producers and consumers
• Supports immutability, versioning and basic validation
• Schema Registry clients use a lot of caching at different layers
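The caching the last bullet mentions could look like the following client-side sketch: since a published schema version is immutable, lookups can be cached indefinitely. The registry URL and endpoint path are hypothetical, not the actual Activision service.

```python
# Sketch of client-side caching for schema lookups. Published schema versions
# are immutable, so caching them forever is safe.
# The registry URL and endpoint path are hypothetical.
from functools import lru_cache
import requests

REGISTRY_URL = "http://schema-registry.internal:8000"   # hypothetical

@lru_cache(maxsize=1024)
def get_schema(subject: str, version: int) -> dict:
    # Cache hit: no network call. Cache miss: fetch once and keep it,
    # which is safe only because versions never change after publishing.
    resp = requests.get(f"{REGISTRY_URL}/schemas/{subject}/versions/{version}")
    resp.raise_for_status()
    return resp.json()
```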
42. Summary
• Kafka tuning and best practices matter
• Invest in good SDKs for producing and consuming data
• Unified message envelope and topic names make adding a new game almost effortless
• “Operational” stream processing makes it possible. Make sure you can support ad-hoc filtering and routing of data
• Topic names should express data types, not producer or consumer metadata
• Schema Registry is a must-have
43. Next Steps: Realtime Streaming ETL
• Kafka all the way!
• Kafka Streams for all ETL steps*
• Kafka Connect for all downstream integrations
• Spark for file consolidation
* except aggregations
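As a sketch of what “Kafka Connect for all downstream integrations” could look like in practice, here is a hypothetical registration of an S3 sink through the Kafka Connect REST API. The Connect host, connector name, topic and bucket are made up; the config keys follow Confluent’s S3 sink connector.

```python
# Sketch: registering a downstream integration through the Kafka Connect
# REST API. Host, connector name, topic and bucket are hypothetical;
# the config keys are those of Confluent's S3 sink connector.
import requests

connector = {
    "name": "telemetry-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "telemetry.match_end",
        "s3.bucket.name": "cod-telemetry-archive",
        "s3.region": "us-west-2",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "10000",
        "tasks.max": "4",
    },
}

resp = requests.post("http://connect.internal:8083/connectors", json=connector)
resp.raise_for_status()
```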