Over 137 million members worldwide are enjoying TV series, feature films across a wide variety of genres and languages on Netflix. It leads to petabyte scale of user behavior data. At Netflix, our client logging platform collects and processes this data to empower recommendations, personalization and many other services to enhance user experience. Built with Apache Flink, this platform processes 100s of billion events and a petabyte data per day, 2.5 million events/sec in sub milliseconds latency. The processing involves a series of data transformations such as decryption and data enrichment of customer, geo, device information using microservices based lookups.
The transformed and enriched data is further used by multiple data consumers for a variety of applications such as improving user-experience with A/B tests, tracking application performance metrics, tuning algorithms. This causes redundant reads of the dataset by multiple batch jobs and incurs heavy processing costs. To avoid this, we have developed a config driven, centralized, managed platform, on top of Apache Flink, that reads this data once and routes it to multiple streams based on dynamic configuration. This has resulted in improved computation efficiency, reduced costs and reduced operational overhead.
Stream processing at scale while ensuring that the production systems are scalable and cost-efficient brings interesting challenges. In this talk, we will share about how we leverage Apache Flink to achieve this, the challenges we faced and our learnings while running one of the largest Flink application at Netflix.