Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix using Flink - Snehal Nagmote & Pallavi Phadnis

Over 137 million members worldwide enjoy TV series and feature films across a wide variety of genres and languages on Netflix, which produces user behavior data at petabyte scale. At Netflix, our client logging platform collects and processes this data to power recommendations, personalization, and many other services that enhance the user experience. Built with Apache Flink, this platform processes hundreds of billions of events and a petabyte of data per day, at 2.5 million events/sec with sub-millisecond latency. The processing involves a series of data transformations, such as decryption and enrichment with customer, geo, and device information using microservice-based lookups.
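
As a rough consistency check on the figures quoted above, the sketch below converts the per-second rate into a daily event count and an average event size. It assumes, purely for illustration, that 2.5 million events/sec is sustained around the clock and that a petabyte is 10^15 bytes; the abstract states neither, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_day(events_per_sec: float) -> float:
    """Daily event count if the given rate were sustained all day (an assumption)."""
    return events_per_sec * SECONDS_PER_DAY


def avg_event_bytes(bytes_per_day: float, events_per_sec: float) -> float:
    """Average event size implied by a daily data volume and a sustained rate."""
    return bytes_per_day / events_per_day(events_per_sec)


if __name__ == "__main__":
    rate = 2.5e6        # events/sec, from the abstract
    petabyte = 1e15     # bytes; decimal petabyte assumed
    print(f"events/day: {events_per_day(rate):.3e}")
    print(f"avg bytes/event: {avg_event_bytes(petabyte, rate):.0f}")
```

Under these assumptions the rate works out to roughly 216 billion events per day, in line with "hundreds of billions", and an average of a few kilobytes per event.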

The transformed and enriched data is then used by multiple data consumers for a variety of applications, such as improving user experience with A/B tests, tracking application performance metrics, and tuning algorithms. When each consumer reads the dataset independently, multiple batch jobs perform redundant reads and incur heavy processing costs. To avoid this, we have developed a config-driven, centralized, managed platform on top of Apache Flink that reads this data once and routes it to multiple streams based on dynamic configuration. This has improved computation efficiency and reduced both costs and operational overhead.
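
The read-once, route-to-many idea can be sketched in a few lines. This is plain Python, not the Flink API and not the platform's actual implementation: the `Route` class, its field names, and the in-memory lists standing in for sinks are all hypothetical, and filters are given as Python callables rather than parsed expressions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]


@dataclass
class Route:
    """One hypothetical routing rule: filter, then project, then write to a sink."""
    name: str
    predicate: Callable[[Event], bool]   # stands in for a filter expression
    fields: List[str]                    # stands in for a projection
    sink: List[Event]                    # in-memory stand-in for a topic or table


def route_once(events: Iterable[Event], routes: List[Route]) -> None:
    """Make a single pass over the source and fan matching events out to every route."""
    for event in events:
        for route in routes:
            if route.predicate(event):
                route.sink.append({f: event.get(f) for f in route.fields})


impressions: List[Event] = []
searches: List[Event] = []
routes = [
    Route("impressions", lambda e: e.get("type") == "Presented", ["type", "profile"], impressions),
    Route("searches", lambda e: e.get("type") == "Search", ["type", "query"], searches),
]
events = [
    {"type": "Presented", "profile": "p1", "extra": 1},
    {"type": "Search", "query": "q", "profile": "p2"},
    {"type": "Play", "profile": "p1"},
]
route_once(events, routes)
```

Each source event is read once regardless of how many routes exist, so adding a consumer means adding a `Route` entry rather than another full read of the dataset.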

Stream processing at scale, while ensuring that the production systems remain scalable and cost-efficient, brings interesting challenges. In this talk, we will share how we leverage Apache Flink to achieve this, the challenges we faced, and what we learned while running one of the largest Flink applications at Netflix.

Published in: Technology

  1. Massive Scale Data Processing. Pallavi Phadnis, Snehal Nagmote. Flink Forward SF 2019
  2. Agenda
     ● Consolidated Logging (CL) Overview
     ● High Level Architecture of CL platform
     ● Log Processing at Scale
     ● Event Extractor Use Case
     ● Monitoring and Alerting
     ● Impact of Flink based Platform
  3. Consolidated Logging (CL)
  4. Consolidated Logging: Build an integrated solution to provide insights into user behavior and application performance metrics through client-side logging.
  5. Use Cases Powered By CL
     ● Personalization
     ● Recommendations
     ● A/B Experimentation
     ● Application Performance
  6. Consolidated Logging
     Event Types (300+): Log, ProfileIdentify, Presented, NavigationLevel, Focus, ..., Play
     × Device Platforms / App Versions (10+): TVUI, Android, iOS, Web, ...
     = Log Events: 100s of billions of events / day, 1+ petabyte of user behavior data per day
  8. Legacy Pipeline / Flink Based Platform (architecture diagram)
     Diagram labels: Landing Service Kafka Event Extractor CL App Kafka Streams Elasticsearch Hive tables Landing Service SQS Log Processing Server Kafka CL Streaming App CL ETL CL DW (Hive) Kafka CL Router App 13 Keystone routes S3
  9. CL App
  10. CL App Features
      ● Generic log processing application - supports different logging specifications
      ● Real-time processing
        ○ Data transformations
        ○ Data enrichment - Membership information, Geo, Device type
          ■ Joins
      ● Single source of truth with unified output schema
      ● Supports different data sinks: Kafka/Hive
      ● SLA
        ○ RPS: 3.5 million events per sec at peak, Latency: < 3ms
  11. CL App Design
      ● Stateless Flink Application (Flink 1.4, Kafka 1.1)
        ○ At-least once processing
      ● Isolation of concerns through separate Flink jobs for different use cases/sink types
      ● Different job DAGs with common framework library: Fan In / Fan Out
  12. Common Log Processing Framework (architecture diagram)
      Diagram labels: Log Consumer Config Reader (FP) Data Enrichment Data Transformations Spec Parser Data Sink Raw events Processed events Kafka Kafka Hive / Iceberg CL Schema / App Schema Request Type & Version Source Segregated sources Multiple sinks Raw events Hive Data Partitioning Events Backup
  13. Learnings & Best Practices
      ● Embarrassingly parallel job (parallelism over 2000)
        ○ Uniform CPU utilization with high number of partitions on source Kafka topic
      ● High memory pressure and GC pause on JM - Recovery failure/restart loop
        ○ Memory leak in archiving execution history (FLINK-10066)
        ○ Scaling bottleneck of Kafka source's union state (FLINK-10122)
      ● Overwhelmed coordinator due to thundering herd problem with high parallelism (KIP-266)
  14. Data compression - a factor to consider
  15. Learnings & Best Practices
      ● Data compression ratio was worse for Parquet and Kafka (~ 4x)
        ○ Upstream Kafka producer batching difference increased data entropy
      ● Backlog in Kafka can lead to sudden load on external micro-services
      ● Kafka backpressure leads to task failures
        ○ Duplicate events
      ● Guice dependency injection conflicts with Flink
        ○ classloader.resolve-order=parent-first
  16. Event Extractor
  17. Event Extractor Use Case (diagram)
      Diagram labels: Personalization Pipeline CL Consumers User clicks User Searches App perf metrics Impressions CL Stream (Transformed and enriched) Personalization stream Search stream Impressions stream Experimentation stream Search Pipeline Impressions Pipeline A/B Experimentation Pipeline Consumer Insights Pipeline Exploratory Analysis Customer Service Tool
  18. Keystone Routes For CL
  19. Problems with CL Legacy Pipeline
      ● Growth/Scale
        ○ 3.5 million events/sec
        ○ Reading same data multiple times
          ■ Compute redundancy
          ■ Scale Kafka infrastructure for outgoing bytes
          ■ Operational Overhead
      ● High Compute and Operational cost
  20. Keystone Routes For CL Event Extractor
  21. What is Event Extractor?
      ● Stateless Single Flink Application
      ● Read data once, apply processing and route it to multiple streams
      ● Configuration driven Processing, without code change
      ● SQL Support on Stream
      ● Filter, Transformation and Projection support on stream
      ● Out of box metrics for users
  22. Event Extractor User Interface
      ● User configuration in YAML
      ● Configs are managed in version control and updated in S3
      ● Example config
          filterExpression: field1 = 'Presented' and field2 like '%impressionToken%' and field3 not like '%storyArt%'
          projectionExpression: field_name1, field_name2, field_name3, field_name5
          transformations: {OutputFieldName: inner_field, fieldName: top_level_field, nestedFieldName: inner_field, type: type}
          sinkDetails: {sinkType: kafka, name: topic_name}
          ownerName: email-address
          routeName: unique_name
  23. Event Extractor Design (architecture diagram)
      Diagram labels: Config Reader SQL Parser Config Parser Transformation Projection User Config Management Pipeline Filter Function Schema Builder Elastic Search Sink Kafka Sink Hive Sink User Configs via S3 CL Enriched Stream Hive Multiple Kafka Sinks Event Extractor
  24. Challenges with Event Extractor
      ● Scaling single Flink Application
      ● Lack of Isolation
        ○ Isolated by type of sink application writes to
        ○ Deployment per sink type (Kafka, Hive, Elasticsearch)
      ● Back pressure is shared between multiple consumers
        ○ Consumer Kafka topics are created in the same cluster
        ○ Canaries and testing before onboarding new config
  25. Learnings and Best Practices
      ● Buildup of network pressure caused S3 checkpoint failures due to socket timeouts
        ○ Job goes into restart loop due to high frequency of checkpoint failures
        ○ Better G1GC and increase S3 timeouts
      ● Tuning parallelism to avoid unbalanced CPU utilization
        ○ Extensive CPU Flame Graphs and system metrics to identify bottlenecks
        ○ Setting parallelism in multiples of Kafka partitions and task slots to achieve better CPU utilization
  26. Learnings and Best Practices
      ● Flink Kafka Consumer needs continuous stream to progress high watermark (FLINK-5479)
        ○ StickyPartitioner Producer skips producing data to out of sync partitions
        ○ Setting stickyPartitioner.minQualifiedIsrRatio=1.0 helps to produce data to out of sync partitions
      ● Outlier Container/Broker (due to bad hardware)
        ○ Consumer gets non-linear traffic pattern (stuck consumer alert)
        ○ Producer throws BatchExpiredTimeout Exception and increase in checkpoint failures
  27. Deployment
      ● Keystone (Self-Serve UI) for deployment of streaming apps
        ○ Out of box ELK stack support for application logs
        ○ Automated Alerts integration with Atlas
      ● Deployment Strategy
        ○ Minimize Duplicates, Checkpoints are stored in S3
      ● Restart Strategy
        ○ Fine-grained Recovery
  28. Monitoring and Alerting
  29. Monitoring and Alerting
  30. CL Platform Benefits
      ● Improved Data Processing
        ○ Can handle large payloads compared to legacy pipeline
        ○ Improved error handling
      ● Reduced Data Loss
        ○ Reduced points of failure
        ○ Ability to backfill or reprocess historic raw events
      ● Reduced Cost & Operational Overhead
        ○ Legacy tables decommissioned and reduced storage redundancy
        ○ Read once and route to different sinks through Event Extractor
      ● Single Source of Truth
        ○ Single source of truth (SSOT) for CL data in the data warehouse
        ○ Schema consistency across CL components and tools
  31. Thank you.
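
The parallelism guidance on slide 25 can be illustrated with a small calculation of how evenly Kafka partitions spread over source subtasks. The sketch below assumes a simple round-robin assignment, which is a simplification and not necessarily how the Flink Kafka connector distributes partitions; the function names and the sample numbers are hypothetical.

```python
from typing import List, Tuple


def partitions_per_subtask(num_partitions: int, parallelism: int) -> List[int]:
    """Count partitions per subtask under an assumed round-robin assignment."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    counts = [0] * parallelism
    for partition in range(num_partitions):
        counts[partition % parallelism] += 1
    return counts


def spread(num_partitions: int, parallelism: int) -> Tuple[int, int]:
    """Return (min, max) partitions per subtask; equal values mean a balanced load."""
    counts = partitions_per_subtask(num_partitions, parallelism)
    return min(counts), max(counts)


if __name__ == "__main__":
    print(spread(96, 48))    # partition count divides evenly across subtasks
    print(spread(100, 48))   # some subtasks carry one extra partition
```

When the two numbers do not divide evenly, a few subtasks read more partitions than the rest, which is one plausible source of the unbalanced CPU utilization the slide describes.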