Netflix processes trillions of events and petabytes of data a day in the Keystone data pipeline, which is built on top of Apache Flink. As Netflix has scaled up its annual slate of original productions, enjoyed by more than 150 million global members, data integration across the streaming service and the studio has become a priority. Scalably integrating data across hundreds of different data stores, in a way that lets us holistically optimize cost, performance, and operational concerns, presented a significant challenge. Learn how we expanded the scope of the Keystone pipeline into the Netflix Data Mesh, our real-time, general-purpose data transportation platform for moving data between Netflix systems. The Keystone platform's unique approach to declarative configuration and schema evolution, as well as our approach to unifying batch and streaming data and processing, will be covered in depth.
12. Data Mesh: Composable Data Processing
Data Transport Problems
- Significant duplication of effort across pipelines and teams.
- Delays in bringing new pipelines online, and growing maintenance overhead from existing pipelines.
- Uneven implementation of best practices.
- Need for lower-latency data transportation and warehousing for operational reporting.
- Correctness issues related to error recovery in distributed systems.
13. Data Mesh: Composable Data Processing
[Diagram: Extract, Transform, Load. Flink processing extracts from sources (RDS, Cassandra, Airtable, logging data, ...) and loads the results into sinks (RDS, Cassandra, the S3 data warehouse, Elasticsearch, ...).]
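To make the extract-transform-load shape concrete, here is a minimal Flink DataStream sketch. The in-memory source, the upper-casing map, and the print sink are placeholders for real source connectors, Data Mesh processors, and sink connectors, not code from the actual platform.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EtlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Extract: a placeholder in-memory source; a real pipeline would read from a
        // CDC or Kafka source connector instead.
        DataStream<String> extracted = env.fromElements("row-1", "row-2", "row-3");

        // Transform: a trivial transformation standing in for a Flink stream processor.
        DataStream<String> transformed = extracted.map(String::toUpperCase);

        // Load: print to stdout; a real pipeline would use an Iceberg, Elasticsearch,
        // or RDS sink connector.
        transformed.print();

        env.execute("etl-sketch");
    }
}
```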
14. Data Mesh: Composable Data Processing
[Diagram: source connectors read from sources (a service, RDS, Cassandra) and publish Avro records onto streams (Stream 1-4); a stream processor consumes input streams and produces output streams; sink connectors deliver the results to sinks (EVCache, Elasticsearch, S3); a catalog tracks the streams.]
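Events travel between connectors and processors as Avro records. As a rough illustration of the producer side, the sketch below builds and binary-encodes an Avro GenericRecord; the Member schema and its fields are made up for the example, since real schemas are managed through the platform's catalog.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroRecordSketch {
    // Hypothetical schema, used only for this example.
    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Member\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"first\",\"type\":\"string\"},"
            + "{\"name\":\"last\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws IOException {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("id", 1L);
        record.put("first", "Ada");
        record.put("last", "Lovelace");

        // Encode the record to Avro binary, as a source connector would before
        // publishing the event onto a Data Mesh stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();

        System.out.println("Encoded " + out.size() + " bytes for schema " + SCHEMA.getName());
    }
}
```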
15. Data Mesh: Composable Data Processing
[Diagram: stream processors compose: the output stream of one stream processor (Stream 1, Stream 2) can be consumed as the input of another stream processor, or delivered to sinks.]
25. Data Mesh: Composable Data Processing
Overall Schema Evolution Approach
- Apache Avro schema format.
- Stream processors are deployed with fixed input and output schemas.
- Schema changes are managed by redeploying with new fixed input and output schemas.
- Processors can opt in to automatic schema upgrades.
- Most schema changes don't require a topic change.
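A minimal sketch of the Avro mechanics behind those points: a record encoded with an older writer schema is decoded against a newer reader schema that adds a defaulted field, the kind of compatible change that does not require a topic change. Both schemas and all field names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionSketch {
    public static void main(String[] args) throws IOException {
        // Writer schema: the schema the upstream processor was deployed with.
        Schema writerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Member\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"first\",\"type\":\"string\"}]}");

        // Reader schema: a later version that adds a defaulted field.
        Schema readerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Member\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"first\",\"type\":\"string\"},"
                + "{\"name\":\"city\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        // Encode one record with the old writer schema.
        GenericRecord record = new GenericData.Record(writerSchema);
        record.put("id", 1L);
        record.put("first", "Ada");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
        encoder.flush();

        // Decode it with the new reader schema; the missing field takes its default.
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord upgraded =
                new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
        System.out.println(upgraded); // {"id": 1, "first": "Ada", "city": "unknown"}
    }
}
```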
26. Data Mesh: Composable Data Processing
[Diagram: the Data Mesh Controller deploys a pipeline made up of a DB CDC source connector, a GraphQL Flink processor, and an Iceberg sink Flink processor that writes Iceberg data to S3.]
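None of the classes below are real Data Mesh APIs; this is only a hypothetical sketch of what a declarative definition of the slide's topology (CDC source, GraphQL processor, Iceberg sink) might look like, with the controller's deployment work reduced to a print loop.

```java
import java.util.List;

// Hypothetical sketch only: these types are not part of the real platform.
// A pipeline is modeled as an ordered list of connector/processor steps that a
// controller would turn into deployed connectors and Flink jobs.
public class PipelineSpecSketch {

    record Step(String kind, String config) {}

    record PipelineSpec(String name, List<Step> steps) {}

    public static void main(String[] args) {
        PipelineSpec spec = new PipelineSpec("movies-to-iceberg", List.of(
                new Step("db-cdc-source-connector", "database=movie_db"),
                new Step("graphql-flink-processor", "query=movieDetails"),
                new Step("iceberg-sink-flink-processor", "table=warehouse.movies")));

        // Stand-in for the Data Mesh Controller: just list what would be deployed.
        spec.steps().forEach(step ->
                System.out.println("deploy " + step.kind() + " [" + step.config() + "]"));
    }
}
```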
29. Data Mesh: Composable Data Processing
[Diagram: Avro Data Mesh topic -> Avro Iceberg sink -> physical S3 storage with a logical Iceberg schema.]
Physical Data Mesh storage (field id: name): 1: id, 2: first, 3: last
Physical S3 storage (column ids): 1, 2, 3
Logical Iceberg schema (field id: name): 1: id, 2: first, 3: last
30. Data Mesh: Composable Data Processing
[Diagram: Avro Data Mesh topic -> Avro Iceberg sink -> physical S3 storage with a logical Iceberg schema.]
Physical Data Mesh storage (field id: name): 1: id, 2: first, 3: last, 4: city
Physical S3 storage (column ids): 1, 2, 3, 4
Logical Iceberg schema (field id: name): 1: id, 2: first, 3: last
(A new field, 4: city, has been added to the Data Mesh schema and is written to S3, but the logical Iceberg schema does not expose it yet.)
31. Data Mesh: Composable Data Processing
[Diagram: Avro Data Mesh topic -> Avro Iceberg sink -> physical S3 storage with a logical Iceberg schema.]
Earlier schema version (field id: name): 1: id, 2: first, 3: last
Physical Data Mesh storage (field id: name): 1: id, 2: first, 3: last, 4: city
Physical S3 storage (column ids): 1, 2, 3, 4
Logical Iceberg schema (field id: name): 1: id, 2: first, 3: last, 4: city
(The logical Iceberg schema has now been evolved to include 4: city.)
32. Data Mesh: Composable Data Processing
[Diagram: Avro Data Mesh topic -> Avro Iceberg sink -> physical S3 storage with a logical Iceberg schema.]
Physical Data Mesh storage (field id: name): 1: id, 2: first_name, 3: last_name, 4: city
Physical S3 storage (column ids): 1, 2, 3, 4
Logical Iceberg schema, before (field id: name): 1: id, 2: first, 3: last, 4: city
Logical Iceberg schema, after (field id: name): 1: id, 2: first_name, 3: last_name, 4: city
(Fields 2 and 3 have been renamed to first_name and last_name; the field ids, and therefore the physical S3 columns, are unchanged.)
33. Data Mesh: Composable Data Processing
[Diagram: Avro Data Mesh topic -> Avro Iceberg sink -> physical S3 storage with a logical Iceberg schema.]
Physical Data Mesh storage (field id: name): 1: id, 2: first_name, 4: city, 5: last
Physical S3 storage (column ids): 1, 2, 3, 4, 5
Logical Iceberg schema (field id: name): 1: id, 2: first_name, 4: city, 5: last
Earlier schema versions (field id: name): 1: id, 2: first_name, 3: last_name, 4: city / 1: id, 2: first, 3: last, 4: city
(Field 3: last_name has been deleted and a new field, 5: last, added; the retired field id 3 is not reused.)
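The steps in slides 30 through 33 correspond to Iceberg schema evolution, which changes the logical schema by field id without rewriting the physical S3 data. Below is a rough sketch of that sequence using Iceberg's Java UpdateSchema API, assuming a Table has already been loaded from a catalog (setup omitted).

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

public class IcebergEvolutionSketch {

    // `table` is assumed to be loaded from a catalog elsewhere; only the
    // schema-evolution calls themselves are shown here.
    static void evolve(Table table) {
        // Slides 30-31: a new field is added; it gets the next field id (4).
        table.updateSchema()
             .addColumn("city", Types.StringType.get())
             .commit();

        // Slide 32: fields are renamed; the ids (2, 3) and the physical columns stay put.
        table.updateSchema()
             .renameColumn("first", "first_name")
             .renameColumn("last", "last_name")
             .commit();

        // Slide 33: a field is dropped and a new one added; the retired id (3) is
        // never reused, so the new "last" column gets id 5.
        table.updateSchema()
             .deleteColumn("last_name")
             .addColumn("last", Types.StringType.get())
             .commit();
    }
}
```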