Batch Processing at Scale
with Flink & Iceberg
Andreas Hailu
Vice President, Goldman Sachs
Goldman Sachs Data Lake
● Platform allowing users to generate batch data pipelines without writing any code
● Data producers register datasets, making metadata available
○ Dataset schema, source and access, batch frequency, etc.
○ Flink batch applications generated dynamically
● Consumers subscribe to datasets for updates delivered to warehouses
● Producers and consumers are decoupled
● Scale
○ 162K unique datasets
○ 140K batches/day
○ 4.2MM batches/month
[Architecture diagram: producer source data flows through a Registry Service into generated ETL jobs and the Lake (HDFS, S3), exposed via a browseable catalog; warehousing targets include Redshift, SAP IQ/ASE, and Snowflake.]
Batch Data Strategy
● Lake operates using copy-on-write enumerated batches
● Extracted data is merged with existing data to create a new batch
● Supports both milestoned and append merges
○ Milestoned merge builds out records so that the records themselves contain the as-of data
■ No time travel required
■ Done per key; a "linked list" of time-series records
■ Immutable, retained forever
○ Append merge simply appends incoming data to existing data
● Merged data is stored as Parquet/Avro; snapshots and deltas are generated per batch
○ Data exported to the warehouse on batch completion as either snapshot or incremental loads
● Consumers always read data from the last completed batch
● The last 3 batches of merged data are retained for recovery purposes
Milestoning Example - Batch 1

Staging Data:
| First Name | Last Name | Profession | Date |
| Art | Vandelay | Importer | May-31-1990 |

Merged Data:
| lake_in_id | lake_out_id | lake_from | lake_thru | First Name | Last Name | Profession | Date |
| 1 | 999999999 | May-31-1990 | 11/30/9999 | Art | Vandelay | Importer | May-31-1990 |
Milestoning Example - Batch 2

Staging Data:
| First Name | Last Name | Profession | Date |
| Art | Vandelay | Importer-Exporter | June-30-1990 |

Merged Data:
| lake_in_id | lake_out_id | lake_from | lake_thru | First Name | Last Name | Profession | Date |
| 1 | 1 | May-31-1990 | 11/30/9999 | Art | Vandelay | Importer | May-31-1990 |
| 2 | 999999999 | May-31-1990 | June-30-1990 | Art | Vandelay | Importer | May-31-1990 |
| 2 | 999999999 | June-30-1990 | 11/30/9999 | Art | Vandelay | Importer-Exporter | June-30-1990 |
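The two batches above can be reproduced with a minimal sketch of a milestoned, copy-on-write merge. This is an illustration, not the production implementation: the record shape, `key` field, and function names are hypothetical, and sentinel values follow the tables above.

```python
OPEN = 999_999_999   # sentinel lake_out_id for live records
EOT = "11/30/9999"   # sentinel lake_thru for open-ended records

def milestone_merge(merged, staging, batch_id):
    """Merge one staging batch into the milestoned table (copy-on-write).

    Updates close the existing live record and insert two time-sliced
    rows, extending the per-key 'linked list' of record versions.
    """
    out = []
    staged = {r["key"]: r for r in staging}
    for rec in merged:
        if rec["lake_out_id"] == OPEN and rec["key"] in staged:
            new = staged.pop(rec["key"])
            # Close the previously live record as of the prior batch.
            out.append({**rec, "lake_out_id": batch_id - 1})
            # Re-insert the old image, time-sliced up to the change date...
            out.append({**rec, "lake_in_id": batch_id,
                        "lake_thru": new["date"]})
            # ...and the new image, open-ended from the change date.
            out.append({"key": new["key"], "profession": new["profession"],
                        "date": new["date"], "lake_in_id": batch_id,
                        "lake_out_id": OPEN, "lake_from": new["date"],
                        "lake_thru": EOT})
        else:
            out.append(rec)  # dead or untouched records carry forward
    for new in staged.values():  # brand-new keys
        out.append({"key": new["key"], "profession": new["profession"],
                    "date": new["date"], "lake_in_id": batch_id,
                    "lake_out_id": OPEN, "lake_from": new["date"],
                    "lake_thru": EOT})
    return out
```

Running batch 1 over an empty table, then batch 2 with the Importer-Exporter update, yields the one-row and three-row merged tables shown above.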
Job Graph - Extract
[Job graph: a DataSource reads Batch N source data → Map/FlatMap operators transform it into Avro, enrich it, and validate data quality → a DataSink writes to the staging directory. A parallel Map/FlatMap path accumulates Bloom filters, partition values, etc. into an empty sink.]
Job Graph - Merge
[Job graph: a DataSource reads Batch N staging data; two filtered DataSources read Batch N-1 merged data, one selecting live records that are present in the Bloom filter, the other selecting dead records or records not in the Bloom filter. The live, in-filter stream is keyBy()-ed and CoGrouped with the staging stream to merge staging records with merged records; the dead/not-in-filter stream bypasses the CoGroup. A DataSink writes the Batch N merge directory (snapshot & delta).]
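The routing in the merge graph can be sketched with a toy Bloom filter built over staged keys: only previous-batch records that are live and that the filter says might be staged need to go through the CoGroup. The class and helper below are illustrative, not the production operators.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; the real pipeline accumulates these during extract."""
    def __init__(self, size_bits=8192, hashes=3):
        self.size, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits >> p & 1 for p in self._positions(key))

def route_for_merge(prev_merged, staged_keys_filter, is_live):
    """Split the previous batch into records that must go through the
    CoGroup (live AND possibly staged) and records that can bypass it."""
    cogroup, bypass = [], []
    for rec in prev_merged:
        if is_live(rec) and staged_keys_filter.might_contain(rec["key"]):
            cogroup.append(rec)
        else:
            bypass.append(rec)
    return cogroup, bypass
```

A false positive only sends an extra record through the CoGroup, which is correct but slower; a record that bypasses is guaranteed not to be in the staging change set.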
Merge Details
● Staging data is merged with existing live records
○ Some niche exemptions for certain use cases
● Updates result in the closure of the existing record and insertion of a new record
○ lake_out_id < 999999999 - "dead"
● Live records are typically what consumers query, as they contain the time-series data
○ lake_out_id = 999999999 - "live"
● Over time, serialization of records not sent to the CoGroup hinders runtime fitness
○ Dead records and records excluded by the Bloom filter must still be written to the new batch merge directory
○ More time spent rewriting records in the CoGroup than actually merging
● Dead and live records are bucketed by file; live files are read, dead files are copied
○ Substantial runtime reduction as data volume grows for patterns where ≥ 50% of the data is composed of dead records
● Append merges copy data from the previous batch
● Both optimizations require periodic compaction to tame the overall file count
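The liveness-bucketing optimization can be sketched as follows: files holding only dead records are copied into the new batch directory byte-for-byte, while only live files are deserialized and merged. Names and file shapes are hypothetical.

```python
def merge_batch(files, staging, merge_live):
    """Sketch of the dead/live file-bucketing optimization.

    files: list of {"liveness": "dead"|"live", "records": [...]}.
    Dead files are copied untouched (no deserialization/reserialization);
    live files are actually read and merged with the staging data.
    """
    new_files, rewritten = [], 0
    for f in files:
        if f["liveness"] == "dead":
            new_files.append(dict(f))   # cheap copy into the new batch dir
        else:
            new_files.append({"liveness": "live",
                              "records": merge_live(f["records"], staging)})
            rewritten += 1
    return new_files, rewritten
```

When ≥ 50% of the data sits in dead files, at least half of the batch becomes a copy instead of a rewrite, which is where the runtime reduction comes from; the trade-off is the growing file count that later compaction must tame.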
Partitioning
● Can substantially improve batch turnover time
○ Data is merged against its own partition, reducing the overall volume of data written per batch
● Dataset must have a field that supports partitioning of the data
○ Date, timestamp, or integer
● Changes how data is stored
○ Different underlying directory structure; consumers must be aware
○ Registry service stores metadata about the latest batch for a partition
● Merge end result can be different
○ Partition fields can't be changed once set
● Not all datasets have a field to partition on
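The win from partitioning can be sketched in a few lines: only partitions touched by the staging change set are merged and rewritten, while the rest carry forward unchanged. The function and field names are illustrative.

```python
def partitioned_merge(existing, staging, partition_of, merge):
    """Sketch of a partition-wise merge.

    existing: {partition: records}. Each staging record is merged only
    against its own partition; untouched partitions carry forward with
    no data rewritten.
    """
    by_part = {}
    for rec in staging:
        by_part.setdefault(partition_of(rec), []).append(rec)
    out = {}
    for part, records in existing.items():
        if part in by_part:
            out[part] = merge(records, by_part.pop(part))
        else:
            out[part] = records          # carried forward, nothing rewritten
    for part, records in by_part.items():  # brand-new partitions
        out[part] = merge([], records)
    return out
```

Since change-set volumes stay roughly constant while total volume grows, the fraction of partitions that must be rewritten per batch shrinks over time.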
Challenges
● Change set volumes per batch tend to stay consistent over time, but overall data volume increases
● Data producer & consumer SLAs tend to be static
○ Data must be made available 30 minutes after the batch begins
○ Data must be available by 14:30 EST in order to fulfill EOD reporting
● We own the implementation, not the data
○ The same code runs for every dataset
○ No control over fields, types, batch size, partitioning strategy, etc.
● Support different use cases
○ Daily batch to 100+ batches/day
○ Milestoned & append batches
○ Snapshot feeds, incremental loads
● Merge optimizations so far only help ingest apps
○ Data is consumed in many ways once ingested
○ User Spark code; internal processes exporting snapshot and incremental loads to warehouses
Iceberg
● Moving primary storage from HDFS → S3 offered a chance for a batch data strategy review
● Iceberg's metadata layer offers interesting features
○ Manifest files recording statistics
○ Hidden partitioning
■ Reading data looks the same client-side, regardless of if/how the table is partitioned
■ Tracking of partition metadata no longer required
■ Filtering blocks out with Parquet predicates is good; not reading them at all is better
● Not all datasets use Parquet
■ Consumers benefit in addition to ingest apps
○ V2 table format
■ Performant merge-on-read potential
● Batch retention managed with snapshots
Iceberg - Partitioning
● Tables maintain metadata files that facilitate query planning
● Planning determines which files are required for a query
○ Unnecessary files are not read; a single metadata lookup rather than multiple IOPs
● Milestoned tables partitioned by record liveness
○ Live records bucketed together, dead records bucketed together
○ "select distinct(Profession) from dataset where lake_out_id = 999999999 and lake_from >= 7/1/1990 and lake_thru < 8/29/1990"
○ Ingest app no longer responsible for the implementation
● Can further be partitioned by a producer-specified field in the schema
● Table implementation can change while consumption patterns don't
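The planning step can be sketched as a pure metadata operation: each manifest entry records the partition values of a data file, so a query like the liveness example above skips dead-record files without ever opening them. This is an Iceberg-style illustration, not Iceberg's actual planner API.

```python
def plan_scan(manifest, predicate):
    """Sketch of metadata-driven file pruning.

    manifest: list of {"path": ..., "partition": {...}} entries, one per
    data file. Files whose partition values fail the predicate are never
    opened, unlike row-level Parquet predicate filtering.
    """
    return [e["path"] for e in manifest if predicate(e["partition"])]
```

For the milestoned layout, `predicate` would select only the live partition, so a `lake_out_id = 999999999` query touches none of the dead-record files.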
Iceberg - V2 Tables
● V2 tables support a merge-on-read strategy
○ Deltas applied to the main table in lieu of rewriting files every batch
● The traditional ingest CoGroup step already marked records for insert, update, delete, and unchanged
● Read only the required records for the CoGroup
○ Output becomes a bounded changelog DataStream
○ Unchanged records are no longer emitted
● GenericRecord transformed to RowData and given the delta-appropriate RowKind association when written to the Iceberg table
○ RowKind.INSERT for new records
○ RowKind.DELETE + RowKind.INSERT for updates
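The changelog emission can be sketched as a diff between the previous image and the incoming change set, mirroring the RowKind mapping above. The dict-based records and the `None`-means-delete convention are illustrative, not the production CoGroup.

```python
def to_changelog(previous, incoming):
    """Sketch of the CoGroup output as a bounded changelog.

    previous/incoming: {key: value}; an incoming value of None marks a
    producer-driven delete. Unchanged records emit nothing; updates become
    a DELETE + INSERT pair (RowKind.DELETE / RowKind.INSERT on the sink).
    """
    ops = []
    for key, new in incoming.items():
        old = previous.get(key)
        if new is None:
            if old is not None:
                ops.append(("DELETE", key, old))        # RowKind.DELETE
        elif old is None:
            ops.append(("INSERT", key, new))            # RowKind.INSERT
        elif old != new:
            ops.append(("DELETE", key, old))            # update = delete...
            ops.append(("INSERT", key, new))            # ...plus insert
    return ops
```

Dropping unchanged records is where the merge-on-read saving comes from: the delta written per batch is proportional to the change set, not the table.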
Iceberg - V2 Tables
● The Iceberg Flink connector uses equality deletes
○ Identifies deleted rows by ≥ 1 column values
○ A data row is deleted if its values equal the delete columns
○ Doesn't require knowing where the rows are
○ Rows are physically removed when files are compacted
○ Positional deletes, by contrast, require knowing where the row to delete is located
● Records enriched with an internal field carrying a unique identifier for deletes
○ Random 32-bit alphanumeric ID created during the extract phase
○ Consumers only read data with the schema in the registry
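Equality-delete semantics can be shown in a few lines: at read time a data row is dropped if its identifier columns match any row in a delete file, with no need to know the row's file position. The `uid` field name is hypothetical, standing in for the internal identifier described above.

```python
def read_with_equality_deletes(data_rows, delete_rows, key_fields):
    """Sketch of applying equality deletes at read time.

    A data row is dropped if its key_fields values match a row in the
    delete set; positions within files are irrelevant (unlike positional
    deletes, which must record where each deleted row lives).
    """
    deleted = {tuple(d[f] for f in key_fields) for d in delete_rows}
    return [r for r in data_rows
            if tuple(r[f] for f in key_fields) not in deleted]
```

Using a single unique field keeps `deleted` small and the per-row probe cheap, which is the "select equality delete fields wisely" advice from the summary.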
Iceberg - V2 Tables Maintenance
● Over time, inserts and deletes can lead to many small data and delete files
○ The small-files problem, plus more metadata stored in manifest files
● Periodically compact files during downtime
○ Downtime determined from ingestion schedule metadata in the Registry
○ Compaction creates a new snapshot, so reads are not impacted
○ Deletes are applied to the data files
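The maintenance step can be sketched as follows: small data files are rewritten into one larger file with outstanding equality deletes applied, so later reads no longer pay the merge-on-read cost. File shapes and the `uid` key are hypothetical.

```python
def compact(data_files, delete_files, key_field="uid"):
    """Sketch of downtime compaction for a V2 table.

    data_files/delete_files: lists of files, each a list of row dicts.
    Rows matching any equality delete are dropped, and the survivors are
    rewritten into a single data file; the delete files are retired.
    """
    deleted = {d[key_field] for f in delete_files for d in f}
    merged = [r for f in data_files for r in f
              if r[key_field] not in deleted]
    return [merged], []   # one compacted data file, no delete files left
```

Because the compacted result lands in a new snapshot, readers on the previous snapshot are unaffected while the file and manifest counts shrink.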
Iceberg - V2 Tables Performance Testing
● Milestoning
○ Many updates and deletes
○ 10 million records over 8 batches
■ ~1.2GB staging data/batch
○ 10GB Snappy-compressed data in total
○ 51% observed reduction in overall runtime over 8 batches when compared to traditional file-based storage
○ Compaction runtime 51% faster than traditional merge runtime
● Append
○ Data is only appended, no updates/deletes
○ 500K records over 5 batches
○ 1TB Snappy-compressed data in total
○ 63% observed reduction in overall runtime over 5 batches
○ Compaction runtime 24% faster than average traditional merge runtime
Summary
● Select equality delete fields wisely
○ Using just 1 field minimizes read overhead
● Compaction approach needs to be considered early
○ Scheduling - built as part of the application
● Partition to facilitate query patterns
Q&A
Thanks!
Learn more at GS.com/Engineering
The term 'engineer' in this section does not refer to a licensed engineer or an individual offering engineering services to the general public under applicable law.
These materials ("Materials") are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete, and/or up to date, and it should not be relied on as such. The Materials do not constitute advice, nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition of Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2022 The Goldman Sachs Group, Inc. All rights reserved.

 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 

Último

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Último (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Milestoning Example
5
First Name Last Name Profession Date
Art Vandelay Importer-Exporter June-30-1990
Staging Data
Merged Data
lake_in_id lake_out_id lake_from lake_thru First Name Last Name Profession Date
1 1 May-31-1990 11/30/9999 Art Vandelay Importer May-31-1990
2 999999999 May-31-1990 June-30-1990 Art Vandelay Importer May-31-1990
2 999999999 June-30-1990 11/30/9999 Art Vandelay Importer-Exporter June-30-1990
Batch 2
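The two tables above can be reproduced with a small bookkeeping sketch. This is plain Python, not the Lake's actual code: the field names come from the slides, but `milestone_update` and its argument shapes are hypothetical.

```python
LIVE = 999999999          # lake_out_id sentinel for live records
INF_THRU = "11/30/9999"   # lake_thru sentinel for open-ended records

def milestone_update(live_row, new_values, effective_date, batch_id):
    """Apply one update to a live record, as in the Batch 2 example:
    the prior row is closed, and two new rows carry the time series."""
    # close the existing record: it is only valid through the previous batch
    closed = dict(live_row, lake_out_id=batch_id - 1)
    # re-issue the old values, now bounded by the update's effective date
    rebounded = dict(live_row, lake_in_id=batch_id, lake_thru=effective_date)
    # insert the new values, live from the effective date onward
    current = dict(new_values, lake_in_id=batch_id, lake_out_id=LIVE,
                   lake_from=effective_date, lake_thru=INF_THRU)
    return [closed, rebounded, current]

batch1_row = {"lake_in_id": 1, "lake_out_id": LIVE,
              "lake_from": "May-31-1990", "lake_thru": INF_THRU,
              "First Name": "Art", "Last Name": "Vandelay",
              "Profession": "Importer", "Date": "May-31-1990"}
staged = {"First Name": "Art", "Last Name": "Vandelay",
          "Profession": "Importer-Exporter", "Date": "June-30-1990"}
rows = milestone_update(batch1_row, staged, "June-30-1990", batch_id=2)
```

Run against the Batch 1 row and the Batch 2 staging record, this yields exactly the three merged rows in the table: a key's history becomes a "linked list" of date-bounded records, so no time travel is needed to answer as-of queries.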
Job Graph - Extract
6
[Diagram: Extract Source Data (DataSource) → Transform into Avro → Enrichment → Validate Data Quality (Map, FlatMap) → Staging Directory (DataSink, Batch N); Accumulate Bloom Filters, Partitions, … → Empty Sink]
Job Graph - Merge
7
[Diagram: Read Staging Data (DataSource, Batch N); Read Merged Data (DataSource, Filter, Batch N-1) → Live Records → Records In BloomFilter; Read Merged Data (DataSource, Filter, Batch N-1) → Dead Records || Records Not in BloomFilter; keyBy() → CoGroup (Merge Staging Records with Merged Records) → Merge Directory, snapshot & delta (DataSink, Batch N)]
Merge Details
● Staging data is merged with existing live records
○ Some niche exemptions for certain use cases
● Updates result in closure of the existing record and insertion of a new record
○ lake_out_id < 999999999 - “dead”
● Live records are typically what consumers query, as they contain the time-series data
○ lake_out_id = 999999999 - “live”
● Over time, serialization of records not sent to the CoGroup hinders runtime fitness
○ Dead records & records filtered out by the Bloom filter must still be written to the new batch merge directory
○ More time spent rewriting records in CoGroup than actually merging
● Dead and live records bucketed by file; live records read, dead files copied
○ Substantial runtime reduction as data volume grows for patterns where ≥ 50% of data is composed of dead records
● Append merges copy data from previous batch
● Both optimizations require periodic compaction to tame overall file count
8
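The file-bucketing optimization above boils down to a per-record routing decision. A minimal sketch in plain Python, with a set standing in for the Bloom filter and illustrative field names:

```python
LIVE = 999999999

def route(record, staging_keys):
    """Decide whether a merged record must flow through the CoGroup
    or can be copied forward to the new batch untouched."""
    if record["lake_out_id"] != LIVE:
        return "copy"        # dead record: never merged again
    if record["key"] not in staging_keys:
        return "copy"        # live, but no staged change (Bloom-filter miss)
    return "cogroup"         # live and possibly updated in this batch

staging_keys = {"k1"}        # keys present in this batch's staging data
records = [
    {"key": "k1", "lake_out_id": LIVE},   # live + staged change -> merge
    {"key": "k2", "lake_out_id": LIVE},   # live, no change      -> copy
    {"key": "k1", "lake_out_id": 3},      # dead                 -> copy
]
routes = [route(r, staging_keys) for r in records]
```

A real Bloom filter can report false positives, so a few unchanged records still pass through the CoGroup; that only costs some extra work, never correctness.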
Partitioning
● Can substantially improve batch turnover time
○ Data merged against its own partition, reducing overall volume of data written in batch
● Dataset must have a field that supports partitioning for data requirements
○ Date, timestamp, or integer
● Changes how data is stored
○ Different underlying directory structure; consumers must be aware
○ Registry service stores metadata about latest batch for a partition
● Merge end result can be different
○ Partition fields can’t be changed once set
● Not all datasets have a field to partition on
9
Challenges
● Change set volumes per batch tend to stay consistent over time, but overall data volume increases
● Data producer & consumer SLAs tend to be static
○ Data must be made available 30 minutes after batch begins
○ Data must be available by 14:30 EST in order to fulfill EOD reporting
● Own the implementation, not the data
○ Same code run for every dataset
○ No control over fields, types, batch size, partitioning strategy, etc…
● Support different use cases
○ Daily batch to 100+ batches/day
○ Milestoned & append batches
○ Snapshot feeds, incremental loads
● Merge optimizations so far only help ingest apps
○ Data consumed in many ways once ingested
○ User Spark code, internal processes exporting snapshot and incremental loads to warehouses
10
Iceberg
● Moving primary storage from HDFS → S3 offered a chance to review the batch data strategy
● Iceberg’s metadata layer offers interesting features
○ Manifest files recording statistics
○ Hidden partitioning
■ Reading data looks the same client-side, regardless of if/how the table is partitioned
■ Tracking of partition metadata no longer required
■ Filtering blocks out with Parquet predicates is good; not reading them at all is better
● Not all datasets use Parquet
■ Consumers benefit in addition to ingest apps
○ V2 table format
■ Performant merge-on-read potential
● Batch retention managed with Snapshots
11
Iceberg - Partitioning
● Tables maintain metadata files that facilitate query planning
● Determines what files are required for a query
○ Unnecessary files not read; a single metadata lookup rather than multiple IOPs
● Milestoned tables partitioned by record liveness
○ Live records bucketed together, dead records bucketed together
○ “select distinct(Profession) from dataset where lake_out_id = 999999999 and lake_from >= 7/1/1990 and lake_thru < 8/29/1990”
○ Ingest app no longer responsible for implementation
● Can further be partitioned by producer-specified field in schema
● Table implementation can change while consumption patterns don’t
12
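The planning step described above can be sketched as a min/max check per file against the per-file statistics a manifest would record. The structure below is illustrative, not Iceberg's actual metadata model, and the integer date encoding is an assumption:

```python
def prune(files, live_only, from_min=None):
    """Return only the files a scan must read, using per-file
    partition values and column statistics."""
    needed = []
    for f in files:
        if live_only and f["partition"] != "live":
            continue                     # liveness partition pruned outright
        if from_min is not None and f["lake_from_max"] < from_min:
            continue                     # whole file below the predicate range
        needed.append(f)
    return needed

# Files as a manifest might describe them (dates encoded as YYYYMMDD ints)
files = [
    {"path": "live-0.parquet", "partition": "live", "lake_from_max": 19900731},
    {"path": "live-1.parquet", "partition": "live", "lake_from_max": 19900601},
    {"path": "dead-0.parquet", "partition": "dead", "lake_from_max": 19900731},
]
# Plan the example query: live records with lake_from >= 7/1/1990
scan = prune(files, live_only=True, from_min=19900701)
```

For the slide's example query, only one of the three files survives planning; the dead partition and the out-of-range live file are never opened, which helps every reader regardless of file format.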
Iceberg - V2 Tables
● V2 tables support a merge-on-read strategy
○ Deltas applied to main table in lieu of rewriting files every batch
● Traditional ingest CoGroup step already marked records for insert, update, delete, and unchanged
● Read only required records for CoGroup
○ Output becomes a bounded changelog DataStream
○ Unchanged records no longer emitted
● GenericRecord transformed to RowData and given delta-appropriate RowKind association when written to Iceberg table
○ RowKind.INSERT for new records
○ RowKind.DELETE + RowKind.INSERT for updates
13
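The CoGroup-to-changelog step above can be sketched as a flat-map from change markers to kind-tagged rows. Plain Python: string tags stand in for Flink's RowKind values, and the marker names are illustrative rather than the Lake's actual ones:

```python
def to_changelog(changes):
    """Expand CoGroup change markers into a bounded changelog:
    inserts emit one row, updates emit a DELETE + INSERT pair,
    and unchanged records emit nothing at all."""
    out = []
    for kind, old, new in changes:
        if kind == "insert":
            out.append(("INSERT", new))
        elif kind == "update":
            out.append(("DELETE", old))   # retract the previous version
            out.append(("INSERT", new))   # then emit the new one
        elif kind == "delete":
            out.append(("DELETE", old))
        # "unchanged" falls through: no rewrite cost under merge-on-read
    return out

changes = [("insert", None, {"id": "a1"}),
           ("update", {"id": "b2", "v": 1}, {"id": "b2", "v": 2}),
           ("unchanged", {"id": "c3"}, {"id": "c3"})]
rows = to_changelog(changes)
```

The payoff is in the last branch: under copy-on-write, every unchanged record had to be rewritten each batch, while a changelog only carries the delta.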
Iceberg - V2 Tables
● Iceberg Flink connector uses equality deletes
○ Identifies deleted rows by the values of ≥ 1 columns
○ A data row is deleted if its values equal the delete columns’ values
○ Doesn’t require knowing where the rows are
○ Deleted rows physically removed when files are compacted
○ Positional deletes, by contrast, require knowing where the row to delete is
● Records enriched with an internal field carrying a unique identifier for deletes
○ Random 32-bit alphanumeric ID created during extract phase
○ Consumers only read data with schema in registry
14
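Equality-delete semantics fit in a few lines: a data row is dropped when its values in the delete columns match any delete-file entry, with no file positions involved. A sketch in plain Python; `lake_uid` is a hypothetical name for the internal ID field:

```python
def apply_equality_deletes(data_rows, delete_rows, cols):
    """Drop any data row whose values in `cols` match a delete entry;
    only column values are compared, never row positions."""
    deleted = {tuple(d[c] for c in cols) for d in delete_rows}
    return [r for r in data_rows if tuple(r[c] for c in cols) not in deleted]

data = [{"lake_uid": "a1x", "v": 1},
        {"lake_uid": "b2y", "v": 2}]
deletes = [{"lake_uid": "a1x"}]          # contents of a delete file
remaining = apply_equality_deletes(data, deletes, cols=["lake_uid"])
```

Because the match is a set lookup over the delete-column tuple, keying deletes on a single compact ID keeps both the delete files and the read-time overhead small — the same point the summary makes about choosing equality delete fields wisely.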
Iceberg - V2 Tables Maintenance
● Over time, inserts and deletes can lead to many small data and delete files
○ Small-files problem, and more metadata stored in manifest files
● Periodically compact files during downtime
○ Downtime determined from ingestion schedule metadata in Registry
○ Creates a new snapshot; reads not impacted
○ Deletes applied to data files
15
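At its simplest, planning such a compaction is bin-packing small files up to a target size. A sketch in plain Python; the 128 MB target and skip thresholds are illustrative, not the Lake's actual settings:

```python
def plan_compaction(file_sizes, target=128 * 1024 * 1024):
    """Group small files into rewrite tasks of roughly `target` bytes;
    files already at or above the target are left alone."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if size >= target:
            continue                    # big enough already, skip
        current.append(size)
        current_size += size
        if current_size >= target:
            groups.append(current)      # one rewrite task is full
            current, current_size = [], 0
    if len(current) > 1:                # a lone leftover isn't worth rewriting
        groups.append(current)
    return groups

MB = 1024 * 1024
plan = plan_compaction([8 * MB, 16 * MB, 130 * MB, 64 * MB, 64 * MB])
```

Each group would be rewritten into one right-sized file with its equality deletes applied, shrinking both the file count and the manifest metadata; because this commits as a new snapshot, in-flight reads keep working against the old one.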
Iceberg - V2 Tables Performance Testing
● Milestoning
○ Many updates and deletes
○ 10 million records over 8 batches
■ ~1.2GB staging data/batch
○ 10GB Snappy-compressed data in total
○ 51% observed reduction in overall runtime over 8 batches when compared to traditional file-based storage
○ Compaction runtime 51% faster than traditional merge runtime
● Append
○ Data is only appended, no updates/deletes
○ 500K records over 5 batches
○ 1TB Snappy-compressed data in total
○ 63% observed reduction in overall runtime over 5 batches
○ Compaction runtime 24% faster than average traditional merge runtime
16
Summary
● Select equality delete fields wisely
○ Using just 1 field minimizes read overhead
● Compaction approach needs to be thought of early
○ Scheduling - built as part of application
● Partition to facilitate query patterns
17
Q&A
Thanks! Learn more at GS.com/Engineering
The term ‘engineer’ in this section refers to neither a licensed engineer nor an individual offering engineering services to the general public under applicable law. These materials (“Materials”) are confidential and for discussion purposes only. The Materials are based on information that we consider reliable, but Goldman Sachs does not represent that it is accurate, complete, and/or up to date, and it should not be relied on as such. The Materials do not constitute advice, nor is Goldman Sachs recommending any action based upon them. Opinions expressed may not be those of Goldman Sachs unless otherwise expressly noted. As a condition of Goldman Sachs presenting the Materials to you, you agree to treat the Materials in a confidential manner and not disclose the contents thereof without the permission of Goldman Sachs. © Copyright 2022 the Goldman Sachs Group, Inc. All rights reserved.