Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage, where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data latency is one of many key considerations, as producers and consumers have their own commitments to satisfy. Consumers range from people and systems issuing queries to applications using engines like Spark, Hive, and Presto to transform data into refined datasets. Apache Iceberg not only lets our applications benefit from the consistency guarantees that matter when running on eventually consistent storage like S3, but also gives us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Batch Processing at Scale with Flink & Iceberg
1. Batch Processing at Scale
with Flink & Iceberg
Andreas Hailu
Vice President, Goldman Sachs
2. Goldman Sachs Data Lake
● Platform allowing users to
generate batch data pipelines
without writing any code
● Data producers register datasets,
making metadata available
○ Dataset schema, source and access,
batch frequency, etc…
○ Flink batch applications generated
dynamically
● Consumers in warehouses subscribe to
datasets for updates
● Producers and consumers
decoupled
● Scale
○ 162K unique datasets
○ 140K batches/day
○ 4.2MM batches/month
[Architecture diagram: a Producer's source data is registered with the Registry Service, ingested by ETL into the Lake (HDFS/S3), and exported by Warehousing to Redshift, SAP IQ/ASE, and Snowflake; datasets are discoverable through a Browseable Catalog.]
3. Batch Data Strategy
● Lake operates using copy-on-write enumerated batches
● Extracted data merged with existing data to create a new batch
● Support both milestoned and append merges
○ Milestoned merge builds out records such that records themselves contain the as-of data
■ No time-travel required
■ Done per key, “linked-list” of time-series records
■ Immutable, retained forever
○ Append merge simply appends incoming data to existing data
● Merged data is stored as Parquet/Avro; snapshots and deltas are generated per batch
○ Data is exported to the warehouse on batch completion as either snapshot or incremental loads
● Consumers always read data from last completed batch
● Last 3 batches of merged data are retained for recovery purposes
4. Milestoning Example
Batch 1 staging data:

| First Name | Last Name | Profession | Date |
|---|---|---|---|
| Art | Vandelay | Importer | May-31-1990 |

Merged data after Batch 1:

| lake_in_id | lake_out_id | lake_from | lake_thru | First Name | Last Name | Profession | Date |
|---|---|---|---|---|---|---|---|
| 1 | 999999999 | May-31-1990 | 11/30/9999 | Art | Vandelay | Importer | May-31-1990 |
5. Milestoning Example
Batch 2 staging data:

| First Name | Last Name | Profession | Date |
|---|---|---|---|
| Art | Vandelay | Importer-Exporter | June-30-1990 |

Merged data after Batch 2:

| lake_in_id | lake_out_id | lake_from | lake_thru | First Name | Last Name | Profession | Date |
|---|---|---|---|---|---|---|---|
| 1 | 1 | May-31-1990 | 11/30/9999 | Art | Vandelay | Importer | May-31-1990 |
| 2 | 999999999 | May-31-1990 | June-30-1990 | Art | Vandelay | Importer | May-31-1990 |
| 2 | 999999999 | June-30-1990 | 11/30/9999 | Art | Vandelay | Importer-Exporter | June-30-1990 |
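The close-and-insert mechanics implied by these two batches can be sketched in Java as below. This is a minimal reconstruction inferred from the example tables, not the platform's actual code; the Row shape and the merge signature are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal reconstruction of the close-and-insert milestoning shown above.
 *  Inferred from the example, not the platform's actual code. lake_in_id /
 *  lake_out_id track batch validity; lake_from / lake_thru track business dates. */
public class MilestoneMergeSketch {
    static final long LIVE = 999_999_999L;    // lake_out_id sentinel for live rows
    static final String OPEN = "11/30/9999";  // open-ended lake_thru

    record Row(long inId, long outId, String from, String thru,
               String firstName, String lastName, String profession, String date) {}

    /** Apply one staged change for a key against its existing merged rows. */
    static List<Row> merge(List<Row> existing, Row staged, long batchId) {
        List<Row> merged = new ArrayList<>();
        for (Row r : existing) {
            if (r.outId() == LIVE) {
                // 1) Close the live row in the batch dimension (inferred:
                //    lake_out_id becomes the last batch the row was live in).
                merged.add(new Row(r.inId(), batchId - 1, r.from(), r.thru(),
                        r.firstName(), r.lastName(), r.profession(), r.date()));
                // 2) Re-insert its state with business validity ending at the change date.
                merged.add(new Row(batchId, LIVE, r.from(), staged.date(),
                        r.firstName(), r.lastName(), r.profession(), r.date()));
            } else {
                merged.add(r); // dead rows are carried forward unchanged
            }
        }
        // 3) Insert the new state, business-valid from the change date onward.
        merged.add(new Row(batchId, LIVE, staged.date(), OPEN,
                staged.firstName(), staged.lastName(), staged.profession(), staged.date()));
        return merged;
    }
}
```

Running merge with the Batch 1 row, the Batch 2 staging record, and batchId = 2 yields exactly the three rows in the Batch 2 table above.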
6. Job Graph - Extract
[Job graph: a DataSource reads the batch N source data; Map/FlatMap operators transform it into Avro, apply enrichment, and validate data quality; a DataSink writes the records to the staging directory. A side branch accumulates Bloom filters, partition statistics, etc., and terminates in an empty sink.]
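A hedged sketch of this extract graph using Flink's DataSet API follows. The stub methods and the BloomFilterAccumulator class are hypothetical stand-ins for the operator logic the platform generates from dataset metadata.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;

public class ExtractJobSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // DataSource: read the raw source data for this batch.
        DataSet<String> raw = env.readTextFile(args[0]);

        // Map/FlatMap chain: transform into Avro, enrich, validate data quality.
        DataSet<GenericRecord> staged = raw
                .map(ExtractJobSketch::toAvro)      // transform into Avro
                .map(ExtractJobSketch::enrich)      // enrichment
                .filter(ExtractJobSketch::isValid); // data quality validation

        // DataSink: validated records land in the staging directory.
        staged.writeAsText(args[1]);

        // Side branch: accumulate Bloom filters, partition stats, etc. for the
        // merge job; the records themselves terminate in an empty sink.
        staged.map(new BloomFilterAccumulator())
              .output(new DiscardingOutputFormat<>());

        env.execute("extract");
    }

    /** Registers each record's key in an accumulator-backed Bloom filter
     *  (the accumulator type is platform-specific and omitted here). */
    static class BloomFilterAccumulator extends RichMapFunction<GenericRecord, GenericRecord> {
        @Override
        public GenericRecord map(GenericRecord record) {
            // e.g. bloomFilter.add(keyOf(record));
            return record;
        }
    }

    // Stubs standing in for the metadata-generated logic.
    static GenericRecord toAvro(String line) { return null; /* schema-driven parse */ }
    static GenericRecord enrich(GenericRecord r) { return r; }
    static boolean isValid(GenericRecord r) { return r != null; }
}
```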
7. Job Graph - Merge
[Job graph: two DataSource + Filter legs read the batch N-1 merged data, one keeping live records whose keys are in the Bloom filter, the other keeping dead records or records not in the Bloom filter; a third DataSource reads the batch N staging data. Staging records are keyBy()'d and CoGrouped with the live candidates to merge them, and a DataSink writes the batch N merge directory (snapshot & delta).]
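The merge graph might look roughly like the following DataSet API sketch. The read stub, key extraction, Bloom filter check, and MilestoneCoGroup body are hypothetical placeholders for the generated code.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class MergeJobSketch {
    static final long LIVE = 999_999_999L;

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<GenericRecord> staging  = read(env, args[0]); // batch N staging data
        DataSet<GenericRecord> previous = read(env, args[1]); // batch N-1 merged data

        // Only live records whose keys may be affected need to enter the CoGroup.
        DataSet<GenericRecord> candidates =
                previous.filter(r -> isLive(r) && bloomMightContain(key(r)));
        // Dead records, and records not in the Bloom filter, bypass the CoGroup
        // and are carried into the batch N merge directory unchanged.
        DataSet<GenericRecord> passthrough =
                previous.filter(r -> !isLive(r) || !bloomMightContain(key(r)));

        DataSet<GenericRecord> merged = staging
                .coGroup(candidates)
                .where(MergeJobSketch::key)   // keyBy() on the dataset's key fields
                .equalTo(MergeJobSketch::key)
                .with(new MilestoneCoGroup());

        // DataSink: snapshot & delta written to the batch N merge directory.
        merged.union(passthrough).writeAsText(args[2]);
        env.execute("merge");
    }

    /** Applies the close-and-insert milestoning from the earlier example. */
    static class MilestoneCoGroup
            implements CoGroupFunction<GenericRecord, GenericRecord, GenericRecord> {
        @Override
        public void coGroup(Iterable<GenericRecord> staged,
                            Iterable<GenericRecord> existing,
                            Collector<GenericRecord> out) {
            // close matching live records, emit rewritten + new records
        }
    }

    // Stubs for the metadata-generated pieces.
    static DataSet<GenericRecord> read(ExecutionEnvironment env, String dir) {
        return env.readTextFile(dir).map(MergeJobSketch::parse);
    }
    static GenericRecord parse(String line) { return null; /* schema-driven parse */ }
    static boolean isLive(GenericRecord r) { return ((Long) r.get("lake_out_id")) == LIVE; }
    static boolean bloomMightContain(String key) { return true; }
    static String key(GenericRecord r) { return String.valueOf(r.get("key")); }
}
```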
8. Merge Details
● Staging data is merged with existing live records
○ Some niche exceptions for certain use cases
● Updates result in the closure of the existing record and the insertion of a new record
○ lake_out_id < 999999999 - “dead”
● Live records are typically what consumers query, as they contain the time-series data
○ lake_out_id = 999999999 - “live”
● Over time, serialization of records not sent to the CoGroup hinders runtime performance
○ Dead records and records filtered out by the Bloom filter must still be written to the new batch’s merge directory
○ More time is spent rewriting records in the CoGroup than actually merging
● Dead and live records are bucketed by file: live records are read, dead files are copied (see the sketch after this list)
○ Substantial runtime reduction as data volume grows for patterns where ≥ 50% of the data is composed of dead records
● Append merges copy data from previous batch
● Both optimizations require periodic compaction to tame overall file count
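A minimal sketch of the dead-file optimization, assuming Hadoop FileSystem storage: a file known (from the bucketing metadata, not shown) to contain only dead records is copied byte-for-byte into the new batch's merge directory instead of being deserialized and rewritten.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class DeadFileCopySketch {
    /** Carry a batch N-1 file of dead records forward into batch N verbatim. */
    static void carryForward(Configuration conf, Path deadFile, Path batchNDir)
            throws Exception {
        FileSystem fs = deadFile.getFileSystem(conf);
        // No deserialization: dead records can never change in later batches,
        // so the bytes are reusable as-is.
        FileUtil.copy(fs, deadFile, fs, new Path(batchNDir, deadFile.getName()),
                false /* keep source */, conf);
    }
}
```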
9. Partitioning
● Can substantially improve batch turnover time
○ Data merged against its own partition, reducing overall volume of data written in batch
● The dataset must have a field that supports partitioning for its data requirements
○ Date, timestamp, or integer
● Changes how data is stored
○ Different underlying directory structure, consumers must be aware
○ Registry service stores metadata about latest batch for a partition
● Merge end result can be different
○ Partition fields can’t be changed once set
● Not all datasets have a field to partition on
10. Challenges
● Change set volumes per batch tend to stay consistent over time, but overall data
volume increases
● Data producer & consumer SLAs tend to be static
○ Data must be made available 30 minutes after batch begins
○ Data must be available by 14:30 EST in order to fulfill EOD reporting
● Own the implementation, not the data
○ The same code runs for every dataset
○ No control over fields, types, batch size, partitioning strategy etc…
● Support different use cases
○ Daily batch to 100+ batches/day
○ Milestoned & append batches
○ Snapshot feeds, incremental loads
● Merge optimizations so far only help ingest apps
○ Data consumed in many ways once ingested
○ User Spark code, internal processes exporting snapshot and incremental loads to warehouses
11. Iceberg
● Moving primary storage from HDFS → S3 offered chance for batch
data strategy review
● Iceberg’s metadata layer offers interesting features
○ Manifest files recording statistics
○ Hidden partitioning
■ Reading data looks the same client-side, regardless of if/how the table is partitioned
■ Tracking of partition metadata no longer required
■ Filtering out blocks with Parquet predicates is good; not reading them at all is better
● Not all datasets use Parquet
■ Consumers benefit in addition to ingest apps
○ V2 table format
■ Performant merge-on-read potential
● Batch retention managed with Snapshots
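For example, the "retain the last 3 batches" policy described earlier maps naturally onto Iceberg snapshot expiration. A minimal sketch with the core API (the timestamp cutoff is up to the caller):

```java
import org.apache.iceberg.Table;

public class RetentionSketch {
    /** Expire old snapshots while always keeping the last 3 batches' worth. */
    static void retainLastBatches(Table table, long olderThanMillis) {
        table.expireSnapshots()
             .expireOlderThan(olderThanMillis) // drop snapshots past the cutoff...
             .retainLast(3)                    // ...but always keep the last 3
             .commit();
    }
}
```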
12. Iceberg - Partitioning
● Tables maintain metadata files that facilitate query planning
● Determines which files are required for a query
○ Unnecessary files are not read - a single lookup rather than multiple IOPs
● Milestoned tables partitioned by record liveness (one possible spec is sketched after this list)
○ Live records bucketed together, dead records bucketed together
○ “select distinct(Profession) from dataset where lake_out_id = 999999999 and lake_from >= 7/1/1990 and lake_thru < 8/29/1990”
○ Ingest app no longer responsible for implementation
● Can further be partitioned by producer-specified field in schema
● Table implementation can change while consumption patterns
don’t
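One plausible way to express liveness partitioning (an assumption, not necessarily the talk's exact implementation) is an identity partition on lake_out_id: all live rows share the sentinel value and land in one partition, while closed rows land in per-closing-batch partitions. The schema below is abbreviated.

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import static org.apache.iceberg.types.Types.NestedField.required;

public class LivenessPartitionSketch {
    static PartitionSpec livenessSpec() {
        Schema schema = new Schema(
                required(1, "lake_in_id", Types.LongType.get()),
                required(2, "lake_out_id", Types.LongType.get()),
                required(3, "lake_from", Types.DateType.get()),
                required(4, "lake_thru", Types.DateType.get()),
                required(5, "profession", Types.StringType.get()));
        // Hidden partitioning: consumers keep writing the same predicates;
        // Iceberg maps lake_out_id = 999999999 onto the live partition at planning time.
        return PartitionSpec.builderFor(schema)
                .identity("lake_out_id")
                .build();
    }
}
```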
13. Iceberg - V2 Tables
● V2 tables support a merge-on-read strategy
○ Deltas applied to main table in lieu of rewriting files every batch
● Traditional ingest CoGroup step already marked records for insert,
update, delete, and unchanged
● Read only required records for CoGroup
○ Output becomes a bounded changelog DataStream
○ Unchanged records no longer emitted
● GenericRecord transformed to RowData and given
delta-appropriate RowKind association when written to Iceberg
table
○ RowKind.INSERT for new records
○ RowKind.DELETE + RowKind.INSERT for updates
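A sketch of that conversion follows; the field mapping is illustrative rather than the real schema-driven one.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.types.RowKind;
import org.apache.flink.util.Collector;

public class ChangelogConversionSketch {
    /** Convert an Avro record to RowData with the delta-appropriate RowKind. */
    static RowData toRowData(GenericRecord rec, RowKind kind) {
        return GenericRowData.ofKind(kind,
                rec.get("lake_in_id"),   // BIGINT
                rec.get("lake_out_id"),  // BIGINT
                StringData.fromString(String.valueOf(rec.get("profession"))));
    }

    /** An update becomes DELETE(old) + INSERT(new); unchanged records are not
     *  emitted at all, keeping the changelog bounded to the delta. */
    static void emitUpdate(GenericRecord oldRec, GenericRecord newRec,
                           Collector<RowData> out) {
        out.collect(toRowData(oldRec, RowKind.DELETE));
        out.collect(toRowData(newRec, RowKind.INSERT));
    }
}
```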
14. Iceberg - V2 Tables
● Iceberg Flink connector uses Equality deletes
○ Identifies deleted rows by the values of ≥ 1 columns
○ A data row is deleted if its values equal those of the delete columns
○ Doesn’t require knowing where the rows are
○ Deleted rows are physically removed when files are compacted
○ Positional deletes, by contrast, require knowing where the row to delete is located
● Records are enriched with an internal field carrying a unique identifier for deletes
○ Random 32-bit alphanumeric ID created during the extract phase
○ Consumers only read data with the schema in the Registry, so the internal field stays hidden
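Wiring that unique-ID field in as the equality delete column might look like the following sketch with the Iceberg Flink connector; the field name lake_unique_id is a hypothetical stand-in for the internal identifier.

```java
import java.util.Collections;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class EqualityDeleteSinkSketch {
    static void write(DataStream<RowData> changelog, TableLoader loader) {
        FlinkSink.forRowData(changelog)
                .tableLoader(loader)
                // A single equality field keeps delete-file matching cheap on read.
                .equalityFieldColumns(Collections.singletonList("lake_unique_id"))
                .append();
    }
}
```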
15. Iceberg - V2 Tables Maintenance
● Over time, inserts and deletes can lead to many small data and
delete files
○ Small files problem, and more metadata stored in manifest files
● Periodically compact files during downtime
○ Downtime determined from ingestion schedule metadata in Registry
○ Creates a new snapshot, reads not impacted
○ Deletes applied to data files
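A sketch of the compaction step using the Iceberg Flink rewrite action (scheduling from Registry metadata omitted; the target file size is illustrative):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.actions.Actions;

public class CompactionSketch {
    static void compact(Table table) {
        // Rewrites small data files into larger ones and applies pending
        // deletes, producing a new snapshot; readers are not blocked.
        Actions.forTable(table)
               .rewriteDataFiles()
               .targetSizeInBytes(512L * 1024 * 1024) // hypothetical target
               .execute();
    }
}
```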
16. Iceberg - V2 Tables Performance Testing
● Milestoning
○ Many updates and deletes
○ 10 million records over 8 batches
■ ~1.2GB staging data/batch
○ 10GB Snappy compressed data in total
○ 51% observed reduction in overall runtime over 8 batches when compared to
traditional file-based storage
○ Compaction runtime 51% faster than traditional merge runtime
● Append
○ Data is only appended, no updates/deletes
○ 500K records over 5 batches
○ 1TB Snappy compressed data in total
○ 63% observed reduction in overall runtime over 5 batches
○ Compaction runtime 24% faster than average traditional merge runtime
17. Summary
● Select equality delete fields wisely
○ Using just 1 field minimizes read overhead
● Compaction approach needs to be thought through early
○ Scheduling - build it as part of the application
● Partition to facilitate query patterns