This presentation discusses data distribution and ordering in Spark's Data Source V2, motivated by Apache Iceberg. Proper distribution and ordering of data is important for performance when writing and reading large datasets. Spark 3.2 introduces an API that lets connectors specify their required distribution and ordering, addressing issues in V1 where connectors could apply arbitrary transformations. Supported distribution options include ordered, clustered, and unspecified, and the API supports batch and streaming writes. Future work includes supporting distribution and ordering in table creation and improving partition handling. Proper data distribution and ordering is key to scaling write performance for connectors such as Iceberg.
5. Reliability
• Behavior of DataFrameWriter is not well defined
- Connectors interpret SaveMode differently
- SaveIntoDataSourceCommand vs InsertIntoDataSourceCommand
6. Reliability
• Validation rules are not consistent
- PreprocessTableCreation vs PreprocessTableInsertion
- No schema validation for path-based tables
11. Reliability
• Predictable and reliable behavior
- Clearly defined logical plans for all connectors
- Consistent validation rules
- Less delegation to connectors
12. Design choices
• Proper abstractions
- Connectors interact only with InternalRow and ColumnarBatch
- Mix-in traits for optional functionality
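The mix-in idea above can be sketched outside of Spark. The classes below are simplified stand-ins, not the real DSv2 interfaces: the engine probes a connector for an optional capability instead of delegating arbitrary behavior to it.

```python
from abc import ABC, abstractmethod

class WriteBuilder(ABC):
    """Base interface every connector must implement."""
    @abstractmethod
    def build_write(self) -> str:
        ...

class SupportsTruncate(ABC):
    """Optional mix-in: connectors that can truncate before writing."""
    @abstractmethod
    def truncate(self) -> "WriteBuilder":
        ...

class AppendOnlySink(WriteBuilder):
    def build_write(self) -> str:
        return "append"

class TruncatableSink(WriteBuilder, SupportsTruncate):
    def __init__(self):
        self._truncate = False
    def truncate(self) -> "WriteBuilder":
        self._truncate = True
        return self
    def build_write(self) -> str:
        return "overwrite" if self._truncate else "append"

def plan_overwrite(builder: WriteBuilder) -> str:
    # Engine side: check for the optional capability explicitly,
    # rather than trusting each connector to interpret the mode.
    if isinstance(builder, SupportsTruncate):
        return builder.truncate().build_write()
    raise TypeError("connector does not support truncation")
```

The engine's behavior is fully determined by which mix-ins a connector implements, which is what makes validation rules consistent across connectors.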
34. Data Source V1
• Connectors can apply arbitrary transformations on DataFrame
• Built-in connectors sort data within tasks using partition columns
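The within-task sort used by the built-in V1 file sources can be simulated as follows (a minimal sketch with hypothetical names): sorting a task's rows by the partition columns means each output file can be written sequentially, with at most one file open at a time.

```python
from itertools import groupby

def write_task(rows, partition_cols):
    """Simulate one V1 write task: sort rows by partition columns,
    then emit one output file per distinct partition value."""
    key = lambda r: tuple(r[c] for c in partition_cols)
    files = []
    for part_value, group in groupby(sorted(rows, key=key), key=key):
        # In a real writer this would open, fill, and close one file.
        files.append((part_value, list(group)))
    return files

rows = [{"day": 2, "v": 1}, {"day": 1, "v": 2}, {"day": 2, "v": 3}]
files = write_task(rows, ["day"])
```

Without the sort, a task touching many partition values would have to keep that many files open simultaneously.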
35. Data Source V2
• No way for connectors to control the distribution and ordering of incoming data (SPARK-23889)
• Severe performance issues unless explicitly handled by the user
• Blocks migration to V2
• Fixed in upcoming Spark 3.2
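The Spark 3.2 fix can be modeled conceptually: the connector declares a required distribution and sort order, and the engine satisfies them with a shuffle and a per-partition sort before invoking the write. This is a hedged Python simulation; the real API is a set of Java interfaces in Spark, and all names below (`ClusteredDistribution`, `IcebergLikeWrite`, `prepare`) are illustrative, not the actual classes.

```python
from dataclasses import dataclass, field

@dataclass
class ClusteredDistribution:
    cols: list  # rows with equal values in these columns co-locate

@dataclass
class SortOrder:
    col: str
    ascending: bool = True

class IcebergLikeWrite:
    """Hypothetical connector: asks the engine to cluster by the
    partition column and sort within tasks, instead of reshuffling
    the data itself."""
    required_distribution = ClusteredDistribution(["day"])
    required_ordering = [SortOrder("day")]

def prepare(rows, write, num_partitions=2):
    # Engine side: satisfy the declared distribution with a hash
    # shuffle, then apply the declared per-partition sort order.
    parts = [[] for _ in range(num_partitions)]
    cols = write.required_distribution.cols
    for r in rows:
        parts[hash(tuple(r[c] for c in cols)) % num_partitions].append(r)
    for p in parts:
        for order in reversed(write.required_ordering):
            p.sort(key=lambda r: r[order.col], reverse=not order.ascending)
    return parts
```

Because the engine does the shuffling, each distinct clustering value ends up in exactly one write task, and each task sees its rows in the requested order, which is what avoids the small-file and open-file problems without user intervention.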
43. Current state
• Available and fully functional in master for batch queries
• Structured Streaming support is in progress (SPARK-34183)
44. Future work
• Distribution and ordering in CREATE TABLE
• Ability to control the number of shuffle partitions
• Coalesce partitions during adaptive query execution