This presentation discusses data distribution and ordering in Spark's Data Source V2, motivated by Apache Iceberg. Proper distribution and ordering of data is important for performance when writing and reading large datasets. Spark 3.2 introduces an API that lets connectors specify their required distribution and ordering, addressing issues in V1 where connectors could apply arbitrary transformations. Supported distribution options include ordered, clustered, and unspecified, and the API supports batch and streaming writes. Future work includes supporting distribution and ordering in table creation and improving partition handling. Proper data distribution and ordering is key to scaling write performance for connectors such as Iceberg.
5. Reliability
• Behavior of DataFrameWriter is not well defined
- Connectors interpret SaveMode differently
- SaveIntoDataSourceCommand vs InsertIntoDataSourceCommand
6. Reliability
• Validation rules are not consistent
- PreprocessTableCreation vs PreprocessTableInsertion
- No schema validation for path-based tables
11. Reliability
• Predictable and reliable behavior
- Clearly defined logical plans for all connectors
- Consistent validation rules
- Less delegation to connectors
12. Design choices
• Proper abstractions
- Connectors interact only with InternalRow and ColumnarBatch
- Mix-in traits for optional functionality
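The mix-in idea above can be sketched outside of Spark. The classes below are simplified stand-ins, not the real DSv2 interfaces: the engine probes a connector for an optional capability instead of delegating arbitrary behavior to it.

```python
from abc import ABC, abstractmethod

class WriteBuilder(ABC):
    """Base interface every connector must implement."""
    @abstractmethod
    def build_write(self) -> str:
        ...

class SupportsTruncate(ABC):
    """Optional mix-in: connectors that can truncate before writing."""
    @abstractmethod
    def truncate(self) -> "WriteBuilder":
        ...

class AppendOnlySink(WriteBuilder):
    def build_write(self) -> str:
        return "append"

class TruncatableSink(WriteBuilder, SupportsTruncate):
    def __init__(self):
        self._truncate = False
    def truncate(self) -> "WriteBuilder":
        self._truncate = True
        return self
    def build_write(self) -> str:
        return "overwrite" if self._truncate else "append"

def plan_overwrite(builder: WriteBuilder) -> str:
    # Engine side: check for the optional capability explicitly,
    # rather than trusting each connector to interpret the mode.
    if isinstance(builder, SupportsTruncate):
        return builder.truncate().build_write()
    raise TypeError("connector does not support truncation")
```

The engine's behavior is fully determined by which mix-ins a connector implements, which is what makes validation rules consistent across connectors.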
34. Data Source V1
• Connectors can apply arbitrary transformations on DataFrame
• Built-in connectors sort data within tasks using partition columns
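The within-task sort used by the built-in V1 file sources can be simulated as follows (a minimal sketch with hypothetical names): sorting a task's rows by the partition columns means each output file can be written sequentially, with at most one file open at a time.

```python
from itertools import groupby

def write_task(rows, partition_cols):
    """Simulate one V1 write task: sort rows by partition columns,
    then emit one output file per distinct partition value."""
    key = lambda r: tuple(r[c] for c in partition_cols)
    files = []
    for part_value, group in groupby(sorted(rows, key=key), key=key):
        # In a real writer this would open, fill, and close one file.
        files.append((part_value, list(group)))
    return files

rows = [{"day": 2, "v": 1}, {"day": 1, "v": 2}, {"day": 2, "v": 3}]
files = write_task(rows, ["day"])
```

Without the sort, a task touching many partition values would have to keep that many files open simultaneously.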
35. Data Source V2
• No way for connectors to control the distribution and ordering of incoming data (SPARK-23889)
• Severe performance issues unless explicitly handled by the user
• Blocks migration to V2
• Fixed in upcoming Spark 3.2
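The Spark 3.2 fix can be modeled conceptually: the connector declares a required distribution and sort order, and the engine satisfies them with a shuffle and a per-partition sort before invoking the write. This is a hedged Python simulation; the real API is a set of Java interfaces in Spark, and all names below (`ClusteredDistribution`, `IcebergLikeWrite`, `prepare`) are illustrative, not the actual classes.

```python
from dataclasses import dataclass, field

@dataclass
class ClusteredDistribution:
    cols: list  # rows with equal values in these columns co-locate

@dataclass
class SortOrder:
    col: str
    ascending: bool = True

class IcebergLikeWrite:
    """Hypothetical connector: asks the engine to cluster by the
    partition column and sort within tasks, instead of reshuffling
    the data itself."""
    required_distribution = ClusteredDistribution(["day"])
    required_ordering = [SortOrder("day")]

def prepare(rows, write, num_partitions=2):
    # Engine side: satisfy the declared distribution with a hash
    # shuffle, then apply the declared per-partition sort order.
    parts = [[] for _ in range(num_partitions)]
    cols = write.required_distribution.cols
    for r in rows:
        parts[hash(tuple(r[c] for c in cols)) % num_partitions].append(r)
    for p in parts:
        for order in reversed(write.required_ordering):
            p.sort(key=lambda r: r[order.col], reverse=not order.ascending)
    return parts
```

Because the engine does the shuffling, each distinct clustering value ends up in exactly one write task, and each task sees its rows in the requested order, which is what avoids the small-file and open-file problems without user intervention.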
43. Current state
• Available and fully functional in master for batch queries
• Structured Streaming support is in progress (SPARK-34183)
44. Future work
• Distribution and ordering in CREATE TABLE
• Ability to control the number of shuffle partitions
• Coalesce partitions during adaptive query execution