Matthew Powers gave a talk on optimizing data lakes for Apache Spark. He discussed community goals such as standardizing method signatures, and advocated for Spark helper libraries like spark-daria and spark-fast-tests. Powers explained how to build better data lakes with techniques such as partitioning on the fields that queries filter on, which enables data skipping and can speed up queries significantly. He also covered modern Scala libraries, incremental updates, compacting small files, and using Delta Lake to more easily update partitioned data lakes over time.
7. Community goals
• Passionate about community unification (standardization of method signatures)
• Need to find optimal scalafmt settings
• Strongly dislike UDFs
• Spark tooling?
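The dislike of UDFs likely comes from UDFs being opaque to Spark's Catalyst optimizer, whereas functions built from native Column expressions stay optimizable. A minimal sketch (the `fullName` function and column names are hypothetical) of the Column-in, Column-out signature style that libraries like spark-daria standardize on:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// A UDF version would be a black box to the Catalyst optimizer:
// val fullNameUdf = udf((first: String, last: String) => s"$first $last")

// The native-Column version is built from built-in functions, so Spark
// can optimize it, and it follows a standard signature: Column in, Column out.
def fullName(first: Column, last: Column): Column =
  concat_ws(" ", first, last)

// Usage, given a DataFrame df with first_name / last_name columns:
// df.withColumn("full_name", fullName(col("first_name"), col("last_name")))
```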
14. Small file problem
• Incrementally updating a lake creates a lot of small files
• We can lay out the stored data so it's easy to compact
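One common compaction approach (paths and file counts below are hypothetical, not from the talk) is to read an ingest-date directory full of small files and rewrite it as a few large files. Organizing the lake by ingest date is what makes this easy: each compaction run touches only one directory.

```scala
import org.apache.spark.sql.SparkSession

// Sketch, assuming a Spark session and an ingest-date layout.
val spark = SparkSession.builder().getOrCreate()

val df = spark.read.parquet("s3a://my-lake/ingest_date=2019-04-15")

// Rewrite the same data as ~4 files of roughly equal size.
// In practice numFiles would be derived from the partition's total
// bytes divided by a target file size (e.g. 1 GB).
df.repartition(4)
  .write
  .mode("overwrite")
  .parquet("s3a://my-lake-compacted/ingest_date=2019-04-15")
```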
15. Suppose we have a CSV data lake
• CSV data lake is constantly being updated
• Want to convert it to a Parquet data lake
• Want incremental updates every hour
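One way to get hourly incremental updates (a sketch, not necessarily the approach from the talk; paths and schema are hypothetical) is Spark Structured Streaming with a file source: the checkpoint tracks which CSV files have already been processed, so each run picks up only new files.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// Streaming file sources require an explicit schema.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType)
))

spark.readStream
  .schema(schema)
  .csv("s3a://my-csv-lake/")
  .writeStream
  .format("parquet")
  .option("path", "s3a://my-parquet-lake/")
  .option("checkpointLocation", "s3a://my-parquet-lake/_checkpoints/")
  .trigger(Trigger.Once()) // process new files, then exit; schedule hourly
  .start()
  .awaitTermination()
```

`Trigger.Once()` lets an external scheduler (e.g. an hourly cron job) drive the cadence while the checkpoint provides exactly-once file tracking.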
20. Why partition data lakes?
• Data skipping
• Massively improve query performance
• I’ve seen queries run 50-100 times faster on partitioned lakes
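Data skipping works because a disk-partitioned lake encodes a filter column in the directory structure, so Spark only lists and reads the matching directories. A sketch (the `country` column and paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Partition on a field that queries commonly filter on.
// This creates country=US/, country=CN/, ... directories on disk.
spark.read.parquet("s3a://my-lake/")
  .write
  .partitionBy("country")
  .parquet("s3a://my-partitioned-lake/")

// A filter on the partition column now reads only the matching
// directory instead of scanning the whole lake (data skipping):
spark.read.parquet("s3a://my-partitioned-lake/")
  .filter(col("country") === "US")
```

The 50-100x speedups mentioned above come from exactly this: a query that filters on the partition key can ignore the vast majority of files in the lake.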
28. Real partitioned data lake
• Updates every 3 hours
• Has 5 million files
• 15,000 files are being added every day
• Still great for a lot of queries
34. What talk should I give next?
• Best practices for the Spark community
• Ditching SBT for the Mill build tool
• Testing Spark code
• Running Spark Scala code in PySpark
35. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT