Dali-Spark: Apache Spark Data Access at LinkedIn to Achieve Data Agility

At LinkedIn, changes to our complex data ecosystem can impose steep costs on application developers. We designed Dali to insulate developers from physical attributes of data such as storage format and location. At its core, Dali provides a catalog to define and evolve physical and virtual datasets, a dataset reader that allows applications to read datasets in different environments, and a collection of development tools to manage these datasets. In this talk, we will cover:
* What Dali is and how it simplifies a complex data ecosystem
* Dali as a unified data access layer at LinkedIn
* DaliSpark architecture
* Roadmap, including plans to open source Dali
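
The catalog-plus-reader idea described above can be illustrated with a small, self-contained sketch. It is not Dali's API: `Catalog`, `PhysicalDataset`, `UnionView`, `read_dataset`, the paths, and the in-memory `STORAGE` dictionary are all hypothetical names invented for this example, and JSON lines and CSV stand in for formats such as Avro and ORC. The point is only that the application asks for a dataset by name, so the catalog entry can change underneath it.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Row = Dict[str, str]


@dataclass(frozen=True)
class PhysicalDataset:
    """A dataset tied to a location and a storage format (hypothetical)."""
    location: str
    fmt: str


@dataclass(frozen=True)
class UnionView:
    """A virtual dataset defined as the union of other catalog entries."""
    sources: Tuple[str, ...]


Entry = Union[PhysicalDataset, UnionView]

# Stand-in for a storage layer; real data would live on HDFS or a blob store.
STORAGE: Dict[str, str] = {
    "/data/members/old.jsonl": '{"id": "1", "name": "Ada"}\n{"id": "2", "name": "Linus"}\n',
    "/data/members/new.csv": "id,name\n3,Grace\n",
}


def _decode_json_lines(text: str) -> List[Row]:
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def _decode_csv(text: str) -> List[Row]:
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


DECODERS: Dict[str, Callable[[str], List[Row]]] = {
    "json_lines": _decode_json_lines,
    "csv": _decode_csv,
}


class Catalog:
    """Maps logical dataset names to physical or virtual definitions."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def define(self, name: str, entry: Entry) -> None:
        self._entries[name] = entry

    def resolve(self, name: str) -> Entry:
        return self._entries[name]


def read_dataset(catalog: Catalog, name: str) -> List[Row]:
    """Read a dataset by logical name; callers never see path or format."""
    entry = catalog.resolve(name)
    if isinstance(entry, UnionView):
        rows: List[Row] = []
        for source in entry.sources:
            rows.extend(read_dataset(catalog, source))
        return rows
    return DECODERS[entry.fmt](STORAGE[entry.location])
```

In this sketch, an application that calls `read_dataset(catalog, "members")` keeps working when `"members"` is redefined from a single `PhysicalDataset` to a `UnionView` over an old-format and a new-format dataset; only the catalog entry changes.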

Published in: Software

  1. Data Agility with DaliSpark. Adwait Tumbde, LinkedIn
  2. “All problems in computer science can be solved by another layer of indirection.” David Wheeler
  3. Data Agility: Make data scientists more productive
  4. Data Agility: Dali (Tables + Views), Storage
  5. Data Agility: • Why views? • What problems does it solve? • Technology
  6. Data Agility: • Why views? • What problems does it solve? • Technology
  7. View of History: Navigational DBMS (programmers only!); SQL (anyone can query); MapReduce (programmers only!)
  8. “MapReduce: A Major Step Backwards.” Michael Stonebraker
  9. The Good Part: • Scale • Best engine for the job!
  10. A “De-constructed” Database: Tables; Queries & Optimizations; Catalog
  11. Storage; Physical Schema; Conceptual Schema; External Schema (three shown). Views: public interface, abstract physical details
  12. Storage; Physical Schema; Conceptual Schema; External Schema (three shown); Code. Dataset As A Service
  13. Data Agility: • Why views? • What problems does it solve? • Technology
  14. Challenges for Data Consumers: My Raw Data. Data models are in constant flux. Data cleaning hurts productivity! Have to change data processing logic everywhere!
  15. Challenge for Infrastructure Providers: My Raw Data. HARD TO CHANGE ANYTHING UNDERNEATH! Dependencies on path and hard-coded format. Hard to move to better formats without breaking everyone or copying data twice.
  16. Change is the Only Constant! Storage; DB; Format; Tables; Engines
  17. Data Format Evolution: DaliSpark Reader; Table1 (Avro); HDFS
  18. Data Format Evolution: DaliSpark Reader; Table1 (Union view); Table1_Avro, Table1_ORC; HDFS
  19. Change Storage: DaliSpark Reader; Table1 (Union view); Table1_Avro, Table1_ORC; Blobstore
  20. One View for All: DaliSpark Reader; Table1 (Union view); Table1_Avro, Table1_ORC; Blobstore
  21. Push Down Logic: Mobile Events, Desktop Events; DaliSpark Reader; Union View, Filter Events
  22. Views as Code Sharing: Mobile Events, Desktop Events; DaliSpark Reader; Union View, Filter Events; Flatten Nested Data; DaliSpark Reader
  23. Data Agility: • Why views? • What problems does it solve? • Technology
  24. Sample Dali Dataset: CREATE VIEW profile_flattened TBLPROPERTIES ( 'functions' = 'get_profile_section:isb.GetProfileSections', 'dependencies' = 'com.linkedin.dali-udfs:get-profile-sections:0.0.5') AS SELECT get_profile_section(...) FROM prod_identity.profile;
  25. Logical Independence: SQL + UDFs
  26. Logical Independence: SQL + UDFs; Relational Algebra Intermediate Representation (operator tree: ⋈ σ T ⋈ R S)
  27. Transport UDFs: Single UDF for All Engines
  28. • Portability • Schema Evolution • Privacy compliance (GDPR) • Materialized Views (operator tree: ⋈ σ T ⋈ R S)
  29. Future: Materialized Views. • Intelligent Materialization • Run-time query rewrite to use materialized views. Declarative data processing pipelines
  30. Dali Views: Insulate applications from storage and structure of data • Public API - private implementation • Enable evolution • Focus on business logic
  31. Contributors
  32. Thank you
  33. Backup
  34. Tables, Views, and Dali Views. Table: Path, Format, Schema, Partitioning Scheme, … View: f(table(s) | view(s)). Dali View: f(table(s) | view(s)), UDFs, Dependencies
  35. Transport UDFs: Cross-platform UDF API. Simple: Only what is needed to define logic. No boilerplate code. Feature-rich: Declarative type signatures with generics. Nullable arguments. API-level HDFS support. High-level user-friendly data types. Translatable: Can run on multiple platforms. Code specific to platform is auto-generated. Performant: Direct access to native platform data.
  36. Transport UDF Code Generation: User-defined Transport UDF; Transport Gradle Plugin (Code Analysis – Metadata Generation); Autogenerated Engine Wrappers (Presto, Hive, Spark, …); Autogenerated UDF JARs (Presto, Hive, Spark, …)
  37. DaliSpark: DataFrame API. Dali Catalog; DaliSpark.createDataFrame(…); operator tree: ⋈ σ T ⋈ R S. Steps: 1. registerBaseTables(R, S, T) 2. registerUDFs(…) 3. generateSparkSQL(…) 4. spark.sql(sqlStmt)
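
The four steps on the final slide (register base tables, register UDFs, generate SQL from the relational algebra tree, run the SQL) can be walked through with a toy sketch. This is not the DaliSpark implementation: Python's built-in `sqlite3` stands in for Spark so the example runs anywhere, and `Scan`, `Filter`, `Join`, `generate_sql`, `create_data_frame`, the sample tables, and the `has_text` UDF are all hypothetical. The tree shape, a join over a filtered join of R and S with T, is one plausible reading of the flattened diagram "⋈ σ T ⋈ R S".

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: "Plan"
    predicate: str  # SQL boolean expression, may call a registered UDF


@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: "Plan"
    key: str  # column shared by both sides


Plan = Union[Scan, Filter, Join]


def register_base_tables(conn, tables: Dict[str, Tuple[Sequence[str], List[tuple]]]) -> None:
    """Step 1: make the base tables (R, S, T) visible to the engine."""
    for name, (columns, rows) in tables.items():
        conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)


def register_udfs(conn, udfs: Dict[str, Callable]) -> None:
    """Step 2: register the view's UDFs (single-argument only in this sketch)."""
    for name, fn in udfs.items():
        conn.create_function(name, 1, fn)


def generate_sql(plan: Plan) -> str:
    """Step 3: translate the relational algebra tree into a SQL string."""
    if isinstance(plan, Scan):
        return f"SELECT * FROM {plan.table}"
    if isinstance(plan, Filter):
        return f"SELECT * FROM ({generate_sql(plan.child)}) AS t WHERE {plan.predicate}"
    if isinstance(plan, Join):
        return (f"SELECT * FROM ({generate_sql(plan.left)}) AS lhs "
                f"JOIN ({generate_sql(plan.right)}) AS rhs USING ({plan.key})")
    raise TypeError(f"unsupported plan node: {plan!r}")


def create_data_frame(conn, tables, udfs, plan: Plan) -> List[tuple]:
    """Hypothetical analogue of DaliSpark.createDataFrame(…)."""
    register_base_tables(conn, tables)
    register_udfs(conn, udfs)
    sql_stmt = generate_sql(plan)
    return conn.execute(sql_stmt).fetchall()  # Step 4: run the SQL


# Sample inputs (invented for illustration).
TABLES = {
    "members": (("id", "name"), [(1, "Ada"), (2, "Linus"), (3, "Grace")]),
    "profiles": (("id", "headline"), [(1, " Engineer "), (2, ""), (3, "Admiral")]),
    "regions": (("id", "region"), [(1, "EU"), (2, "EU"), (3, "US")]),
}
UDFS = {"has_text": lambda s: int(bool(s and s.strip()))}
PLAN = Join(
    Filter(Join(Scan("members"), Scan("profiles"), "id"), "has_text(headline) = 1"),
    Scan("regions"),
    "id",
)
```

Calling `create_data_frame(sqlite3.connect(":memory:"), TABLES, UDFS, PLAN)` runs all four steps in order. Keeping the view as a tree and generating SQL only at the last moment is what would let the same definition target a different engine by swapping the generator.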
