These slides describe LinkedIn efforts in building a view virtualization layer that enables compute engines to access Hive views, reason about them semantically, and optionally rewrite them before presenting to various compute engines such as Spark, Hive, and Presto. Namely, we describe two frameworks: Coral for reasoning about views using their logical plans, and Transport UDFs for reasoning about UDFs.
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse
1. Coral & Transport UDFs
Building Blocks of a Postmodern Data Warehouse
Walaa Eldin Moustafa
Staff Software Engineer @
2. The lifecycle of a data application at LinkedIn [Simplified]
HDFS
Dali Tables
Dali Views
3. The lifecycle of a data application at LinkedIn [Simplified]
HDFS
Dali Tables
Dali Views
4. The lifecycle of a data application at LinkedIn [Simplified]
HDFS
Dali Tables
Dali Views
5. Why do we like catalogs and tables
Location independence
Format independence
Statistics
Atomicity
SQL and compute engines
Data availability and completeness
PII and data governance
Type system
6. Why do we like catalogs and tables
Location independence
Format independence
Statistics
Atomicity
SQL and compute engines
Data availability and completeness
PII and data governance
Type system
7. Why do we like views
Data cleaning
Data transformation
Data abstraction ML Features
Data staging
Data reuse
10. Dali Views
SELECT my_udf(c)
FROM R JOIN S JOIN T
WHERE date > today() - 5
AND date <= today()
TBLPROPERTIES(
'functions' = 'my_udf:
com.linkedin.MyUDF',
'dependencies' =
'group:artifact:0.0.1')
SQL + UDFs
View.SQL Functions
.gradle
Code Review
Test
Publish
Deploy
12. Coral
SELECT my_udf(c)
FROM R JOIN S JOIN T
WHERE date > today() - 5
AND date <= today()
TBLPROPERTIES(
'functions' = 'my_udf',
'dependencies' =
'group:artifact:0.0.1')
⋈
σ T
⋈
R S
Coral IR
coral-hive
16. Coral-Presto
⋈
σ T
⋈
R S
Coral IR
⋈
σ T
⋈
R S
Presto Rel
SELECT presto_udf(c + 1)
FROM R JOIN S JOIN T
WHERE date > today() - 5
AND date <= today()
17. Coral-Pig
⋈
σ T
⋈
R S
Coral IR
⋈
σ T
⋈
R S
Pig Rel
R = LOAD "R"
S = LOAD "S"
T = LOAD "T"
RS = R JOIN S
RSF = FILTER RS BY ...
RSFT = RSF JOIN T
18. Presto on Spark, or Spark on Presto?
⋈
σ T
⋈
R S
Coral IR
identity
coral-spark
coral-presto
coral-pig
DataRealm
19. Presto on Spark, or Spark on Presto?
⋈
σ T
⋈
R S
Coral IR
identity
coral-spark
coral-presto
coral-pig
DataRealm
Both, and more!
21. Flexible Language Interface
GraphQL Pig Latin Dataflow Datalog
user {
name
address{}
}
A = LOAD
B = LOAD
C = A
JOIN B
d = load()
.filter()
.map()
.count()
TwoHop
(X,Y) :-
E(X,Z),
E(Z,Y).
22. Flexible Language Interface
GraphQL Pig Latin Dataflow Datalog
user {
name
address{}
}
A = LOAD
B = LOAD
C = A
JOIN B
d = load()
.filter()
.map()
.count()
TwoHop
(X,Y) :-
E(X,Z),
E(Z,Y).
DataRealm
23. Flexible Language Interface
GraphQL Pig Latin Dataflow Datalog
user {
name
address{}
}
A = LOAD
B = LOAD
C = A
JOIN B
d = load()
.filter()
.map()
.count()
TwoHop
(X,Y) :-
E(X,Z),
E(Z,Y).
W. E. Moustafa, et al. Datalography: Scaling Datalog Graph
Analytics on Graph Processing Systems. In IEEE BigData 2016.
32. IR
• SQL has pretty well-understood IR: Relational
Algebra
• Standard Operators
• Scan, Filter, Project, Join, Group By, etc
• UDFs
• Opaque
• Use imperative language
• Not portable or translatable
33. UDF Denormalization
Multiple versions of
the same UDF. Not
clear which is the
source of truth.
Duplication
Duplicate
implementations can
diverge causing data
inconsistency
Inconsistency
Developers need to
learn multiple APIs,
implement same
logic multiple times.
Low
Productivity
In some cases, use
tuple conversion
adapters to enable
portability.
Low
Performance
50. UDF APIs
• API Complexity
• APIs expose low-level details of engines
• Data types may not intuitvely map to SQL type-system
• API Disparity
• APIs differ in what to expect from developer
• APIs differ in features they can provide
60. Conclusions
Only implement what
is needed to define
logic. No boilerplate
code.
Simple
Declarative type
signatures with
generics.
getRequiredFiles()
support.
Feature-rich
Can run on multiple
platforms.
Code specific to platform
is auto-generated.
Translatable
Direct access to
native platform data.
Performant
Transport UDFs
API
61. How to Contribute?
github.com/linkedin/transport transport-udfs@googlegroups.com
• New platform support
• New Transport UDFs
• Machine Learning UDFs
• Spatial UDFs
• New UDF type support
• Aggregate UDFs (UDAF)
• Table Functions (UDTF)