Composable Data Processing
Shone Sadler & Dilip Biswal
Agenda
The Why
Background on the problem(s) that drove our need for Composable Data Processing (CDP).
The What
High-level walk-through of our CDP design.
The How
Challenges in achieving CDP with Spark.
The Results
The impact of CDP so far and where we are headed.
Adobe Experience Platform (AEP) Zen Statement
Diagram layers: DATA, SEMANTICS & CONTROL, INSIGHTS – ML & QUERY, ACTION
Data sources: POS, CRM, Product Usage, Mktg Automate, IoT, Geo-Location, Commerce
Centralize and standardize customer data and content across the enterprise – powering 360° customer profiles, enabling data science, and data governance to drive real-time personalized experiences
Adobe Experience Cloud Evolution
Data Landing (aka Siphon)
▪ 1M Batches per Day
▪ 13 Terabytes Per Day
▪ 32 Billion Events Per Day
Diagram: Producers (Customers, Solutions, 3rd Parties) → Siphon → Data Lake
Siphon’s Cross Cutting Features
▪ Transformation
▪ Validation
▪ Partitioning
▪ Compaction
▪ Writing with Exactly Once
▪ Lineage Tracking
Diagram components: Producers, Queue, Siphon, Bulkhead1, Bulkhead2, Supervisor, Catalog, Streaming Ingest, Batch Ingest, Data Lake
Siphon’s Data Processing (aka Ingest Pipeline)
Diagram: Producers → Siphon (Ingest Pipeline) → Data Lake
Pipeline stages: Parse, Convert, Validate, Report, Write Data, Write Errors
Engineering Bottleneck
Chart axes: Features vs. Time; series: Cross-Cutting, Data Processing
Option A: Path of Least Resistance
Diagram: Input → Siphon + Feature X + Feature Y + … → Output
▪ Deprioritize Hardening
▪ Overhead due to Context Switching
▪ Tendency towards Spaghetti code
▪ Increasingly difficult to test over time
▪ Increasingly difficult to maintain over time
Option B: “Delegate” the Problem
Diagram: Input → Siphon → Output; Feature X by Service X → Output X; Feature Y by Service Y → Output Y; Feature … by … → Output …
▪ Lack of Reuse
▪ Lack of Consistency
▪ Complex to Test E2E
▪ Complex to Monitor E2E
▪ Complex to maintain over time
▪ Increased Latency
▪ COGS not tenable
Option C: Composable Data Processing
Diagram: Input → Siphon (Feature X by Team X, Feature Y by Team Y, Feature … by Team …) → Output
▪ Scalable Engineering
▪ Modularized Design & Code
▪ Clear Separation of Responsibilities
▪ Easier to Test
▪ Easier to Maintain
▪ Maximizes Re-use
▪ Minimizes Complexity
▪ Minimizes Latency
▪ Minimizes COGS
The What
Goal
▪ Implement a framework that enables different teams to extend Siphon’s data ingestion pipeline.
▪ Framework must be:
  ▪ Efficient
  ▪ Modular
  ▪ Pluggable
  ▪ Composable
  ▪ Supportable
Modularizing the Pipeline
{"id": "1", "name": "Jared Dunn", "bday": "1988-11-13" , "level": "bronze"}
{"id": "2", "name": "Russ Hannerman", "bday": "1972-05-20"
{"id": "3", "name": "Monica Hall", "bday": "1985-02", "level": "silver"}
{"id": "4", "Name": "Dinesh", "bday": "1985-01-05", "level": "blah"}
JSON Schema
Field         Type    Constraint
Id            String
firstName     String
lastName      String
birthDate     Date
rewardsLevel  String  Enum [bronze, silver, gold]
1. Parsing
Input:
{"id": "1", "name": "Jared Dunn", "bday": "1988-11-13" , "level": "bronze"}
{"id": "2", "name": "Russ Hannerman", "bday": "1972-05-20"
{"id": "3", "name": "Monica Hall", "bday": "1985-02", "level": "silver"}
{"id": "4", "Name": "Dinesh", "bday": "1985-01-05", "level": "blah"}
Pass:
Id  Name         bday        level
1   Jared Dunn   1988-11-13  bronze
3   Monica Hall  1985-02     silver
4   Dinesh       1985-01-05  blah
Fail:
Id  _errors
2   [{"code":"101","message":"Missing closing bracket."}]
2. Conversion
Input:
Id  Name         bday        level
1   Jared Dunn   1988-11-13  bronze
3   Monica Hall  1985-02     silver
4   Dinesh       1985-01-05  blah
Pass:
Id  First Name  Last Name  birthDate   rewardsLevel
1   Jared       Dunn       1988-11-13  bronze
4   Dinesh                 1985-01-05  blah
Fail:
Id  _errors
3   [{"code":"355", "message":"Invalid Date", "column":"bday"}]
3. Validation
Input:
Id  First Name  Last Name  birthDate   rewardsLevel
1   Jared       Dunn       1988-11-13  bronze
4   Dinesh                 1985-01-05  blah
Pass:
Id  First Name  Last Name  birthDate   rewardsLevel
1   Jared       Dunn       1988-11-13  bronze
Fail:
Id  _errors
4   [{"code":"401","message":"Required value","column":"lastName"}, {"code":"411","message":"Invalid enum value: blah, must be one of bronze|silver|gold.", "column":"rewardsLevel"}]
4. Persisting the Good
Id  First Name  Last Name  birthDate   rewardsLevel
1   Jared       Dunn       1988-11-13  bronze
Written to: Data Lake
5. Quarantining the Bad
Failed:
Id  _errors
2   [{"code":"101","message":"Missing closing bracket."}]
3   [{"code":"355", "message":"Invalid Date", "column":"bday"}]
4   [{"code":"401","message":"Required value","column":"lastName"}, {"code":"411","message":"Invalid enum value: blah, must be one of bronze|silver|gold.", "column":"rewardsLevel"}]
Parser Output:
Id  Name         bday        level
1   Jared Dunn   1988-11-13  bronze
3   Monica Hall  1985-02     silver
4   Dinesh       1985-01-05  blah
Join (Failed, Parser Output) → Quarantine
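The five steps above can be sketched end to end in plain Python. This is an illustrative stand-in rather than Siphon's implementation: the function names, the use of the line number as the record id, the lower-casing of keys, and the parse-error message are all assumptions, and the quarantine step joins errors with the raw input line instead of the parser output so that the unparseable record 2 is also covered. The error codes are the ones shown on the slides.

```python
import json
from datetime import datetime

REWARD_LEVELS = {"bronze", "silver", "gold"}

RAW = [
    '{"id": "1", "name": "Jared Dunn", "bday": "1988-11-13" , "level": "bronze"}',
    '{"id": "2", "name": "Russ Hannerman", "bday": "1972-05-20"',
    '{"id": "3", "name": "Monica Hall", "bday": "1985-02", "level": "silver"}',
    '{"id": "4", "Name": "Dinesh", "bday": "1985-01-05", "level": "blah"}',
]

def parse(lines):
    """Step 1: split lines into parsed rows and parse errors (keyed by line number)."""
    good, errors = {}, {}
    for row_id, line in enumerate(lines, start=1):
        try:
            # Assumption: keys are matched case-insensitively ("Name" -> "name").
            good[row_id] = {k.lower(): v for k, v in json.loads(line).items()}
        except json.JSONDecodeError:
            errors[row_id] = [{"code": "101", "message": "Unparseable record"}]
    return good, errors

def convert(rows):
    """Step 2: map source fields to the target schema; bad dates fail here."""
    good, errors = {}, {}
    for row_id, r in rows.items():
        try:
            birth = datetime.strptime(r["bday"], "%Y-%m-%d").date()
        except ValueError:
            errors[row_id] = [{"code": "355", "message": "Invalid Date", "column": "bday"}]
            continue
        first, _, last = r["name"].partition(" ")
        good[row_id] = {"Id": r["id"], "firstName": first, "lastName": last or None,
                        "birthDate": birth, "rewardsLevel": r.get("level")}
    return good, errors

def validate(rows):
    """Step 3: collect every constraint violation for each row."""
    good, errors = {}, {}
    for row_id, r in rows.items():
        found = []
        if not r["lastName"]:
            found.append({"code": "401", "message": "Required value", "column": "lastName"})
        if r["rewardsLevel"] not in REWARD_LEVELS:
            found.append({"code": "411", "column": "rewardsLevel",
                          "message": f"Invalid enum value: {r['rewardsLevel']}"})
        if found:
            errors[row_id] = found
        else:
            good[row_id] = r
    return good, errors

parsed, parse_errors = parse(RAW)
converted, convert_errors = convert(parsed)
data_lake, validate_errors = validate(converted)            # step 4: persist the good
all_errors = {**parse_errors, **convert_errors, **validate_errors}
quarantine = {rid: {"input": RAW[rid - 1], "_errors": errs}  # step 5: quarantine the bad
              for rid, errs in sorted(all_errors.items())}
```

Each stage returns a pass set and a fail set, which is what lets the stages be developed and tested independently.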
Weaving It All Together
Diagram: Siphon Plugin Runtime: Parser → Converter → Validator → Data Sink / Error Sink
Parser, Converter, and Validator each emit Errors + Data; the Data Sink receives Data and the Error Sink receives Errors.
The How
Challenges
▪ DSL
▪ APIs
▪ Parsing errors
▪ Conversion/Validation errors
▪ Error consolidation
▪ Error trapping using Custom Expression
▪ Externalization of errors
Domain Specific Language
{
"parser": "csv",
"converters": [
"mapper"
],
"validators": [
"isRequiredCheck",
"enumCheck",
"isIdentityCheck"
],
"dataSink": "dataLake",
"errorSink": "quarantine"
}
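A minimal sketch of how such a DSL document might be loaded and checked against a set of known plugin names. The registry, the `load_pipeline` function, and the extra `json` parser entry are hypothetical; the slides do not show how SIP resolves plugin names.

```python
import json

PIPELINE_DSL = """
{
  "parser": "csv",
  "converters": ["mapper"],
  "validators": ["isRequiredCheck", "enumCheck", "isIdentityCheck"],
  "dataSink": "dataLake",
  "errorSink": "quarantine"
}
"""

# Hypothetical registry of plugin names known to the runtime, per pipeline slot.
REGISTRY = {
    "parser": {"csv", "json"},
    "converters": {"mapper"},
    "validators": {"isRequiredCheck", "enumCheck", "isIdentityCheck"},
    "dataSink": {"dataLake"},
    "errorSink": {"quarantine"},
}

STAGE_ORDER = ("parser", "converters", "validators", "dataSink", "errorSink")

def load_pipeline(dsl_text):
    """Return an ordered list of (slot, plugin_name) pairs, rejecting unknown names."""
    spec = json.loads(dsl_text)
    plan = []
    for slot in STAGE_ORDER:
        value = spec[slot]
        names = value if isinstance(value, list) else [value]
        for name in names:
            if name not in REGISTRY[slot]:
                raise ValueError(f"Unknown {slot} plugin: {name}")
            plan.append((slot, name))
    return plan

plan = load_pipeline(PIPELINE_DSL)
```

Keeping the configuration declarative means a team can add a validator by registering a name and listing it in the DSL, without touching the runtime.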
SIP Runtime
Diagram: Parser → Converters → Validators → Data Sink / Error Sink
Parser, Converters, and Validators each emit Errors + Data; the Data Sink receives Data and the Error Sink receives Errors.
Converter Interface
ConvertResult Interface
Validator Interface
ValidateResult Interface
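The interface definitions themselves are not reproduced in this transcript. As a rough illustration only, the four interfaces could look like the following in Python; every name, field, and signature here is an assumption, chosen to match the later description that a converter returns both good and error records while a validator returns only error records.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol

Row = Dict[str, object]

@dataclass
class RowError:
    row_id: str
    code: str
    message: str
    column: Optional[str] = None

@dataclass
class ConvertResult:
    data: List[Row] = field(default_factory=list)         # successfully converted rows
    errors: List[RowError] = field(default_factory=list)  # rows that failed conversion

@dataclass
class ValidateResult:
    errors: List[RowError] = field(default_factory=list)  # validators only report errors

class Converter(Protocol):
    def convert(self, rows: List[Row]) -> ConvertResult: ...

class Validator(Protocol):
    def validate(self, rows: List[Row]) -> ValidateResult: ...

class EnumCheck:
    """Hypothetical validator plugin: flags values outside an allowed set."""

    def __init__(self, column: str, allowed: set):
        self.column = column
        self.allowed = allowed

    def validate(self, rows: List[Row]) -> ValidateResult:
        result = ValidateResult()
        for row in rows:
            value = row.get(self.column)
            if value not in self.allowed:
                result.errors.append(
                    RowError(str(row["Id"]), "411", f"Invalid enum value: {value}", self.column))
        return result

check = EnumCheck("rewardsLevel", {"bronze", "silver", "gold"})
result = check.validate([
    {"Id": "1", "rewardsLevel": "bronze"},
    {"Id": "4", "rewardsLevel": "blah"},
])
```

`EnumCheck` satisfies the `Validator` protocol structurally, so a runtime typed against `Validator` can accept it without inheritance.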
Parsing Errors
▪ Processed only once by SIP at the beginning.
▪ Only applicable for file sources like CSV and JSON.
▪ Relies on Spark to capture the parsing errors.
  ▪ Pass on appropriate read options
    ▪ CSV
      ▪ Mode = PERMISSIVE
      ▪ columnNameOfCorruptRecord = "_corrupt_record"
▪ Parsing error records are captured
  ▪ By applying predicate on _corrupt_record column.
▪ Good records are passed to plugins for further processing.
Parsing Errors
JSON: p.json
{ "name": "John", "age": 30 }
{ "name": "Mike", "age": 20
Second record: no record terminator.
spark.read.json("p.json").show(false)
CSV: p.csv
name,age
John,30
Mike,20,20
Second record: does not conform to schema.
spark.read.schema(csvSchema).options(csvOptions).csv("p.csv").show(false)
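To make the idea concrete without a Spark cluster, the sketch below emulates the permissive-read pattern in plain Python: malformed records are kept, their raw text is placed in a `_corrupt_record` field, and a predicate on that field separates good rows from bad. The helper names are hypothetical, and the handling of partially readable rows is simplified, so this is not Spark's exact behavior.

```python
import csv
import io
import json

CORRUPT = "_corrupt_record"

def read_json_permissive(lines):
    """Parse JSON lines; unparseable lines are kept with their raw text in CORRUPT."""
    rows = []
    for line in lines:
        try:
            row = dict(json.loads(line))
            row[CORRUPT] = None
        except (ValueError, TypeError):
            row = {CORRUPT: line}
        rows.append(row)
    return rows

def read_csv_permissive(text, columns):
    """Parse CSV text; rows whose field count differs from the schema are marked corrupt."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    rows = []
    for fields in reader:
        if len(fields) == len(columns):
            row = dict(zip(columns, fields))
            row[CORRUPT] = None
        else:
            row = {c: None for c in columns}
            row[CORRUPT] = ",".join(fields)
        rows.append(row)
    return rows

def split_records(rows):
    """Apply the predicate on the corrupt-record column."""
    good = [r for r in rows if r[CORRUPT] is None]
    bad = [r for r in rows if r[CORRUPT] is not None]
    return good, bad

json_good, json_bad = split_records(read_json_permissive([
    '{ "name": "John", "age": 30 }',
    '{ "name": "Mike", "age": 20',
]))
csv_good, csv_bad = split_records(
    read_csv_permissive("name,age\nJohn,30\nMike,20,20\n", ["name", "age"]))
```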
Conversion/Validation errors
▪ SIP invokes the plugins in sequence
▪ Converter Plugin
  ▪ Both good and error records are collected.
  ▪ The good records are passed to the next plugin in sequence.
▪ Validate Plugin
  ▪ Returns the error records.
  ▪ Process an error record multiple times.
    ▪ To capture all possible errors for a given record.
    ▪ For example, both plugin-1 and plugin-2 may find different errors for one or more columns of the same record.
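The sequencing rules above can be sketched as a small driver loop. This is a plain-Python illustration under assumed signatures (a converter returns good rows plus errors, a validator returns errors only); the plugin names, sample rows, and error codes are hypothetical. Note that every validator sees every converted row, including rows an earlier validator already flagged, so all errors for a record are captured.

```python
def to_int_age(rows):
    """Converter: good rows move on, failed rows are reported as errors."""
    good, errors = {}, {}
    for rid, row in rows.items():
        try:
            good[rid] = dict(row, age=int(row["age"]))
        except ValueError:
            errors[rid] = [("age", "ERR-CONVERT", "age is not an integer")]
    return good, errors

def name_required(rows):
    """Validator (plugin-1): returns error records only."""
    return {rid: [("name", "ERR-REQUIRED", "name is required")]
            for rid, row in rows.items() if not row["name"]}

def age_positive(rows):
    """Validator (plugin-2): returns error records only."""
    return {rid: [("age", "ERR-RANGE", "age must be > 0")]
            for rid, row in rows.items() if row["age"] <= 0}

def run_plugins(rows, converters, validators):
    errors = {}
    for convert in converters:
        rows, failed = convert(rows)  # only good rows reach the next plugin
        for rid, errs in failed.items():
            errors.setdefault(rid, []).extend(errs)
    for validate in validators:
        # Each validator runs over all converted rows, so one record can
        # accumulate errors from several plugins.
        for rid, errs in validate(rows).items():
            errors.setdefault(rid, []).extend(errs)
    good = {rid: row for rid, row in rows.items() if rid not in errors}
    return good, errors

good, errors = run_plugins(
    {1: {"name": "John", "age": "40"},
     2: {"name": None, "age": "-3"},
     3: {"name": "Ann", "age": "x"}},
    converters=[to_int_age],
    validators=[name_required, age_positive],
)
```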
Error consolidation (contd ..)
INPUT_DATA → DATA_WITH_ROW_ID (row_id added with monotonically_increasing_id())
first_name  last_name  age  row_id
John        Vanau      40   1
Jack        NULL       24   2
Michael     Shankar    -32  3
MAPPINGS
Mapping rule             Target_column
first_name || last_name  full_name
Applying mapping rule
success (successful mapping):
full_name        age  row_id
John Vanau       40   1
Michael Shankar  -32  3
error:
row_id  _errors
2       [[last_name, ERR-100, "Field `last_name` can not be null"]]
TARGET_SCHEMA
column_name  data_type  constraint
full_name    String     None
age          Short      age > 0
Applying target schema
error:
row_id  _errors
3       [[age, ERR-200, "Field `age` cannot be < 0"]]
Union (mapping errors and target schema errors)
row_id  _errors
2       [[last_name, ERR-100, "Field `last_name` can not be null"]]
3       [[age, ERR-200, "Field `age` cannot be < 0"]]
Join (DATA_WITH_ROW_ID with the union of errors) → final error
first_name  last_name  age  _errors
Jack        NULL       24   [[last_name, ERR-100, "Field `last_name` can not be null"]]
Michael     Shankar    -32  [[age, ERR-200, "Field `age` cannot be < 0"]]
Anti-Join (successful mapping with the union of errors) → final success
full_name   age
John Vanau  40
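The consolidation flow can be traced in plain Python on the same three rows. This is a sketch of the data flow only, not the Spark implementation: `enumerate` stands in for `monotonically_increasing_id()` (whose real ids are unique but not guaranteed consecutive), dictionaries keyed by `row_id` stand in for DataFrames, and the function names are hypothetical.

```python
INPUT_DATA = [
    {"first_name": "John", "last_name": "Vanau", "age": 40},
    {"first_name": "Jack", "last_name": None, "age": 24},
    {"first_name": "Michael", "last_name": "Shankar", "age": -32},
]

# Stand-in for monotonically_increasing_id(): tag every input row with a row_id.
data_with_row_id = [dict(row, row_id=i) for i, row in enumerate(INPUT_DATA, start=1)]

def apply_mapping(rows):
    """Mapping rule: first_name || last_name -> full_name (NULL input is an error)."""
    ok, errors = [], {}
    for r in rows:
        if r["last_name"] is None:
            errors[r["row_id"]] = [("last_name", "ERR-100", "Field `last_name` can not be null")]
        else:
            ok.append({"full_name": f"{r['first_name']} {r['last_name']}",
                       "age": r["age"], "row_id": r["row_id"]})
    return ok, errors

def apply_target_schema(rows):
    """Target schema constraint: age > 0."""
    return {r["row_id"]: [("age", "ERR-200", "Field `age` cannot be < 0")]
            for r in rows if not r["age"] > 0}

mapped, mapping_errors = apply_mapping(data_with_row_id)
schema_errors = apply_target_schema(mapped)

# Union: merge the error sets produced by each step, keyed by row_id.
all_errors = {}
for step_errors in (mapping_errors, schema_errors):
    for row_id, errs in step_errors.items():
        all_errors.setdefault(row_id, []).extend(errs)

# Join: original rows that have errors, with the errors attached.
final_error = [
    dict({k: v for k, v in r.items() if k != "row_id"}, _errors=all_errors[r["row_id"]])
    for r in data_with_row_id if r["row_id"] in all_errors
]

# Anti-join: mapped rows that have no errors at all.
final_success = [
    {"full_name": r["full_name"], "age": r["age"]}
    for r in mapped if r["row_id"] not in all_errors
]
```

The row id is what makes the final join possible: errors found after the mapping step refer to `full_name`, yet the quarantined record still carries the original `first_name` and `last_name` columns.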
Error Trapping
▪ Most existing conversions and validations use UDFs.
▪ Nested type conversions use nested UDFs.
▪ Currently not possible to capture errors from nested UDFs.
▪ Custom expression used to trap errors.
▪ Captures input column value, error code and error text in case of error.
▪ Captures output column value upon successful conversion/validation.
Custom Expression
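The custom expression itself is not reproduced in this transcript. As a minimal stand-in for the idea in the bullets above, the sketch below wraps a conversion function so that a failure yields the input value, an error code, and the error text, while a success yields the output value. It is plain Python rather than a Spark expression, and the `Trapped` type, `trap` helper, and error code are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Trapped:
    output: Any = None                # set on success
    input: Any = None                 # set on failure: the offending input value
    error_code: Optional[str] = None
    error_text: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error_code is None

def trap(fn: Callable[[Any], Any], error_code: str) -> Callable[[Any], Trapped]:
    """Wrap fn so exceptions become data instead of failing the whole job."""
    def wrapped(value: Any) -> Trapped:
        try:
            return Trapped(output=fn(value))
        except Exception as exc:
            return Trapped(input=value, error_code=error_code, error_text=str(exc))
    return wrapped

# Hypothetical error code for a failed integer conversion.
safe_int = trap(int, "ERR-300")

# Nested use: trap each element of an array-valued column independently.
results = [safe_int(v) for v in ["40", "abc", "7"]]
outputs = [r.output for r in results if r.ok]
failures = [(r.input, r.error_code) for r in results if not r.ok]
```

Because the wrapper returns a value in both cases, errors raised deep inside a nested conversion surface as ordinary data that the consolidation step can collect.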
Error Trapping – Example
The Results
Benefits
Cross-Cutting Data Processing
▪ Scalable Engineering
▪ Separation of Responsibilities
▪ More Readable Code
▪ More Testable Code
▪ Easier to Maintain
▪ More ETL Features
▪ More Validation Features
▪ More Error Reporting Features
▪ Minimize Latency (from 10 min to 10 sec)
▪ Re-Use
▪ 50% or More Storage Savings
▪ 50% or More Compute Savings
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

Más contenido relacionado

La actualidad más candente

Optimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File PruningOptimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File PruningDatabricks
 
Evolution is Continuous, and so are Big Data and Streaming Pipelines
Evolution is Continuous, and so are Big Data and Streaming PipelinesEvolution is Continuous, and so are Big Data and Streaming Pipelines
Evolution is Continuous, and so are Big Data and Streaming PipelinesDatabricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Databricks
 
Tackling Scaling Challenges of Apache Spark at LinkedIn
Tackling Scaling Challenges of Apache Spark at LinkedInTackling Scaling Challenges of Apache Spark at LinkedIn
Tackling Scaling Challenges of Apache Spark at LinkedInDatabricks
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueDatabricks
 
Automating Federal Aviation Administration’s (FAA) System Wide Information Ma...
Automating Federal Aviation Administration’s (FAA) System Wide Information Ma...Automating Federal Aviation Administration’s (FAA) System Wide Information Ma...
Automating Federal Aviation Administration’s (FAA) System Wide Information Ma...Databricks
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Databricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Frequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsFrequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsDatabricks
 
Building a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodBuilding a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodDatabricks
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleDatabricks
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionDatabricks
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowDatabricks
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDatabricks
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache SparkDatabricks
 
Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkDatabricks
 

La actualidad más candente (20)

Optimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File PruningOptimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File Pruning
 
Evolution is Continuous, and so are Big Data and Streaming Pipelines
Evolution is Continuous, and so are Big Data and Streaming PipelinesEvolution is Continuous, and so are Big Data and Streaming Pipelines
Evolution is Continuous, and so are Big Data and Streaming Pipelines
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
Tackling Scaling Challenges of Apache Spark at LinkedIn
Tackling Scaling Challenges of Apache Spark at LinkedInTackling Scaling Challenges of Apache Spark at LinkedIn
Tackling Scaling Challenges of Apache Spark at LinkedIn
 
Superworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and FugueSuperworkflow of Graph Neural Networks with K8S and Fugue
Superworkflow of Graph Neural Networks with K8S and Fugue
 
Automating Federal Aviation Administration’s (FAA) System Wide Information Ma...
Automating Federal Aviation Administration’s (FAA) System Wide Information Ma...Automating Federal Aviation Administration’s (FAA) System Wide Information Ma...
Automating Federal Aviation Administration’s (FAA) System Wide Information Ma...
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Frequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsFrequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on Embeddings
 
Building a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFoodBuilding a Real-Time Feature Store at iFood
Building a Real-Time Feature Store at iFood
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache Spark
 
Managing ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache SparkManaging ADLS gen2 using Apache Spark
Managing ADLS gen2 using Apache Spark
 

Similar a Composable Data Processing with Apache Spark

HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
 
Making Sense of Schema on Read
Making Sense of Schema on ReadMaking Sense of Schema on Read
Making Sense of Schema on ReadKent Graziano
 
Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Anuj Sahni
 
Gab document db scaling database
Gab   document db scaling databaseGab   document db scaling database
Gab document db scaling databaseMUG Perú
 
MongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data Part 3: ShardingMongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data Part 3: ShardingMongoDB
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
 
The Enterprise Architecture You Always Wanted
The Enterprise Architecture You Always WantedThe Enterprise Architecture You Always Wanted
The Enterprise Architecture You Always WantedThoughtworks
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
MongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDBMongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDBMongoDB
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesData Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesDatabricks
 
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...it-people
 
[Webinar] Introduction to Cypher
[Webinar] Introduction to Cypher[Webinar] Introduction to Cypher
[Webinar] Introduction to CypherNeo4j
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jugGerald Muecke
 
Crafting Evolvable Api Responses
Crafting Evolvable Api ResponsesCrafting Evolvable Api Responses
Crafting Evolvable Api Responsesdarrelmiller71
 
7 DDS Innovations to Improve your Next Distributed System
7 DDS Innovations to Improve your Next Distributed System7 DDS Innovations to Improve your Next Distributed System
7 DDS Innovations to Improve your Next Distributed SystemReal-Time Innovations (RTI)
 
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Codemotion
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for CassandraEdward Capriolo
 

Similar a Composable Data Processing with Apache Spark (20)

MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
Making Sense of Schema on Read
Making Sense of Schema on ReadMaking Sense of Schema on Read
Making Sense of Schema on Read
 
Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0Application development with Oracle NoSQL Database 3.0
Application development with Oracle NoSQL Database 3.0
 
Gab document db scaling database
Gab   document db scaling databaseGab   document db scaling database
Gab document db scaling database
 
MongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data Part 3: ShardingMongoDB for Time Series Data Part 3: Sharding
MongoDB for Time Series Data Part 3: Sharding
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
Os Gottfrid
Os GottfridOs Gottfrid
Os Gottfrid
 
The Enterprise Architecture You Always Wanted
The Enterprise Architecture You Always WantedThe Enterprise Architecture You Always Wanted
The Enterprise Architecture You Always Wanted
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
MongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDBMongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDB
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesData Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive Approaches
 
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
 
[Webinar] Introduction to Cypher
[Webinar] Introduction to Cypher[Webinar] Introduction to Cypher
[Webinar] Introduction to Cypher
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
 
Crafting Evolvable Api Responses
Crafting Evolvable Api ResponsesCrafting Evolvable Api Responses
Crafting Evolvable Api Responses
 
7 DDS Innovations to Improve your Next Distributed System
7 DDS Innovations to Improve your Next Distributed System7 DDS Innovations to Improve your Next Distributed System
7 DDS Innovations to Improve your Next Distributed System
 
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for Cassandra
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Composable Data Processing with Apache Spark

  • 2. Composable Data Processing Shone Sadler & Dilip Biswal
  • 3. Agenda The Why Background on the problem(s) that drove our need for Composable Data processing (CDP). The What High level walk-through of our CDP design. The How Discuss challenges to achieve CDP with Spark. The Results What has been the impact of CDP and where are we headed.
  • 4. Adobe Experience Platform (AEP) Zen Statement 4 INSIGHTS – ML & QUERY ACTION POS CRM Product Usage Mktg Automate IoT Geo- Location Commerce DATA Centralize and standardize customer data and content across the enterprise – powering 360° customer profiles, enabling data science, and data governance to drive real-time personalized experiences SEMANTICS & CONTROL
  • 6. Data Landing (aka Siphon) ▪ 1M Batches per Day ▪ 13 Terabytes Per Day ▪ 32 Billion Events Per Day Customers Siphon Data Lake Solutions 3rd Parties Producers ▪ Transformation ▪ Validation ▪ Partitioning ▪ Compaction ▪ Writing with Exactly Once ▪ Lineage Tracking
  • 7. Siphon’s Cross Cutting Features Producers Data Lake Queue Siphon Siphon Siphon Bulkhead1 Siphon Bulkhead2 Supervisor Catalog Streaming Ingest Batch Ingest
  • 8. Siphon’s Data Processing (aka Ingest Pipeline) Ingest Pipeline Producers Data Lake Siphon Parse Convert Validate Report Write Data Write Errors
  • 10. Option A: Path of Least Resistance Siphon + Feature X + Feature Y + ….. Input Output ▪ Deprioritize Hardening ▪ Overhead due to Context Switching ▪ Tendency towards Spaghetti code ▪ Increasingly difficult to test over time ▪ Increasingly difficult to maintain over time
  • 11. Option B: “Delegate” the Problem Siphon Input Output Feature X By Service X Output X Feature Y By Service Y Output Y Feature … … Output … ▪ Lack of Reuse ▪ Lack of Consistency ▪ Complex to Test E2E ▪ Complex to Monitor E2E ▪ Complex to maintain over time ▪ Increased Latency ▪ COGS not tenable
  • 12. Option C: Composable Data Processing Siphon Input Output ▪ Scalable Engineering ▪ Modularized Design & Code ▪ Clear Separation of Responsibilities ▪ Easier to Test ▪ Easier to Maintain ▪ Maximizes Re-use ▪ Minimizes Complexity ▪ Minimizes Latency ▪ Minimizes COGS Feature X By Team X Feature Y By Team Y Feature … By Team …
  • 14. Goal ▪ Implement a framework that enables different teams to extend Siphon’s data ingestion pipeline. ▪ Framework must be: ▪ Efficient ▪ Modular ▪ Pluggable ▪ Composable ▪ Supportable
  • 15. Modularizing the Pipeline Input (JSON): {"id": "1", "name": "Jared Dunn", "bday": "1988-11-13", "level": "bronze"} {"id": "2", "name": "Russ Hannerman", "bday": "1972-05-20" {"id": "3", "name": "Monica Hall", "bday": "1985-02", "level": "silver"} {"id": "4", "Name": "Dinesh", "bday": "1985-01-05", "level": "blah"} Schema (Field, Type, Constraint): Id | String; firstName | String; lastName | String; birthDate | Date; rewardsLevel | String | Enum [bronze, silver, gold]
  • 16. 1. Parsing Input: {"id": "1", "name": "Jared Dunn", "bday": "1988-11-13", "level": "bronze"} {"id": "2", "name": "Russ Hannerman", "bday": "1972-05-20" {"id": "3", "name": "Monica Hall", "bday": "1985-02", "level": "silver"} {"id": "4", "Name": "Dinesh", "bday": "1985-01-05", "level": "blah"} Pass (Id, Name, bday, level): 1 | Jared Dunn | 1988-11-13 | bronze; 3 | Monica Hall | 1985-02 | silver; 4 | Dinesh | 1985-01-05 | blah. Fail (Id, _errors): 2 | [{"code":"101","message":"Missing closing bracket."}]
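The pass/fail split on this slide can be sketched in plain Python. This is only an illustration of the idea, not Siphon's Spark-based parser: the `parse_lines` helper, the use of the line number as the failed record's id, and the wording of the error message are assumptions.

```python
import json

# Raw input lines from the slide; the second record is missing its closing brace.
RAW_LINES = [
    '{"id": "1", "name": "Jared Dunn", "bday": "1988-11-13", "level": "bronze"}',
    '{"id": "2", "name": "Russ Hannerman", "bday": "1972-05-20"',
    '{"id": "3", "name": "Monica Hall", "bday": "1985-02", "level": "silver"}',
    '{"id": "4", "Name": "Dinesh", "bday": "1985-01-05", "level": "blah"}',
]


def parse_lines(lines):
    """Split raw lines into parsed records (pass) and error rows (fail)."""
    passed, failed = [], []
    for line_number, line in enumerate(lines, start=1):
        try:
            passed.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Assumption: an unparseable record is identified by its line number,
            # and the error row mimics the slide's `_errors` column.
            failed.append({
                "id": str(line_number),
                "_errors": [{"code": "101", "message": f"Malformed JSON: {exc.msg}"}],
            })
    return passed, failed


passed, failed = parse_lines(RAW_LINES)
print([row["id"] for row in passed])
print(failed)
```

Records 1, 3 and 4 parse even though two of them carry bad values; those problems are left for the later conversion and validation stages.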
  • 17. 2. Conversion Input (Id, Name, bday, level): 1 | Jared Dunn | 1988-11-13 | bronze; 3 | Monica Hall | 1985-02 | silver; 4 | Dinesh | 1985-01-05 | blah. Pass (Id, First Name, Last Name, birthDate, rewardsLevel): 1 | Jared | Dunn | 1988-11-13 | bronze; 4 | Dinesh | (empty) | 1985-01-05 | blah. Fail (Id, _errors): 3 | [{"code":"355","message":"Invalid Date","column":"bday"}]
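A minimal plain-Python sketch of this conversion step follows. The `convert` function, the rule of splitting the name on its first space, and the strict `YYYY-MM-DD` date format are assumptions made for illustration; the deck does not show the actual converter plugin.

```python
from datetime import datetime

# Parser output from the slide (column names normalised to lower case).
PARSED = [
    {"id": "1", "name": "Jared Dunn", "bday": "1988-11-13", "level": "bronze"},
    {"id": "3", "name": "Monica Hall", "bday": "1985-02", "level": "silver"},
    {"id": "4", "name": "Dinesh", "bday": "1985-01-05", "level": "blah"},
]


def convert(row):
    """Return (converted_row, errors); converted_row is None when conversion fails."""
    first_name, _, last_name = row["name"].partition(" ")
    try:
        birth_date = datetime.strptime(row["bday"], "%Y-%m-%d").date()
    except ValueError:
        return None, [{"code": "355", "message": "Invalid Date", "column": "bday"}]
    converted = {
        "id": row["id"],
        "firstName": first_name,
        "lastName": last_name or None,  # a missing last name is caught later by validation
        "birthDate": birth_date,
        "rewardsLevel": row.get("level"),
    }
    return converted, []


results = [convert(row) for row in PARSED]
converted_rows = [row for row, errors in results if not errors]
conversion_errors = [
    {"id": source["id"], "_errors": errors}
    for source, (_, errors) in zip(PARSED, results)
    if errors
]
print(converted_rows)
print(conversion_errors)
```

Conversion only rejects values it cannot convert, so the record for Dinesh passes here despite its missing last name and unknown rewards level.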
  • 18. 3. Validation Input (Id, First Name, Last Name, birthDate, rewardsLevel): 1 | Jared | Dunn | 1988-11-13 | bronze; 4 | Dinesh | (empty) | 1985-01-05 | blah. Pass: 1 | Jared | Dunn | 1988-11-13 | bronze. Fail (Id, _errors): 4 | [{"code":"401","message":"Required value","column":"lastName"}, {"code":"411","message":"Invalid enum value: blah, must be one of bronze|silver|gold.","column":"rewardsLevel"}]
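The validation step collects every error for a record rather than stopping at the first one. A small sketch under assumptions: the check functions and their signatures below are hypothetical, and only the error codes and messages come from the slide.

```python
ALLOWED_LEVELS = ("bronze", "silver", "gold")


def is_required_check(row, column="lastName"):
    """Report error 401 when a required column is missing or empty."""
    if row.get(column) in (None, ""):
        return [{"code": "401", "message": "Required value", "column": column}]
    return []


def enum_check(row, column="rewardsLevel"):
    """Report error 411 when the value is not one of the allowed enum members."""
    value = row.get(column)
    if value not in ALLOWED_LEVELS:
        allowed = "|".join(ALLOWED_LEVELS)
        message = f"Invalid enum value: {value}, must be one of {allowed}."
        return [{"code": "411", "message": message, "column": column}]
    return []


def validate(row, checks=(is_required_check, enum_check)):
    """Run every check so that all errors for the record are captured."""
    errors = []
    for check in checks:
        errors.extend(check(row))
    return errors


jared = {"id": "1", "firstName": "Jared", "lastName": "Dunn", "rewardsLevel": "bronze"}
dinesh = {"id": "4", "firstName": "Dinesh", "lastName": None, "rewardsLevel": "blah"}
print(validate(jared))
print(validate(dinesh))
```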
  • 19. 4. Persisting the Good Written to the Data Lake (Id, First Name, Last Name, birthDate, rewardsLevel): 1 | Jared | Dunn | 1988-11-13 | bronze
  • 20. 5. Quarantining the Bad Failed (Id, _errors): 2 | [{"code":"101","message":"Missing closing bracket."}]; 3 | [{"code":"355","message":"Invalid Date","column":"bday"}]; 4 | [{"code":"401","message":"Required value","column":"lastName"}, {"code":"411","message":"Invalid enum value: blah, must be one of bronze|silver|gold.","column":"rewardsLevel"}]. Parser Output (Id, Name, bday, level): 1 | Jared Dunn | 1988-11-13 | bronze; 3 | Monica Hall | 1985-02 | silver; 4 | Dinesh | 1985-01-05 | blah. Quarantine: join of Failed with Parser Output.
  • 21. Weaving It All Together Plugin Runtime Parser Converter Validator Data Sink Error Sink Errors + Data Errors + Data Errors + Data Data Errors Siphon
  • 23. Challenges ▪ DSL ▪ APIs ▪ Parsing errors ▪ Conversion/Validation errors. ▪ Error consolidation ▪ Error trapping using Custom Expression ▪ Externalization of errors
  • 24. Domain Specific Language { "parser": "csv", "converters": [ "mapper" ], "validators": [ "isRequiredCheck", "enumCheck", "isIdentityCheck" ], "dataSink": "dataLake", "errorSink": "quarantine" } SIP Runtime Parser Converter Validator Data Sink Error Sink Converters Validators Errors + Data Errors + Data Errors + Data Data Errors
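One way to read this DSL is as a list of plugin names that a runtime resolves against registries. The sketch below is an assumption about how that wiring could look in plain Python: the registries, the plugin bodies and `run_pipeline` are hypothetical, the `parser` entry is ignored because the rows are supplied already parsed, and only the configuration keys come from the slide.

```python
import json

CONFIG = json.loads("""
{
  "parser": "csv",
  "converters": ["mapper"],
  "validators": ["isRequiredCheck", "enumCheck", "isIdentityCheck"],
  "dataSink": "dataLake",
  "errorSink": "quarantine"
}
""")


def mapper(row):
    """Hypothetical converter: rename `level` to `rewardsLevel`."""
    out = dict(row)
    out["rewardsLevel"] = out.pop("level", None)
    return out, []


def is_required_check(row):
    return [] if row.get("id") else [{"code": "401", "column": "id"}]


def enum_check(row):
    ok = row.get("rewardsLevel") in ("bronze", "silver", "gold")
    return [] if ok else [{"code": "411", "column": "rewardsLevel"}]


def is_identity_check(row):
    return []  # placeholder: the deck does not describe this check


# Hypothetical registries mapping DSL names to plugin functions.
CONVERTERS = {"mapper": mapper}
VALIDATORS = {
    "isRequiredCheck": is_required_check,
    "enumCheck": enum_check,
    "isIdentityCheck": is_identity_check,
}


def run_pipeline(config, rows):
    """Run converters, then validators, routing each row to the data or error sink."""
    sinks = {config["dataSink"]: [], config["errorSink"]: []}
    for source in rows:
        row, errors = source, []
        for name in config["converters"]:
            row, errors = CONVERTERS[name](row)
            if errors:
                break  # a failed conversion skips the remaining plugins
        if not errors:
            for name in config["validators"]:
                errors.extend(VALIDATORS[name](row))
        if errors:
            sinks[config["errorSink"]].append({"row": source, "_errors": errors})
        else:
            sinks[config["dataSink"]].append(row)
    return sinks


result = run_pipeline(CONFIG, [{"id": "1", "level": "bronze"}, {"id": "4", "level": "blah"}])
print(result)
```

Because plugins are looked up by name, a team can contribute a new converter or validator by registering a function, without touching the runtime loop.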
  • 29. Parsing Errors ▪ Processed only once by SIP at the beginning. ▪ Only applicable to file sources such as CSV and JSON. ▪ Relies on Spark to capture the parsing errors. ▪ Pass on appropriate read options; for CSV: mode = PERMISSIVE, columnNameOfCorruptRecord = "_corrupt_record". ▪ Parsing error records are captured by applying a predicate on the _corrupt_record column. ▪ Good records are passed to plugins for further processing.
  • 30. Parsing Errors JSON (p.json): { "name": "John", "age": 30 } { "name": "Mike", "age": 20 (no record terminator), read with spark.read.json("p.json").show(false). CSV (p.csv): name,age / John,30 / Mike,20,20 (record does not conform to schema), read with spark.read.schema(csvSchema).options(csvOptions).csv("p.csv").show(false).
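The idea of a permissive read with a corrupt-record column can be imitated without Spark. The stand-in below uses only Python's `csv` module; `read_permissive` is a hypothetical helper, and nulling every field of a malformed row is a simplification, not a claim about Spark's exact output.

```python
import csv

CSV_TEXT = "name,age\nJohn,30\nMike,20,20\n"
SCHEMA = ("name", "age")


def read_permissive(text, schema, corrupt_column="_corrupt_record"):
    """Keep every row; rows that do not fit the schema carry their raw text instead."""
    rows = []
    for raw in text.splitlines()[1:]:  # skip the header line
        fields = next(csv.reader([raw]))
        if len(fields) == len(schema):
            row = dict(zip(schema, fields))
            row[corrupt_column] = None
        else:
            row = {name: None for name in schema}
            row[corrupt_column] = raw
        rows.append(row)
    return rows


rows = read_permissive(CSV_TEXT, SCHEMA)
# The predicate on the corrupt-record column separates good rows from parsing errors.
good = [row for row in rows if row["_corrupt_record"] is None]
bad = [row for row in rows if row["_corrupt_record"] is not None]
print(good)
print(bad)
```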
  • 31. Conversion/Validation errors ▪ SIP invokes the plugins in sequence. ▪ Converter plugin: both good and error records are collected; the good records are passed to the next plugin in sequence. ▪ Validator plugin: returns the error records. ▪ An error record may be processed multiple times, to capture all possible errors for a given record. ▪ For example, plugin-1 and plugin-2 may each find different errors in one or more columns of the same record.
  • 32. Error consolidation (contd ..) INPUT_DATA (first_name, last_name, age): John | Vanau | 40; Jack | NULL | 24; Michael | Shankar | -32. DATA_WITH_ROW_ID: the same rows with Row_id 1, 2, 3 added by monotonically_increasing_id(). MAPPINGS (mapping rule, target column): first_name || last_name, full_name. Applying the mapping rule: success (full_name, age, Row_id): John Vanau | 40 | 1; Michael Shankar | -32 | 3; error (row_id, _errors): 2 | [[last_name, ERR-100, "Field `last_name` can not be null"]]. TARGET_SCHEMA (column_name, data_type, constraint): full_name | String | None; age | Short | age > 0. Applying the target schema: error (row_id, _errors): 3 | [[age, ERR-200, "Field `age` cannot be < 0"]]. Union of errors (row_id, _errors): 2 | [[last_name, ERR-100, "Field `last_name` can not be null"]]; 3 | [[age, ERR-200, "Field `age` cannot be < 0"]]. Final error (join of DATA_WITH_ROW_ID with the error union): Jack | NULL | 24 | [[last_name, ERR-100, "Field `last_name` can not be null"]]; Michael | Shankar | -32 | [[age, ERR-200, "Field `age` cannot be < 0"]]. Final success (anti-join of the successful mapping with the error union): John Vanau | 40.
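The consolidation flow on this slide (row ids, per-stage errors, union, join and anti-join) can be traced with plain Python collections. This is a sketch under assumptions: dictionaries keyed by `row_id` stand in for DataFrame joins, `enumerate` stands in for `monotonically_increasing_id()`, and the helper names are hypothetical.

```python
INPUT_DATA = [
    {"first_name": "John", "last_name": "Vanau", "age": 40},
    {"first_name": "Jack", "last_name": None, "age": 24},
    {"first_name": "Michael", "last_name": "Shankar", "age": -32},
]

# Tag every input row with an id so errors from different stages can be matched up.
data_with_row_id = [{**row, "row_id": i} for i, row in enumerate(INPUT_DATA, start=1)]


def apply_mapping(rows):
    """Mapping rule first_name || last_name -> full_name; a null part is an error."""
    mapped, errors = [], {}
    for row in rows:
        if row["last_name"] is None:
            errors[row["row_id"]] = [("last_name", "ERR-100", "Field `last_name` can not be null")]
        else:
            full_name = f'{row["first_name"]} {row["last_name"]}'
            mapped.append({"full_name": full_name, "age": row["age"], "row_id": row["row_id"]})
    return mapped, errors


def apply_target_schema(mapped):
    """Target schema constraint: age > 0."""
    errors = {}
    for row in mapped:
        if not row["age"] > 0:
            errors[row["row_id"]] = [("age", "ERR-200", "Field `age` cannot be < 0")]
    return errors


def union_errors(*error_maps):
    """Union the per-stage errors, grouping them by row_id."""
    merged = {}
    for error_map in error_maps:
        for row_id, errs in error_map.items():
            merged.setdefault(row_id, []).extend(errs)
    return merged


mapped, mapping_errors = apply_mapping(data_with_row_id)
schema_errors = apply_target_schema(mapped)
all_errors = union_errors(mapping_errors, schema_errors)

# Join: original input rows that have at least one error, with their errors attached.
final_error = [
    {**{k: v for k, v in row.items() if k != "row_id"}, "_errors": all_errors[row["row_id"]]}
    for row in data_with_row_id
    if row["row_id"] in all_errors
]
# Anti-join: mapped rows that have no error at any stage.
final_success = [
    {"full_name": row["full_name"], "age": row["age"]}
    for row in mapped
    if row["row_id"] not in all_errors
]
print(final_success)
print(final_error)
```

Note that row 3 maps successfully and only fails at the schema stage, which is why the success set is computed by anti-joining against the full error union rather than against the mapping errors alone.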
  • 33. Error Trapping ▪ Most existing conversions and validations use UDFs. ▪ Nested type conversions use nested UDFs. ▪ It is currently not possible to capture errors from nested UDFs. ▪ A custom expression is used to trap errors. ▪ It captures the input column value, error code and error text in case of error. ▪ It captures the output column value upon successful conversion/validation.
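The contract of the trapping expression (output value on success; input value, error code and error text on failure) can be illustrated with a plain Python wrapper. This is not the Spark custom expression from the talk: `trap`, the result layout and the code `ERR-300` are hypothetical.

```python
def trap(convert, error_code):
    """Wrap a conversion so failures become data instead of raising."""
    def wrapped(value):
        try:
            return {"output": convert(value), "error": None}
        except (ValueError, TypeError) as exc:
            return {
                "output": None,
                "error": {"input": value, "code": error_code, "message": str(exc)},
            }
    return wrapped


# Hypothetical usage: convert strings to integers, trapping bad values.
to_int = trap(int, "ERR-300")
ok = to_int("42")
bad = to_int("abc")
print(ok)
print(bad)
```

Returning the error as a value lets a pipeline keep processing the remaining records and route the trapped ones to an error sink afterwards.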
  • 36. Error Trapping – Example Continued
  • 37. Error Trapping – Example Continued
  • 39. Benefits Cross-Cutting Data Processing ▪ Scalable Engineering ▪ Separation of Responsibilities ▪ More Readable Code ▪ More Testable Code ▪ Easier to Maintain ▪ More ETL Features ▪ More Validation Features ▪ More Error Reporting Features ▪ Minimize Latency (from 10 min to 10 sec) ▪ Re-Use ▪ 50% or More Storage Savings ▪ 50% or More Compute Savings
  • 40. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.