www.mapflat.com
Test strategies for data
processing pipelines
Lars Albertsson, independent consultant (Mapflat)
Øyvind Løkling, Schibsted Products & Technology
Who’s talking?
Swedish Institute of Computer Science (test & debug tools)
Sun Microsystems (large machine verification)
Google (Hangouts, productivity)
Recorded Future (NLP startup) (data integrations)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling, productivity)
Schibsted Products & Tech (data processing & modelling)
Mapflat (independent data engineering consultant)
Agenda
● Data applications from a test perspective
● Testing stream processing products
● Testing batch processing products
● Data quality testing
Main focus is functional, regression testing
Prerequisites: Backend dev testing, basic data experience
Test value
For data-centric applications, in this order:
● Productivity
○ Move fast without breaking things
● Fast experimentation
○ 10% good ideas, 90% bad
● Data quality
○ Challenging, but more important than technical quality
● Technical quality
○ Technical failure => ops hassle, stale data
Streaming data product anatomy
[Diagram: Ingress (DB, Service) → Stream processing on a pub / sub unified log (Topic, Job, Pipeline) → Egress (Service, Export, Business intelligence, DB)]
Test concepts
[Diagram: a test harness surrounds the system under test (SUT) and a test fixture of 3rd party components (e.g. DB). Test input enters and a test oracle checks the result at a seam. A test framework (e.g. JUnit, Scalatest) connects the harness to IDEs and build tools.]
Test scopes
Unit
Class
Component
Integration
System / acceptance
● Pick stable seam
● Small scope
○ Fast?
○ Easy, simple?
● Large scope
○ Real app value?
○ Slow, unstable?
● Maintenance, cost
○ Pick few SUTs
Data-centric application properties
● Output = function(input, code)
○ No external factors => deterministic
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
Data-centric app test properties
● Output = function(input, code)
○ Perfect for test!
○ Avoid: external service calls, wall clock
● Pipeline/job edges are suitable seams
○ Focus on large tests
● Internal seams => high maintenance, low value
○ Omit unit tests, mocks, dependency injection!
● Long pipelines cross teams
○ Need for end-to-end tests, but culture challenge
Suitable seams, streaming
[Diagram: streaming product anatomy with labels Pub / sub, DB, Service, Topic, Job, Pipeline, Export, Business intelligence]
Streaming SUT, example harness
1. IDE / Gradle
2, 7. Scalatest
3. Docker (fixture: Kafka topic, DB)
4. Spark Streaming jobs
5. Test input
6. Test oracle (polling)
IDE, CI, debug integration
Test lifecycle
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture
For absence test, send dummy sync messages at end.
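The polling step (6) could be sketched as a small helper. This is only an illustration: `await_output`, the `poll_database` callable, and the default timeout and sleep interval are assumptions made for the sketch, not part of the harness described in the slides.

```python
import time
from typing import Callable, List


def await_output(poll_database: Callable[[], List[dict]],
                 expected_count: int,
                 timeout_s: float = 10.0,
                 interval_s: float = 0.01) -> List[dict]:
    """Poll until expected_count rows are visible or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        rows = poll_database()
        if len(rows) >= expected_count:
            return rows
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"saw {len(rows)} rows, expected {expected_count}")
        time.sleep(interval_s)
```

A test case would pass a function that queries the fixture database and then hand the returned rows to the test oracle.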
Input generation
● Input & output are denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
Test oracles
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
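As a rough illustration of an oracle that checks only the relevant fields, the sketch below uses a dotted-path getter as a stand-in for a real lens library such as JsonPath or Monocle. The helper names and the record layout are assumptions.

```python
from typing import Any, Dict, List


def get_path(record: Dict[str, Any], path: str) -> Any:
    """Minimal lens-like getter: follow a dotted path into nested dicts."""
    value: Any = record
    for key in path.split("."):
        value = value[key]
    return value


def mismatches(record: Dict[str, Any], expected: Dict[str, Any]) -> List[str]:
    """Compare only the listed paths; other fields may change freely."""
    return [path for path, want in expected.items()
            if get_path(record, path) != want]


def pageview_invariants_hold(record: Dict[str, Any]) -> bool:
    """Example invariant for one (hypothetical) data type."""
    return bool(record["user"]["id"]) and len(record["geo"]["country"]) == 2
```

Because the oracle names only the paths it checks, adding unrelated fields to the record does not break the test.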
Batch data product anatomy
[Diagram: Ingress (DB, Service, Import, Unified log) → ETL on cluster storage (Data lake; Dataset, Job, Pipeline) → Egress (Service, Export, Business intelligence, DB)]
Datasets
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
○ Compatible schema
○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
○ Immutable
○ Finite set of homogeneous records
○ E.g. MobileAdImpressions(hour="2016-02-06T13")
Directory datasets
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS
                                                             part-00000.json
                                                             part-00001.json
● red: privacy level
● pageviews: dataset class
● v1: schema version
● country=se/year=2015/month=11/day=4: instance parameters, Hive convention
● _SUCCESS: seal
● part-*.json: partitions
● Some tools, e.g. Spark, understand Hive name conventions
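The path layout above could be captured in a small value class. This is a sketch under stated assumptions: the class `DirectoryDataset` and its field names are hypothetical, and `is_sealed` checks a local filesystem rather than HDFS.

```python
import os
import tempfile
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DirectoryDataset:
    """Dataset instance = dataset class + parameters (hypothetical helper)."""
    root: str
    privacy_level: str
    dataset_class: str
    schema_version: str
    country: str
    day: date

    def path(self) -> str:
        d = self.day
        return "/".join([
            self.root, self.privacy_level, self.dataset_class,
            self.schema_version, f"country={self.country}",
            f"year={d.year}", f"month={d.month}", f"day={d.day}",
        ])

    def is_sealed(self) -> bool:
        """A dataset is complete only once the _SUCCESS marker exists."""
        return os.path.exists(os.path.join(self.path(), "_SUCCESS"))
```

Consumers would read a dataset only when `is_sealed()` is true, which keeps half-written output invisible.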
Batch processing
● outDatasets = code(inDatasets)
● Components that scale up
○ Spark, (Flink, Scalding, Crunch)
● And scale down
○ Local mode
○ Most jobs fit in one machine
Gradual refinement
1. Wash: time shuffle, dedup, ...
2. Decorate: geo, demographic, ...
3. Domain model: similarity, clusters, ...
4. Application model: recommendations, ...
Workflow manager
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
● Backfills for previous failures
● DSL describes dependencies
● Includes ingress & egress
Suggested: Luigi / Airflow
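The "build tool" rule can be sketched in a few lines. This is not how Luigi or Airflow are implemented; `should_run` and `backfill` are hypothetical names that only illustrate the decision: run a job instance when its inputs exist and its output is missing.

```python
from typing import Callable, Dict, Iterable, List


def should_run(inputs: Iterable[str], output: str,
               exists: Callable[[str], bool]) -> bool:
    """Run a job instance when all inputs exist and the output is missing."""
    return (not exists(output)) and all(exists(i) for i in inputs)


def backfill(jobs: Dict[str, List[str]],
             exists: Callable[[str], bool]) -> List[str]:
    """Return outputs whose jobs can run now, including earlier failures."""
    return [out for out, ins in jobs.items() if should_run(ins, out, exists)]
```

Evaluating the same rule over a range of past dataset instances is what makes backfilling of previously failed runs automatic.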
DSL DAG example (Luigi)
    from datetime import timedelta

    from luigi import DateHourParameter
    from luigi.contrib.hdfs import HdfsTarget
    from luigi.contrib.spark import SparkSubmitTask

    # Actions, UserDB and ABExperiments are upstream tasks defined elsewhere.

    class ClientActions(SparkSubmitTask):
        hour = DateHourParameter()

        def requires(self):
            # The last 24 hours of actions, plus the user database for that date
            return ([Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 24)] +
                    [UserDB(date=self.hour.date())])
        ...

    class ClientSessions(SparkSubmitTask):
        hour = DateHourParameter()

        def requires(self):
            return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)]
        ...

    class SessionsABResults(SparkSubmitTask):
        hour = DateHourParameter()

        def requires(self):
            return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]

        def output(self):
            # datetime supports strftime codes in the format spec
            return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                              "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
        ...
● Expressive, embedded DSL - a must for ingress, egress
○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
[Diagram: Actions, hourly + UserDB → time shuffle, user decorate → ClientActions, daily → form sessions → ClientSessions → A/B compare (with A/B tests) → A/B session evaluation. Nodes are dataset instances; steps are job (aka Task) classes.]
Batch processing testing
● Omit collection in SUT
○ Technically different
● Avoid clusters
○ Slow tests
● Seams
○ Between jobs
○ End of pipelines
[Diagram: data lake → pipeline of jobs → artifact of business value, e.g. service index]
Batch test frameworks - don’t
● Spark, Scalding, Crunch variants
● Seam == internal data structure
○ Omits I/O - common bug source
● Vendor lock-in
When switching batch framework:
○ Need tests for protection
○ Test rewrite is unnecessary burden
[Diagram: test input → job → test output]
Testing single job
Standard Scalatest harness
1. Generate input with f() into file://test_input/
2. Run job in local mode
3. Verify output in file://test_output/ with p()
● Don't commit test data: expensive to maintain. Generate / verify with code.
● Runs well in CI / from IDE
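The three steps can be sketched with plain files and a temporary directory. The slides use a Scalatest harness around a Spark job in local mode; the Python below is only an analogy, and `dedup_job`, `generate_input` and `read_output` are hypothetical stand-ins for the job under test, f() and p().

```python
import json
import os
import tempfile
from typing import Dict, List


def dedup_job(input_dir: str, output_dir: str) -> None:
    """Hypothetical job under test: drop duplicate records by 'id'."""
    seen, out = set(), []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as f:
            for line in f:
                rec = json.loads(line)
                if rec["id"] not in seen:
                    seen.add(rec["id"])
                    out.append(rec)
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "part-00000.json"), "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in out)


def generate_input(input_dir: str, ids: List[int]) -> None:
    """f(): write test input with code instead of committing files."""
    os.makedirs(input_dir, exist_ok=True)
    with open(os.path.join(input_dir, "part-00000.json"), "w") as f:
        f.writelines(json.dumps({"id": i}) + "\n" for i in ids)


def read_output(output_dir: str) -> List[Dict]:
    """Load the job output so that p() can inspect it."""
    with open(os.path.join(output_dir, "part-00000.json")) as f:
        return [json.loads(line) for line in f]


def test_dedup_job() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        in_dir = os.path.join(tmp, "test_input")
        out_dir = os.path.join(tmp, "test_output")
        generate_input(in_dir, [1, 2, 2, 3, 1])        # 1. generate input
        dedup_job(in_dir, out_dir)                     # 2. run locally
        ids = [r["id"] for r in read_output(out_dir)]  # 3. verify output
        assert sorted(ids) == [1, 2, 3]
```

The seam is the file system, so the test also exercises the job's I/O and survives a change of processing framework.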
Testing pipelines - two options
A: Test job with sequence of jobs
Standard Scalatest harness
1. Generate input with f() into file://test_input/
2. Run custom multi-job
3. Verify output in file://test_output/ with p()
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance
B: Customised workflow manager setup
+ Tests workflow logic
+ More authentic
- Workflow mgr setup for testability
- Difficult to debug
- Dataset handling with Python
● Both can be extended with egress DBs
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake implementations
○ Exceptions: Kubernetes, Cloud SQL, (S3)
● Integrate PaaS service as fixture component
○ Distribute access tokens, etc
○ Pay $ or $$$
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
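The rule for production data input, where individual regressions are acceptable but the weighted sum must not decline, might be expressed as below. The function names and the per-case score representation are assumptions made for the sketch.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted sum of per-case quality scores."""
    return sum(weights[case] * score for case, score in scores.items())


def quality_ok(baseline: Dict[str, float],
               candidate: Dict[str, float],
               weights: Dict[str, float]) -> bool:
    """Individual cases may regress if the weighted sum does not decline."""
    return weighted_score(candidate, weights) >= weighted_score(baseline, weights)
```

A golden set would instead compare case by case and tolerate no regression at all.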
Obtaining quality metrics
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
○ Reuse test oracle invariants in production
[Diagram: Hadoop / Spark counters and a quality assessment job feed a DB]
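The counter idea can be illustrated without a cluster. In the sketch below a `collections.Counter` stands in for Spark accumulators or Hadoop counters; the record fields and counter names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# Stand-in for Spark accumulators / Hadoop counters (assumption).
counters: Counter = Counter()


def parse_ages(records: Iterable[Dict]) -> Iterator[int]:
    """Yield plausible ages and count every odd code path."""
    for rec in records:
        age = rec.get("age")
        if age is None:
            counters["missing_age"] += 1      # odd code path => bump counter
            continue
        if not 0 <= age <= 150:
            counters["implausible_age"] += 1
            continue
        yield age
```

After a run, the counter values can be exported as quality metrics alongside the dataset.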
Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for publishing dataset
● Push aggregates to DB
○ Standard ops: monitor, alert
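A relative check against history could be as simple as the following sketch. The function name, the use of record counts as the aggregate, and the 50% threshold are all assumptions chosen for illustration.

```python
from statistics import mean
from typing import Sequence


def safe_to_publish(todays_count: int, history: Sequence[int],
                    max_drop: float = 0.5) -> bool:
    """Block publishing if today's count drops too far below the recent mean."""
    if not history:
        return True  # nothing to compare against yet
    return todays_count >= (1.0 - max_drop) * mean(history)
```

The workflow would publish the dataset only when the check passes and push the same aggregate to a monitored database otherwise.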
Computer program anatomy
[Diagram: input, process, and output stages. Labels: File, HID, Device, Window; in RAM: input data, variable, function, execution path, lookup structure, output data]
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers, document “golden path”
Top anti-patterns
1. Test as afterthought or in production
Data processing applications are suited for test!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework
7. Calling services / databases from (batch) jobs
8. Using wall clock time
9. Embedded fixture components
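Anti-pattern 8 is usually avoided by passing time in as data. The sketch below assumes a hypothetical `recent_events` function and event layout; the job never reads the wall clock, so the same input always gives the same output.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def recent_events(events: Iterable[Dict], now: datetime,
                  window: timedelta = timedelta(hours=1)) -> List[Dict]:
    """Keep events within `window` before `now`; `now` is a parameter."""
    return [e for e in events if now - window <= e["time"] <= now]
```

In a test, `now` is a fixed timestamp; in production it would typically be derived from the dataset instance parameters, such as the hour being processed.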
Further resources. Questions?
http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
http://www.mapflat.com/lands/resources/reading-list
http://www.slideshare.net/mathieu-bastian/the-mechanics-of-testing-large-data-pipelines-qcon-london-2016
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
Bonus slides
Unified log
● Replicated append-only log
● Pub / sub with history
● Decoupled producers and consumers
○ In source/deployment
○ In space
○ In time
● Recovers from link failures
● Replay on transformation bug fix
Credits: Confluent
Stream pipelines
● Parallelised jobs
● Read / write to Kafka
● View egress
○ Serving index
○ SQL / cubes for Analytics
● Stream egress
○ Services subscribe to topic
○ E.g. REST post for export
Credits: Apache Samza
The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from batch processing perspective
● Kept as long as permitted
● Technically homogeneous
○ Except for raw imports
Batch pipelines
● Things will break
○ Input will be missing
○ Jobs will fail
○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
● Backfill missing / failed
● Eventual correctness
[Diagram: cluster storage holding the data lake (pristine, immutable datasets) and intermediate datasets (derived, regenerable)]
Batch job
Job == function([input datasets]): [output datasets]
● No orthogonal concerns
○ Invocation
○ Scheduling
○ Input / output location
● Testable
● No other input factors, no side-effects
● Ideally: atomic, deterministic, idempotent
● Necessary for audit
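A job with these properties might be sketched as follows. The example assumes a local filesystem, where `os.replace` is an atomic rename; `run_job` and the record layout are hypothetical, and cluster storage would need its own commit mechanism.

```python
import json
import os
import tempfile
from typing import List


def run_job(input_paths: List[str], output_path: str) -> None:
    """Deterministic, idempotent, atomic: output depends only on the inputs."""
    if os.path.exists(output_path):
        return  # idempotent: an existing output is never rewritten
    total = 0
    for path in sorted(input_paths):          # fixed order => deterministic
        with open(path) as f:
            total += sum(json.loads(line)["count"] for line in f)
    out_dir = os.path.dirname(output_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"total": total}, f)
    os.replace(tmp_path, output_path)         # atomic rename, same filesystem
```

Invocation, scheduling and path selection stay outside the function, which is what makes it straightforward to call from a test.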
Data pipelines
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss
Data platform, pipeline chains
Common data infrastructure
Productivity, privacy, end-to-end agility, complexity
Beware: producer-consumer disconnect
Ingress / egress representation
Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...
Egress datasets are also atomic and immutable.
E.g. write full DB table / CF, switch service to use it, never change it.
Egress datasets
● Serving
○ Precomputed user query answers
○ Denormalised
○ Cassandra, (many)
● Export & Analytics
○ SQL (single node / Hive, Presto, ..)
○ Workbenches (Zeppelin)
○ (Elasticsearch, proprietary OLAP)
● BI / analytics tool needs change frequently
○ Prepare to redirect pipelines
Deployment
● Hg/git repo: Luigi DSL, jars, config, packaged as my-pipe-7.tar.gz
● All that a pipeline needs, installed atomically:
> pip install my-pipe-7.tar.gz
● Redundant cron schedule, higher frequency + backfill (Luigi range tools):
* 10 * * * bin/my_pipe_daily --backfill 14
[Diagram: Hg/git repo → my-pipe-7.tar.gz → workers, with Luigi daemon and HDFS]

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 

Test strategies for data processing pipelines

  • 1. Test strategies for data processing pipelines
    www.mapflat.com
    Lars Albertsson, independent consultant (Mapflat)
    Øyvind Løkling, Schibsted Products & Technology
  • 2. Who’s talking?
    Swedish Institute of Computer Science (test & debug tools)
    Sun Microsystems (large machine verification)
    Google (Hangouts, productivity)
    Recorded Future (NLP startup) (data integrations)
    Cinnober Financial Tech. (trading systems)
    Spotify (data processing & modelling, productivity)
    Schibsted Products & Tech (data processing & modelling)
    Mapflat (independent data engineering consultant)
  • 3. Agenda
    ● Data applications from a test perspective
    ● Testing stream processing products
    ● Testing batch processing products
    ● Data quality testing
    Main focus is functional, regression testing
    Prerequisites: Backend dev testing, basic data experience
  • 4. Test value
    For data-centric applications, in this order:
    ● Productivity
      ○ Move fast without breaking things
    ● Fast experimentation
      ○ 10% good ideas, 90% bad
    ● Data quality
      ○ Challenging, and more important than technical quality
    ● Technical quality
      ○ Technical failure => ops hassle, stale data
  • 5. Streaming data product anatomy
    [Diagram labels: Ingress (DB, Service); Stream processing on a Pub / sub unified log (Pipeline, Job, Topic); Egress (Service, Export, Business intelligence, DB)]
  • 6. Test concepts
    [Diagram labels: Test harness; Test fixture; System under test (SUT); 3rd party components (e.g. DB); Test input; Test oracle; Seam; Test framework (e.g. JUnit, Scalatest); IDEs; Build tools]
  • 7. Test scopes
    Scopes: Unit, Class, Component, Integration, System / acceptance
    ● Pick stable seam
    ● Small scope
      ○ Fast?
      ○ Easy, simple?
    ● Large scope
      ○ Real app value?
      ○ Slow, unstable?
    ● Maintenance, cost
      ○ Pick few SUTs
  • 8. Data-centric application properties
    ● Output = function(input, code)
      ○ No external factors => deterministic
    ● Pipeline and job endpoints are stable
      ○ Correspond to business value
    ● Internal abstractions are volatile
      ○ Reslicing in different dimensions is common
  • 9. Data-centric app test properties
    ● Output = function(input, code)
      ○ Perfect for test!
      ○ Avoid: external service calls, wall clock
    ● Pipeline/job edges are suitable seams
      ○ Focus on large tests
    ● Internal seams => high maintenance, low value
      ○ Omit unit tests, mocks, dependency injection!
    ● Long pipelines cross teams
      ○ Need for end-to-end tests, but culture challenge
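A minimal sketch of the "avoid wall clock" advice, assuming a job that filters events by recency. The processing time is passed in as a parameter, so the same input always gives the same output. The function and field names are hypothetical and not from the slides.

```python
from datetime import datetime, timedelta


def recent_events(events, now, window=timedelta(hours=1)):
    """Keep events whose time lies within `window` before `now`.

    `now` is an explicit argument (an assumption of this sketch). Calling
    datetime.now() inside the job would make the output depend on when it runs.
    """
    return [e for e in events if now - window <= e["time"] <= now]


events = [
    {"id": 1, "time": datetime(2016, 2, 6, 12, 10)},
    {"id": 2, "time": datetime(2016, 2, 6, 12, 50)},
]
result = recent_events(events, now=datetime(2016, 2, 6, 13, 30))
```

A test can then pick any `now` it likes and assert on the output without freezing or mocking the system clock.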
  • 10. Suitable seams, streaming
    [Diagram labels: Pub / sub; DB; Service; Topic; Job; Pipeline; Service; Export; Business intelligence; DB]
  • 11. Streaming SUT, example harness
    [Diagram labels: 1. IDE / Gradle; 2, 7. Scalatest; 3. Docker; 4. Spark Streaming jobs; 5. Test input; 6. Test oracle; Kafka; Topic; DB; Polling]
    IDE, CI, debug integration
  • 12. Test lifecycle
    1. Start fixture containers
    2. Await fixture ready
    3. Allocate test case resources
    4. Start jobs
    5. Push input data to Kafka
    6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
    7. While (moreTests) { Goto 3 }
    8. Tear down fixture
    For absence tests, send dummy sync messages at the end.
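Step 6 of the lifecycle can be sketched as a small polling helper. This is an illustration under stated assumptions, not the harness from the talk: `await_output`, `poll_database` and the in-memory `rows` list are hypothetical stand-ins for a real database query.

```python
import time


def await_output(poll, is_done, timeout_s=10.0, interval_s=0.01):
    """Poll until is_done(result) holds, or raise TimeoutError at the deadline."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = poll()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("expected output did not appear in time")
        time.sleep(interval_s)


# Stand-in for a database that fills up asynchronously: each poll sees one more row.
rows = []


def poll_database():
    rows.append(len(rows))
    return list(rows)


found = await_output(poll_database, lambda r: len(r) >= 3)
```

The clock is read only in the test harness, never in the job, so the system under test stays deterministic.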
  • 13. Input generation
    ● Input & output are denormalised & wide
    ● Fields are frequently changed
      ○ Additions are compatible
      ○ Modifications are incompatible => new, similar data type
    ● Static test input, e.g. JSON files
      ○ Unmaintainable
    ● Input generation routines
      ○ Robust to changes, reusable
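An input generation routine might look like the sketch below. The record type and its fields are invented for illustration. The point is that each test overrides only the fields it cares about, so adding a field to the record touches one function instead of many JSON files.

```python
import json


def page_view(**overrides):
    """Build a valid page view record; tests override only the fields they need."""
    record = {
        "user_id": "user-1",
        "url": "http://example.com/",
        "country": "se",
        "time": "2016-02-06T13:00:00Z",
    }
    record.update(overrides)
    return record


def to_json_lines(records):
    """Serialise records as one JSON document per line, as a job might read them."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


test_input = [page_view(), page_view(user_id="user-2", country="no")]
serialised = to_json_lines(test_input)
```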
  • 14. Test oracles
    ● Compare with expected output
    ● Check fields relevant for test
      ○ Robust to field changes
      ○ Reusable for new, similar types
    ● Tip: Use lenses
      ○ JSON: JsonPath (Java), Play JSON (Scala)
      ○ Case classes: Monocle
    ● Express invariants for each data type
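The oracle ideas above can be sketched in plain Python. The `pick` helper is a tiny hypothetical stand-in for a lens library such as the ones the slide names, and the record fields and invariants are made up for the example.

```python
def pick(record, path):
    """Follow a dotted path into nested dicts (a minimal stand-in for a lens)."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


def assert_fields(record, expected):
    """Compare only the fields a test cares about; other fields may change freely."""
    for path, want in expected.items():
        got = pick(record, path)
        assert got == want, f"{path}: expected {want!r}, got {got!r}"


def check_invariants(record):
    """Invariants for this hypothetical record type; returns a list of violations."""
    problems = []
    if not record.get("user_id"):
        problems.append("missing user_id")
    country = pick(record, "geo.country")
    if country != country.lower():
        problems.append("country not lower case")
    return problems


output = {"user_id": "user-1", "geo": {"country": "se", "city": "Stockholm"}, "extra": 42}
assert_fields(output, {"geo.country": "se"})
violations = check_invariants(output)
```

Because the oracle ignores `city` and `extra`, the test keeps passing when unrelated fields are added or changed.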
  • 15. Batch data product anatomy
    [Diagram labels: Ingress (Service, DB, Unified log, Import); ETL on Cluster storage (Data lake, Pipeline, Job, Dataset); Egress (Service, Export, Business intelligence, DB)]
  • 16. Datasets
    ● Pipeline equivalent of objects
    ● Dataset class == homogeneous records, open-ended
      ○ Compatible schema
      ○ E.g. MobileAdImpressions
    ● Dataset instance = dataset class + parameters
      ○ Immutable
      ○ Finite set of homogeneous records
      ○ E.g. MobileAdImpressions(hour="2016-02-06T13")
  • 17. Directory datasets
    hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
      _SUCCESS
      part-00000.json
      part-00001.json
    Path annotations:
    ● red: Privacy level
    ● pageviews: Dataset class
    ● v1: Schema version
    ● country=se/year=2015/month=11/day=4: Instance parameters, Hive convention
    ● _SUCCESS: Seal
    ● part-*.json: Partitions
    ● Some tools, e.g. Spark, understand Hive name conventions
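A small sketch of the directory convention, assuming the path layout shown on the slide (privacy level, dataset class, schema version, then Hive-style `key=value` directories). The helper names are hypothetical, and the `hdfs://` scheme is left out so the example stays file-system neutral.

```python
from pathlib import PurePosixPath


def dataset_path(privacy, dataset, version, **params):
    """Build a dataset instance path with Hive-style key=value directories.

    Parameters are emitted in the order the caller gives them.
    """
    parts = [privacy, dataset, f"v{version}"]
    parts += [f"{key}={value}" for key, value in params.items()]
    return str(PurePosixPath(*parts))


def is_sealed(file_names):
    """Treat a dataset instance as complete once the _SUCCESS marker exists."""
    return "_SUCCESS" in file_names


path = dataset_path("red", "pageviews", 1, country="se", year=2015, month=11, day=4)
```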
  • 18. Batch processing
    ● outDatasets = code(inDatasets)
    ● Components that scale up
      ○ Spark, (Flink, Scalding, Crunch)
    ● And scale down
      ○ Local mode
      ○ Most jobs fit in one machine
    Gradual refinement
    1. Wash - time shuffle, dedup, ...
    2. Decorate - geo, demographic, ...
    3. Domain model - similarity, clusters, ...
    4. Application model - Recommendations, ...
  • 19. Workflow manager
    ● Dataset “build tool”
    ● Run job instance when
      ○ input is available
      ○ output missing
    ● Backfills for previous failures
    ● DSL describes dependencies
    ● Includes ingress & egress
    Suggested: Luigi / Airflow
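The "build tool" rule (run a job instance when its input is available and its output is missing) can be sketched without any workflow manager. This is an illustration of the rule only, not of how Luigi or Airflow implement it, and the job table is hypothetical.

```python
def runnable(jobs, existing):
    """Return names of jobs whose inputs all exist and whose output is missing.

    jobs: dict of job name -> (list of input datasets, output dataset)
    existing: set of datasets that are already present
    """
    return sorted(
        name
        for name, (inputs, output) in jobs.items()
        if output not in existing and all(i in existing for i in inputs)
    )


jobs = {
    "client_actions": (["actions", "user_db"], "client_actions"),
    "client_sessions": (["client_actions"], "client_sessions"),
}
first = runnable(jobs, {"actions", "user_db"})
second = runnable(jobs, {"actions", "user_db", "client_actions"})
```

Running the same check repeatedly also gives backfill behaviour: an output that is missing after a failed run makes its job eligible again.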
  • 20. DSL DAG example (Luigi)
        from datetime import timedelta

        from luigi import DateHourParameter
        from luigi.contrib.hdfs import HdfsTarget
        from luigi.contrib.spark import SparkSubmitTask

        # Time shuffle, user decorate
        class ClientActions(SparkSubmitTask):
            hour = DateHourParameter()

            def requires(self):
                return [Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 24)] + \
                       [UserDB(date=self.hour.date())]
            ...

        # Form sessions
        class ClientSessions(SparkSubmitTask):
            hour = DateHourParameter()

            def requires(self):
                return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)]
            ...

        # A/B compare
        class SessionsABResults(SparkSubmitTask):
            hour = DateHourParameter()

            def requires(self):
                return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]

            def output(self):
                # datetime supports strftime codes in a format spec
                return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                                  "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
            ...
    [Diagram labels: Actions, hourly; UserDB; ClientActions, daily; ClientSessions; A/B tests; A/B session evaluation; Dataset instance; Job (aka Task) classes]
    ● Expressive, embedded DSL - a must for ingress, egress
      ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
  • 21. Batch processing testing
    ● Omit collection in SUT
      ○ Technically different
    ● Avoid clusters
      ○ Slow tests
    ● Seams
      ○ Between jobs
      ○ End of pipelines
    [Diagram labels: Data lake; Job; Pipeline; Artifact of business value, e.g. service index]
  • 22. Batch test frameworks - don’t
    ● Spark, Scalding, Crunch variants
    ● Seam == internal data structure
      ○ Omits I/O - common bug source
    ● Vendor lock-in
    When switching batch framework:
      ○ Need tests for protection
      ○ Test rewrite is unnecessary burden
    [Diagram labels: Test input; Job; Test output]
  • 23. Testing single job
    1. Generate input
    2. Run in local mode
    3. Verify output
    Test input and output files: don’t commit - expensive to maintain. Generate / verify with code.
    Runs well in CI / from IDE
    [Diagram labels: Standard Scalatest harness; file://test_input/; Job; file://test_output/; f(); p()]
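The three steps can be sketched with temporary directories. The slide uses a Scalatest harness and a job in local mode; here a toy pure-Python `dedup_job` stands in for the real job so the example is self-contained. All names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def dedup_job(input_dir, output_dir):
    """Toy stand-in for a batch job: drop duplicate event ids, then seal the output."""
    seen, kept = set(), []
    for part in sorted(Path(input_dir).glob("part-*.json")):
        for line in part.read_text().splitlines():
            event = json.loads(line)
            if event["id"] not in seen:
                seen.add(event["id"])
                kept.append(event)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "part-00000.json").write_text("\n".join(json.dumps(e) for e in kept))
    (out / "_SUCCESS").touch()


def run_test():
    with tempfile.TemporaryDirectory() as tmp:
        input_dir, output_dir = Path(tmp, "test_input"), Path(tmp, "test_output")
        # 1. Generate input with code rather than committing files.
        input_dir.mkdir()
        events = [{"id": 1}, {"id": 1}, {"id": 2}]
        (input_dir / "part-00000.json").write_text("\n".join(json.dumps(e) for e in events))
        # 2. Run the job locally, through its real file I/O.
        dedup_job(input_dir, output_dir)
        # 3. Verify output by reading the files back.
        lines = (output_dir / "part-00000.json").read_text().splitlines()
        return [json.loads(l)["id"] for l in lines], (output_dir / "_SUCCESS").exists()


ids, sealed = run_test()
```

Because the seam is the file system rather than an internal data structure, the same test would survive a switch of batch framework.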
  • 24. Testing pipelines - two options
    1. Generate input
    2. Run custom multi-job
    3. Verify output
    ● Test job with sequence of jobs
      + Runs in CI
      + Runs in IDE
      + Quick setup
      - Multi-job maintenance
    ● Customised workflow manager setup
      + Tests workflow logic
      + More authentic
      - Workflow mgr setup for testability
      - Difficult to debug
      - Dataset handling with Python
    ● Both can be extended with egress DBs
    [Diagram labels: Standard Scalatest harness; file://test_input/; file://test_output/; f(); p()]
  • 25. Testing with cloud services
    ● PaaS components do not work locally
      ○ Cloud providers should provide fake implementations
      ○ Exceptions: Kubernetes, Cloud SQL, (S3)
    ● Integrate PaaS service as fixture component
      ○ Distribute access tokens, etc
      ○ Pay $ or $$$
  • 26. Quality testing variants
    ● Functional regression
      ○ Binary, key to productivity
    ● Golden set
      ○ Extreme inputs => obvious output
      ○ No regressions tolerated
    ● (Saved) production data input
      ○ Individual regressions ok
      ○ Weighted sum must not decline
      ○ Beware of privacy
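The "weighted sum must not decline" rule for saved production input can be sketched as follows. The case names, scores and weights are invented, and how a per-case score is computed is left open as an assumption.

```python
def weighted_score(results, weights):
    """Weighted sum of per-case scores; cases without a weight count as 1.0."""
    return sum(score * weights.get(case, 1.0) for case, score in results.items())


def quality_ok(baseline, candidate, weights):
    """Individual cases may regress as long as the weighted total does not decline."""
    return weighted_score(candidate, weights) >= weighted_score(baseline, weights)


weights = {"popular_query": 3.0, "rare_query": 1.0}
baseline = {"popular_query": 0.80, "rare_query": 0.90}
candidate = {"popular_query": 0.85, "rare_query": 0.80}
accepted = quality_ok(baseline, candidate, weights)
```

Here `rare_query` regresses, but the gain on the heavier `popular_query` outweighs it, so the candidate is accepted.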
  • 27. Obtaining quality metrics
    ● Processing tool (Spark/Hadoop) counters
      ○ Odd code path => bump counter
    ● Dedicated quality assessment pipelines
      ○ Reuse test oracle invariants in production
    [Diagram labels: Hadoop / Spark counters; Quality assessment job; DB]
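A sketch of "odd code path => bump counter". A plain `collections.Counter` stands in for the processing tool's counters (an assumption made to keep the example runnable without a cluster), and the input format is hypothetical.

```python
from collections import Counter


def parse_events(lines, counters):
    """Parse 'user,country' lines; count odd code paths instead of failing the job."""
    parsed = []
    for line in lines:
        fields = line.split(",")
        if len(fields) != 2:
            counters["malformed_line"] += 1
            continue
        user, country = fields
        if not country:
            counters["missing_country"] += 1
        parsed.append({"user": user, "country": country or None})
    counters["parsed"] += len(parsed)
    return parsed


counters = Counter()
events = parse_events(["u1,se", "u2,", "garbage"], counters)
```

The counter values can then be pushed to a database and monitored like any other operational metric.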
  • 28. Quality testing in the process
    ● Binary self-contained
      ○ Validate in CI
    ● Relative vs history
      ○ E.g. large drops
      ○ Precondition for publishing dataset
    ● Push aggregates to DB
      ○ Standard ops: monitor, alert
    [Diagram labels: Code; DB; ∆?; ∆!]
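A "relative vs history" check used as a precondition for publishing might look like this sketch. The 50% threshold, the use of a plain mean, and the function name are assumptions chosen for illustration.

```python
def safe_to_publish(record_count, history, max_drop=0.5):
    """Return False if the count dropped by more than max_drop versus recent history.

    history: record counts of earlier dataset instances; with no history, accept.
    """
    if not history:
        return True
    typical = sum(history) / len(history)
    return record_count >= typical * (1.0 - max_drop)


history = [1000, 1100, 900]
```

For example, `safe_to_publish(950, history)` passes, while a count of 300 is held back for a human to inspect.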
  • 29. Computer program anatomy
    [Diagram labels: Input (File, HID, Device); Input data; Process in RAM (Variable, Function, Execution path, Lookup structure); Output data; Output (Window, File)]
  • 30. Data pipeline = yet another program
    Don’t veer from best practices
    ● Regression testing
    ● Design: Separation of concerns, modularity, etc
    ● Process: CI/CD, code review, static analysis tools
    ● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
    In data engineering, slipping is in the culture... :-(
    ● Mix in solid backend engineers, document “golden path”
  • 31. Top anti-patterns
    1. Test as afterthought or in production
       Data processing applications are suited for test!
    2. Static test input in version control
    3. Exact expected output test oracle
    4. Unit testing volatile interfaces
    5. Using mocks & dependency injection
    6. Tool-specific test framework
    7. Calling services / databases from (batch) jobs
    8. Using wall clock time
    9. Embedded fixture components
  • 34. Unified log
    ● Replicated append-only log
    ● Pub / sub with history
    ● Decoupled producers and consumers
      ○ In source/deployment
      ○ In space
      ○ In time
    ● Recovers from link failures
    ● Replay on transformation bug fix
    Credits: Confluent
  • 35. Stream pipelines
    ● Parallelised jobs
    ● Read / write to Kafka
    ● View egress
      ○ Serving index
      ○ SQL / cubes for Analytics
    ● Stream egress
      ○ Services subscribe to topic
      ○ E.g. REST post for export
    Credits: Apache Samza
  • 36. The data lake
    Unified log + snapshots
    ● Immutable datasets
    ● Raw, unprocessed
    ● Source of truth from batch processing perspective
    ● Kept as long as permitted
    ● Technically homogeneous
      ○ Except for raw imports
    [Diagram labels: Cluster storage; Data lake]
  • 37. Batch pipelines
    ● Things will break
      ○ Input will be missing
      ○ Jobs will fail
      ○ Jobs will have bugs
    ● Datasets must be rebuilt
    ● Determinism, idempotency
    ● Backfill missing / failed
    ● Eventual correctness
    [Diagram labels: Cluster storage; Data lake: pristine, immutable datasets; Intermediate; Derived, regenerable]
  • 38. Batch job
    Job == function([input datasets]): [output datasets]
    ● No orthogonal concerns
      ○ Invocation
      ○ Scheduling
      ○ Input / output location
    ● Testable
    ● No other input factors, no side-effects
    ● Ideally: atomic, deterministic, idempotent
    ● Necessary for audit
  • 39. Data pipelines
    Form teams that are driven by business cases & need
    Forward-oriented -> filters implicitly applied
    Beware of: duplication, tech chaos/autonomy, privacy loss
  • 40. Data platform, pipeline chains
    Common data infrastructure
    Productivity, privacy, end-to-end agility, complexity
    Beware: producer-consumer disconnect
  • 41. Ingress / egress representation
    Larger variation:
    ● Single file
    ● Relational database table
    ● Cassandra column family, other NoSQL
    ● BI tool storage
    ● BigQuery, Redshift, ...
    Egress datasets are also atomic and immutable.
    E.g. write full DB table / CF, switch service to use it, never change it.
  • 42. Egress datasets
    ● Serving
      ○ Precomputed user query answers
      ○ Denormalised
      ○ Cassandra, (many)
    ● Export & Analytics
      ○ SQL (single node / Hive, Presto, ..)
      ○ Workbenches (Zeppelin)
      ○ (Elasticsearch, proprietary OLAP)
    ● BI / analytics tool needs change frequently
      ○ Prepare to redirect pipelines
  • 43. Deployment
    All that a pipeline needs, installed atomically
    [Diagram labels: Hg/git repo; Luigi DSL, jars, config; my-pipe-7.tar.gz; HDFS; Luigi daemon; Workers; > pip install my-pipe-7.tar.gz]
    Redundant cron schedule, higher frequency + backfill (Luigi range tools):
    * 10 * * * bin/my_pipe_daily --backfill 14