SlideShare una empresa de Scribd logo
1 de 53
Descargar para leer sin conexión
www.mapflat.com
Top 10 data engineering
mistakes
Berlin Buzzwords, 2018-06-11
Lars Albertsson
www.mapflat.com, www.mimeria.com
1
www.mapflat.com
Why this talk?
● Sharing what I have learnt. Please do the same.
● Mistakes mentioned are directly or indirectly experienced.
● "Do this" / "Don't do that" == "Based on what I have seen, I have come to the conclusion that ..."
2
Learning
practices
Mastering
practices
Bending
practices
www.mapflat.com
Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant
○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc
● Founder @ Mimeria - data-value-as-a-service
3
www.mapflat.com
● What is the goal?
Mistake == obstacle towards end goal
4
www.mapflat.com
Blueprint
5
Data
lake
Cluster storage
Ingress Egress
DB
Service
DatasetJob
Pipeline
Service
Export
Business
intelligence
DB
DB
Import
DB
Cold store
www.mapflat.com
Cron schedule: 10 * * * *
A first pipeline
6
#! /bin/bash
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
h=$(date +%h)
page_view="/cold/red/page/v1/$y/$m/$d/$h"
order="/prod/red/order/v1/$y/$m/$d"
order_hour="$order/$h"
session="/prod/red/session/v1/$y/$m/$d"
kpi="/prod/green/kpi/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
$ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour
if [ $h == 23 ]; then
$ss --class ai.acme.FormSession kpi.jar $order $session
$ss --class ai.acme.Kpi kpi.jar $session $kpi
fi
● Input: Page views from www (hourly)
● Output: key performance indexes (daily)
○ Number of orders
○ Web session length
www.mapflat.com
Cron schedule: 10 * * * *
Cron schedule: 15 1 * * *
● Works great day 1,2,3,4
A first pipeline
7
#! /bin/bash
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
h=$(date +%h)
page_view="/cold/red/page/v1/$y/$m/$d/$h"
order="/prod/red/order/v1/$y/$m/$d"
order_hour="$order/$h"
session="/prod/red/session/v1/$y/$m/$d"
kpi="/prod/green/kpi/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
$ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour
if [ $h == 23 ]; then
$ss --class ai.acme.FormSession kpi.jar $order $session
$ss --class ai.acme.Kpi kpi.jar $session $kpi
fi
#! /bin/bash
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
order="/prod/red/order/v1/$y/$m/$d"
campaign="/cold/green/campaign/v1/$y/$m/$d"
impact="/cold/green/impact/v1/$y/$m/$d"
session="/prod/red/session/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
$ss --class ai.acme.CampaignImpact kpi.jar $order $campaign 
$session $impact
www.mapflat.com
Cron schedule: 10 * * * *
Cron schedule: 15 1 * * *
● Works great day 1,2,3,4
● Day 5: FormSession not done by 1:15
Job fails
A brittle pipeline
8
#! /bin/bash
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
h=$(date +%h)
page_view="/cold/red/page/v1/$y/$m/$d/$h"
order="/prod/red/order/v1/$y/$m/$d"
order_hour="$order/$h"
session="/prod/red/session/v1/$y/$m/$d"
kpi="/prod/green/kpi/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
$ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour
if [ $h == 23 ]; then
$ss --class ai.acme.FormSession kpi.jar $order $session
$ss --class ai.acme.Kpi kpi.jar $session $kpi
fi
#! /bin/bash
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
order="/prod/red/order/v1/$y/$m/$d"
campaign="/cold/green/campaign/v1/$y/$m/$d"
impact="/cold/green/impact/v1/$y/$m/$d"
session="/prod/red/session/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
$ss --class ai.acme.CampaignImpact kpi.jar $order $campaign 
$session $impact
www.mapflat.com
Cron schedule: 10 * * * *
Cron schedule: 15 1 * * *
● Works great day 1,2,3,4
● Day 5: FormSession not done by 1:15 Silent data corruption
● Day 8: FilterOrders fails for hour 6
● Day 13: Page views for hour 19 only partially done at 20:10
A corrupt pipeline
9
#! /bin/bash
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
h=$(date +%h)
page_view="/cold/red/page/v1/$y/$m/$d/$h"
order="/prod/red/order/v1/$y/$m/$d"
order_hour="$order/$h"
session="/prod/red/session/v1/$y/$m/$d"
kpi="/prod/green/kpi/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
$ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour
if [ $h == 23 ]; then
$ss --class ai.acme.FormSession kpi.jar $order $session
$ss --class ai.acme.Kpi kpi.jar $session $kpi
fi
#! /bin/bash
y=$(date +%Y)
m=$(date +%m)
d=$(date +%d)
order="/prod/red/order/v1/$y/$m/$d"
campaign="/cold/green/campaign/v1/$y/$m/$d"
impact="/cold/green/impact/v1/$y/$m/$d"
session="/prod/red/session/v1/$y/$m/$d"
ss="spark-submit --master spark://spark.acme.ai:7077"
$ss --class ai.acme.CampaignImpact kpi.jar $order $campaign 
$session $impact
www.mapflat.com
● A workflow orchestrator manages the dependencies
● Weak orchestrators use graphical / XML dependency language
○ Not expressive enough
● Hadoop vendors + cloud vendors ship weak orchestrators
○ Except Google
Mistake 1: No/weak workflow orchestration
10
www.mapflat.com
Remedy 1: Python-DSL orchestrators
Luigi
● Simple, stable
● Easy to run
● Slim DataOps
● Does one thing well
11
Airflow
● Good momentum
○ Google managed service
● More moving parts
● Complex DataOps
● Includes scheduler & monitoring UI
www.mapflat.com
Hadoop has a SQL interface, right?
● Let's pretend it is an RDBMS
12
Data
lake
Hadoop
Service
Service
Transactions
www.mapflat.com
Hadoop for everything
● Or a NoSQL database
13
Data
lake
Hadoop
Service
Service
Service
Queries
www.mapflat.com
Mistake 2: New technology, same old patterns
● Distributed technology is single-purpose
○ RDBMS == multi-purpose
● Hadoop + ecosystem = offline
Big data components + small data patterns → worst of both worlds
14
www.mapflat.com
● New technology is not the key to success
● Benefits of big data == collaboration and pipelines
Data democratisation
Remedy 2: Work patterns >> technology
15
Stream storage
Data lake
Offline, forgiving environment
→
iteration speed
www.mapflat.com
● Wait for all service nodes to send data
○ Then start processing
● For how long?
○ Dashboards?
○ Reports?
We need all the data
16
ServiceServiceServiceServiceServiceService
www.mapflat.com
● Start right away
● Hope we have most data
● If we get x% more, rerun
We need it now
17
ServiceServiceServiceServiceServiceService
www.mapflat.com
● We have an IP→ geo service
● Decorate incoming events!
● Works great until
○ service fails
○ has low quality geo data
○ has a bug
○ we overload it
Let's improve the events
18
Service
www.mapflat.com
● DB has data. We wants it.
● Just read it from production?
● Works until
○ You overload the database
○ You need to rerun the job later
Lazy data retrieval
19
DB
www.mapflat.com
Mistake 3: Deviate from functional principles
● Immutability
● Idempotency
○ Jobs must be able to rerun without harmful effects
● Reproducibility
○ Necessary for data science
○ Many exceptions - be conscious
20
www.mapflat.com
Knock on the back door
● Let's dump the DB to lake
● Spark/Sqoop can speak JDBC
● The more mappers, the merrier?
21
Data lake
DB
Service
Our preciousss
www.mapflat.com
Knock on the front door
● We can dump from the REST interface
○ Unavailability by elephant
22
Data lake
DB
Service
Our preciousss
www.mapflat.com
Here is a new ML model for you
● New ML algorithm is much better
● Let's broadcast new
recommendation indexes
● Why is the Atlantic link saturated?
2323
DC1 DC2
DC3
Credits: Tuplejump
www.mapflat.com
Mistake 4: Weak online / offline separation
24
10000s of customers, imprecise feedback
Need low probability =>
Proactive prevention =>
Low ROI
Risk = probability * impact
www.mapflat.com
Mistake 4: Weak online / offline separation
25
10000s of customers, imprecise feedback
Need low probability =>
Proactive prevention =>
Low ROI
10s of employees, precise feedback
Ok with medium probability =>
Reactive repair =>
High ROI
Risk = probability * impact
www.mapflat.com
Remedy 4: Careful handover
26
Fraud
serviceFraud
model
Orders Orders
Replication /
Backup
Standard procedures Standard proceduresLightweight procedures
Careful handover Careful handover
www.mapflat.com
Bring on the clusters
● "Our clients have petabytes, and they want to do machine learning."
Ali Ghodsi, Databricks CEO
100s TBs/day
KBs/day
● Why not use scalable components?
○ In case we get 100M users?
27
www.mapflat.com
Distributed processing is difficult
28
Data skew
More
parameters
to tune
Logs are
distributed
Problem not
parallelisable
More
sysadmin
Errors not
reproducible
Serialisation
problems
Noisy
neighbours
Excessive
shuffling
Limited
speedup
Mysteriously
slow nodes
Immature
frameworks
Difficult to
reuse old code
www.mapflat.com
Mistake 5: Premature scaling
● Most companies have GBs, not PBs
○ Limited by human users
● Largest cloud instances ~4 TB memory.
○ Your data fits
○ Or: One time period of your data fits
Primary value of "Big Data" is not data size, but data sharing and data agility.
Avoid clusters until necessary - use single nodes. E.g. "spark-submit --master=local"
29
www.mapflat.com
Slower by the day
● What if we want top list of last 365 days, every day?
● Why is this team always using 20% of the cluster, every day?
30
www.mapflat.com
● Read it all:
● Incremental counting:
Strive to parallelise workflows, not jobs
Mistake 6: Read all the data
31
OrderCountryCount
CountryTopList
Order User
CountryTopList
Order User
www.mapflat.com
Context, streaming
● Data lake has a real-time cousin -
the unified log
● Stream storage bus
○ Kafka / cloud service
● Pipelines of stream jobs
32
Job
Ads Search Feed
App App App
StreamStream Stream
Stream Stream Stream
Job
Job
Stream Stream Stream
Job
Job
Data lake
Business
intelligence
Job
www.mapflat.com
We need real-time everything
● Result now >> result in two hours?
● Today's numbers >> yesterday's?
● Fraud analysis must be quick?
● The demos look great!
"Because life doesn't happen in batch."
- Audi @ Kafka Summit
33
www.mapflat.com
Streaming operations
● Streaming ops is difficult
○ Schema change needed here
34
Job
StreamStream Stream
Stream Stream Stream
Job
Job
Stream Stream Stream
Job
Job Job
www.mapflat.com
Streaming operations
● Streaming ops is difficult
○ Three days of invalid output here
35
Job
StreamStream Stream
Stream Stream Stream
Job
Job
Stream Stream Stream
Job
Job Job
www.mapflat.com
Invalid output, batch operations
36
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Done!
Backfill is automatic (Luigi)
www.mapflat.com
Streaming operations
● Works for a single job, not pipeline. :-(
37
Job
StreamStream Stream
Stream Stream Stream
Job
Job
Stream Stream Stream
Job
Job Job
Reprocessing in Kafka Streams
www.mapflat.com
Mistake 7: Premature streaming
● Stream processing reasonable if one of:
○ Strong use case - worth the cost
○ Single job pipeline
○ Low data quality ok
● Stream tooling is getting better
○ Thanks Confluent, Google!
"Because humans operate on a batch time scale."
- Me @ Berlin Buzzwords.
38
www.mapflat.com
We see a discrepancy
● These numbers don't add up
○ Have you seen anything suspicious in your pipeline?
○ How many records are invalid?
○ Do these datasets agree?
○ What is your data quality, BTW?
● "Who put 'content-type: bullshit' in these messages?"
"We are the sewer of the company. All garbage dumped upstream ends up here."
39
www.mapflat.com
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
40
www.mapflat.com
Mistake 8: Crunching in the dark
● Data products are organic
● Measure odd and unexpected values
○ Spark, Flink, Beam: counters
○ Difficult with SQL processing
● Measure consistency
○ Within records
○ Between records
○ Between datasets
● Be proactive, display in office!
41
www.mapflat.com
Something broke our pipeline
● "Invalid UTF-8 encoding"
○ It worked yesterday!
● "Field not found"
○ Who changed the schema?
42
www.mapflat.com
Pipeline fossil
● "Invalid UTF-8 encoding"
○ It worked yesterday!
● "Field not found"
○ Who changed the schema?
43
● "Yes, we should change that field"
○ But it is too costly
● "We built a new pipeline in a few weeks"
○ And removed the old after 18 months
www.mapflat.com
Mistake 9: Weak end-to-end agility
Data agility measure: Time from client logging field → pipeline → use in feature / BI
● Siloed company: ~6 months
● Agile company: 1 month
● Coordinated company: days
Remedy: End-to-end testing. Rare, often in conflict with company culture.
44
www.mapflat.com
Users everywhere
Functional architecture:
● Event-oriented - append only
● Immutability
● At-least-once semantics
● Reproducibility
○ Through 1000s of copies
● Redundancy
45
GDPR came along
- Please delete me?
- What data have you got on me?
- Please correct this data
- Hold on a second...
www.mapflat.com
Negative value heterogeneity
● We use Avro everywhere, why is this one in Parquet?
○ Beam did not support Parquet. :-(
● Why do we have 25 time formats?
○ ISO 8601, UTC assumed
○ ISO 8601 + timezone
○ Millis since epoch, UTC
○ Nanos since epoch, UTC
○ Millis since epoch, user local time
○ ...
○ ...
○ Float of seconds since epoch, as string. WTF?!?
46
www.mapflat.com
Mistake 10: Late governance
Data governance for techies:
Proactive things you do in order to avoid later WTF.
47
GDPR compliance
● Erasure
● Anonymisation & pseudonymisation
● Enforced retention
● Access limitations
Security
● Blast radius
● Disaster
recovery
Development speed
● Technical processes
● Domain definitions
● Schema evolution
● End-to-end testing
Data democracy
● Data lake rules
● Data quality
communication
Financial compliance
● Reproducibility
● Data completeness
● Data consistency
www.mapflat.com
Technology over principles
Complexity over business value
Premature
scaling
Premature
streaming
Late
governance
No workflow
orchestration
End-to-end
agility
Crunching in
the dark
Read all
the dataNew tech,
old patterns
Functional
principles
Offline / online
Simplicity
Driven by
value
Scope down
Hype xor
profit
Common themes
48
www.mapflat.com
Related work
● https://tinyurl.com/mapflat-10-stumble
● Avoiding big data anti-patterns - https://youtu.be/t63SF2UkD0A
● https://www.slideshare.net/JesseAnderson/the-five-dysfunctions-of-a-data-engineering-team
● https://tinyurl.com/mapflat-reading-list
● http://www.mapflat.com/presentations/
49
www.mapflat.com
Bonus slides
50
www.mapflat.com
class FraudCandidate(SparkSubmitTask):
date = DateParameter()
def requires(self):
return [OrderDaily(date=self.date), self._credibility()]
def _credibility(self):
max_age_days = 3
for lookback in range(max_age_days):
candidate = CredibilityScore(date - timedelta(days=lookback))
if candidate.output().exists():
return candidate
return CredibilityScore(date - timedelta(days=max_age_days))
Dynamic DAGs
CredibilityScore
FraudCandidate
Order
51
www.mapflat.com
Luigi or Airflow?
Luigi & Kubernetes:
● 1 hour to production
52
Airflow & Kubernetes:
● AIRFLOW-1314, created 2017-06-16
○ 9/12 sub-tasks resolved
○ 6 PRs
○ 28 commits
○ ~30 files touched
○ ~2000 lines of code
● Airflow monitoring has value, however
myjob.yaml:
kind: CronJob
spec:
schedule: "17 * * * *"
jobTemplate:
spec:
containers:
- name: "MyJob"
image: myregistry/mypipeline:latest
command: ["luigi"]
args: ["--module", "mypipeline",
"RangeDaily", "--of",
"MyJob",
"--days-back", "14"]
www.mapflat.com
Counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
53
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)
val orders = read(orderPath)
val users = read(userPath)
val orderNoUserCounter = longAccumulator("order-no-user")
val joined: C[(Order, Option[User])] = orders
.groupBy(_.userId)
.leftJoin(users.groupBy(_.id))
.values
val orderWithUser: C[(Order, User)] = joined
.flatMap( orderUser match
case (order, Some(user)) => Some((order, user))
case (order, None) => {
orderNoUserCounter.add(1)
None
})
SQL: Nope

Más contenido relacionado

Más de Lars Albertsson

Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityLars Albertsson
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processingLars Albertsson
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisisLars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platformLars Albertsson
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science teamLars Albertsson
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big dataLars Albertsson
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practiceLars Albertsson
 

Más de Lars Albertsson (20)

Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Data democratised
Data democratisedData democratised
Data democratised
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Privacy by design
Privacy by designPrivacy by design
Privacy by design
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
10 ways to stumble with big data
10 ways to stumble with big data10 ways to stumble with big data
10 ways to stumble with big data
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
 

Último

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Último (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Top 10 data engineering mistakes

  • 1. www.mapflat.com Top 10 data engineering mistakes Berlin Buzzwords, 2018-06-11 Lars Albertsson www.mapflat.com, www.mimeria.com 1
  • 2. www.mapflat.com Why this talk? ● Sharing what I have learnt. Please do the same. ● Mistakes mentioned are directly or indirectly experienced. ● "Do this" / "Don't do that" == "Based on what I have seen, I have come to the conclusion that ..." 2 Learning practices Mastering practices Bending practices
  • 3. www.mapflat.com Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat - independent data engineering consultant ○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc ● Founder @ Mimeria - data-value-as-a-service 3
  • 4. www.mapflat.com ● What is the goal? Mistake == obstacle towards end goal 4
  • 6. www.mapflat.com Cron schedule: 10 * * * * A first pipeline 6 #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) h=$(date +%h) page_view="/cold/red/page/v1/$y/$m/$d/$h" order="/prod/red/order/v1/$y/$m/$d" order_hour="$order/$h" session="/prod/red/session/v1/$y/$m/$d" kpi="/prod/green/kpi/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour if [ $h == 23 ]; then $ss --class ai.acme.FormSession kpi.jar $order $session $ss --class ai.acme.Kpi kpi.jar $session $kpi fi ● Input: Page views from www (hourly) ● Output: key performance indexes (daily) ○ Number of orders ○ Web session length
  • 7. www.mapflat.com Cron schedule: 10 * * * * Cron schedule: 15 1 * * * ● Works great day 1,2,3,4 A first pipeline 7 #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) h=$(date +%h) page_view="/cold/red/page/v1/$y/$m/$d/$h" order="/prod/red/order/v1/$y/$m/$d" order_hour="$order/$h" session="/prod/red/session/v1/$y/$m/$d" kpi="/prod/green/kpi/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour if [ $h == 23 ]; then $ss --class ai.acme.FormSession kpi.jar $order $session $ss --class ai.acme.Kpi kpi.jar $session $kpi fi #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) order="/prod/red/order/v1/$y/$m/$d" campaign="/cold/green/campaign/v1/$y/$m/$d" impact="/cold/green/impact/v1/$y/$m/$d" session="/prod/red/session/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.CampaignImpact kpi.jar $order $campaign $session $impact
  • 8. www.mapflat.com Cron schedule: 10 * * * * Cron schedule: 15 1 * * * ● Works great day 1,2,3,4 ● Day 5: FormSession not done by 1:15 Job fails A brittle pipeline 8 #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) h=$(date +%h) page_view="/cold/red/page/v1/$y/$m/$d/$h" order="/prod/red/order/v1/$y/$m/$d" order_hour="$order/$h" session="/prod/red/session/v1/$y/$m/$d" kpi="/prod/green/kpi/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour if [ $h == 23 ]; then $ss --class ai.acme.FormSession kpi.jar $order $session $ss --class ai.acme.Kpi kpi.jar $session $kpi fi #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) order="/prod/red/order/v1/$y/$m/$d" campaign="/cold/green/campaign/v1/$y/$m/$d" impact="/cold/green/impact/v1/$y/$m/$d" session="/prod/red/session/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.CampaignImpact kpi.jar $order $campaign $session $impact
  • 9. www.mapflat.com Cron schedule: 10 * * * * Cron schedule: 15 1 * * * ● Works great day 1,2,3,4 ● Day 5: FormSession not done by 1:15 Silent data corruption ● Day 8: FilterOrders fails for hour 6 ● Day 13: Page views for hour 19 only partially done at 20:10 A corrupt pipeline 9 #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) h=$(date +%h) page_view="/cold/red/page/v1/$y/$m/$d/$h" order="/prod/red/order/v1/$y/$m/$d" order_hour="$order/$h" session="/prod/red/session/v1/$y/$m/$d" kpi="/prod/green/kpi/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.FilterOrders kpi.jar $page_view $order_hour if [ $h == 23 ]; then $ss --class ai.acme.FormSession kpi.jar $order $session $ss --class ai.acme.Kpi kpi.jar $session $kpi fi #! /bin/bash y=$(date +%Y) m=$(date +%m) d=$(date +%d) order="/prod/red/order/v1/$y/$m/$d" campaign="/cold/green/campaign/v1/$y/$m/$d" impact="/cold/green/impact/v1/$y/$m/$d" session="/prod/red/session/v1/$y/$m/$d" ss="spark-submit --master spark://spark.acme.ai:7077" $ss --class ai.acme.CampaignImpact kpi.jar $order $campaign $session $impact
  • 10. www.mapflat.com ● A workflow orchestrator manages the dependencies ● Weak orchestrators use graphical / XML dependency language ○ Not expressive enough ● Hadoop vendors + cloud vendors ship weak orchestrators ○ Except Google Mistake 1: No/weak workflow orchestration 10
  • 11. www.mapflat.com Remedy 1: Python-DSL orchestrators Luigi ● Simple, stable ● Easy to run ● Slim DataOps ● Does one thing well 11 Airflow ● Good momentum ○ Google managed service ● More moving parts ● Complex DataOps ● Includes scheduler & monitoring UI
  • 12. www.mapflat.com Hadoop has a SQL interface, right? ● Let's pretend it is an RDBMS 12 Data lake Hadoop Service Service Transactions
  • 13. www.mapflat.com Hadoop for everything ● Or a NoSQL database 13 Data lake Hadoop Service Service Service Queries
  • 14. www.mapflat.com Mistake 2: New technology, same old patterns ● Distributed technology is single-purpose ○ RDBMS == multi-purpose ● Hadoop + ecosystem = offline Big data components + small data patterns → worst of both worlds 14
  • 15. www.mapflat.com ● New technology is not the key to success ● Benefits of big data == collaboration and pipelines Data democratisation Remedy 2: Work patterns >> technology 15 Stream storage Data lake Offline, forgiving environment → iteration speed
  • 16. www.mapflat.com ● Wait for all service nodes to send data ○ Then start processing ● For how long? ○ Dashboards? ○ Reports? We need all the data 16 ServiceServiceServiceServiceServiceService
  • 17. www.mapflat.com ● Start right away ● Hope we have most data ● If we get x% more, rerun We need it now 17 ServiceServiceServiceServiceServiceService
  • 18. www.mapflat.com ● We have an IP→ geo service ● Decorate incoming events! ● Works great until ○ service fails ○ has low quality geo data ○ has a bug ○ we overload it Let's improve the events 18 Service
  • 19. www.mapflat.com ● DB has data. We wants it. ● Just read it from production? ● Works until ○ You overload the database ○ You need to rerun the job later Lazy data retrieval 19 DB
  • 20. www.mapflat.com Mistake 3: Deviate from functional principles ● Immutability ● Idempotency ○ Jobs must be able to rerun without harmful effects ● Reproducibility ○ Necessary for data science ○ Many exceptions - be conscious 20
  • 21. www.mapflat.com Knock on the back door ● Let's dump the DB to lake ● Spark/Sqoop can speak JDBC ● The more mappers, the merrier? 21 Data lake DB Service Our preciousss
  • 22. www.mapflat.com Knock on the front door ● We can dump from the REST interface ○ Unavailability by elephant 22 Data lake DB Service Our preciousss
  • 23. www.mapflat.com Here is a new ML model for you ● New ML algorithm is much better ● Let's broadcast new recommendation indexes ● Why is the Atlantic link saturated? 2323 DC1 DC2 DC3 Credits: Tuplejump
  • 24. www.mapflat.com Mistake 4: Weak online / offline separation 24 10000s of customers, imprecise feedback Need low probability => Proactive prevention => Low ROI Risk = probability * impact
  • 25. www.mapflat.com Mistake 4: Weak online / offline separation 25 10000s of customers, imprecise feedback Need low probability => Proactive prevention => Low ROI 10s of employees, precise feedback Ok with medium probability => Reactive repair => High ROI Risk = probability * impact
  • 26. www.mapflat.com Remedy 4: Careful handover 26 Fraud serviceFraud model Orders Orders Replication / Backup Standard procedures Standard proceduresLightweight procedures Careful handover Careful handover
  • 27. www.mapflat.com Bring on the clusters ● "Our clients have petabytes, and they want to do machine learning." Ali Ghodsi, Databricks CEO 100s TBs/day KBs/day ● Why not use scalable components? ○ In case we get 100M users? 27
  • 28. www.mapflat.com Distributed processing is difficult 28 Data skew More parameters to tune Logs are distributed Problem not parallelisable More sysadmin Errors not reproducible Serialisation problems Noisy neighbours Excessive shuffling Limited speedup Mysteriously slow nodes Immature frameworks Difficult to reuse old code
  • 29. www.mapflat.com Mistake 5: Premature scaling ● Most companies have GBs, not PBs ○ Limited by human users ● Largest cloud instances ~4 TB memory. ○ Your data fits ○ Or: One time period of your data fits Primary value of "Big Data" is not data size, but data sharing and data agility. Avoid clusters until necessary - use single nodes. E.g. "spark-submit --master=local" 29
  • 30. www.mapflat.com Slower by the day ● What if we want top list of last 365 days, every day? ● Why is this team always using 20% of the cluster, every day? 30
  • 31. www.mapflat.com ● Read it all: ● Incremental counting: Strive to parallelise workflows, not jobs Mistake 6: Read all the data 31 OrderCountryCount CountryTopList Order User CountryTopList Order User
  • 32. www.mapflat.com Context, streaming ● Data lake has a real-time cousin - the unified log ● Stream storage bus ○ Kafka / cloud service ● Pipelines of stream jobs 32 Job Ads Search Feed App App App StreamStream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Data lake Business intelligence Job
  • 33. www.mapflat.com We need real-time everything ● Result now >> result in two hours? ● Today's numbers >> yesterday's? ● Fraud analysis must be quick? ● The demos look great! "Because life doesn't happen in batch." - Audi @ Kafka Summit 33
  • 34. www.mapflat.com Streaming operations ● Streaming ops is difficult ○ Schema change needed here 34 Job StreamStream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Job
  • 35. www.mapflat.com Streaming operations ● Streaming ops is difficult ○ Three days of invalid output here 35 Job StreamStream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Job
  • 36. www.mapflat.com Invalid output, batch operations 36 1. Revert serving datasets to old 2. Fix bug 3. Remove faulty datasets 4. Done! Backfill is automatic (Luigi)
  • 37. www.mapflat.com Streaming operations ● Works for a single job, not pipeline. :-( 37 Job StreamStream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Job Reprocessing in Kafka Streams
  • 38. www.mapflat.com Mistake 7: Premature streaming ● Stream processing reasonable if one of: ○ Strong use case - worth the cost ○ Single job pipeline ○ Low data quality ok ● Stream tooling is getting better ○ Thanks Confluent, Google! "Because humans operate on a batch time scale." - Me @ Berlin Buzzwords. 38
  • 39. www.mapflat.com We see a discrepancy ● These numbers don't add up ○ Have you seen anything suspicious in your pipeline? ○ How many records are invalid? ○ Do these datasets agree? ○ What is your data quality, BTW? ● "Who put 'content-type: bullshit' in these messages?" "We are the sewer of the company. All garbage dumped upstream ends up here." 39
  • 40. www.mapflat.com Data quality dimensions ● Timeliness ○ E.g. the customer engagement report was produced at the expected time ● Correctness ○ The numbers in the reports were calculated correctly ● Completeness ○ The report includes information on all customers, using all information from the whole time period ● Consistency ○ The customer summaries are all based on the same time period 40
  • 41. www.mapflat.com Mistake 8: Crunching in the dark ● Data products are organic ● Measure odd and unexpected values ○ Spark, Flink, Beam: counters ○ Difficult with SQL processing ● Measure consistency ○ Within records ○ Between records ○ Between datasets ● Be proactive, display in office! 41
  • 42. www.mapflat.com Something broke our pipeline ● "Invalid UTF-8 encoding" ○ It worked yesterday! ● "Field not found" ○ Who changed the schema? 42
  • 43. www.mapflat.com Pipeline fossil ● "Invalid UTF-8 encoding" ○ It worked yesterday! ● "Field not found" ○ Who changed the schema? 43 ● "Yes, we should change that field" ○ But it is too costly ● "We built a new pipeline in a few weeks" ○ And removed the old after 18 months
  • 44. www.mapflat.com Mistake 9: Weak end-to-end agility Data agility measure: Time from client logging field → pipeline → use in feature / BI ● Siloed company: ~6 months ● Agile company: 1 month ● Coordinated company: days Remedy: End-to-end testing. Rare, often in conflict with company culture. 44
  • 45. www.mapflat.com Users everywhere Functional architecture: ● Event-oriented - append only ● Immutability ● At-least-once semantics ● Reproducibility ○ Through 1000s of copies ● Redundancy 45 GDPR came along - Please delete me? - What data have you got on me? - Please correct this data - Hold on a second...
  • 46. www.mapflat.com Negative value heterogeneity ● We use Avro everywhere, why is this one in Parquet? ○ Beam did not support Parquet. :-( ● Why do we have 25 time formats? ○ ISO 8601, UTC assumed ○ ISO 8601 + timezone ○ Millis since epoch, UTC ○ Nanos since epoch, UTC ○ Millis since epoch, user local time ○ ... ○ ... ○ Float of seconds since epoch, as string. WTF?!? 46
  • 47. www.mapflat.com Mistake 10: Late governance Data governance for techies: Proactive things you do in order to avoid later WTF. 47 GDPR compliance ● Erasure ● Anonymisation & pseudonymisation ● Enforced retention ● Access limitations Security ● Blast radius ● Disaster recovery Development speed ● Technical processes ● Domain definitions ● Schema evolution ● End-to-end testing Data democracy ● Data lake rules ● Data quality communication Financial compliance ● Reproducibility ● Data completeness ● Data consistency
  • 48. www.mapflat.com Technology over principles Complexity over business value Premature scaling Premature streaming Late governance No workflow orchestration End-to-end agility Crunching in the dark Read all the dataNew tech, old patterns Functional principles Offline / online Simplicity Driven by value Scope down Hype xor profit Common themes 48
  • 49. www.mapflat.com Related work ● https://tinyurl.com/mapflat-10-stumble ● Avoiding big data anti-patterns - https://youtu.be/t63SF2UkD0A ● https://www.slideshare.net/JesseAnderson/the-five-dysfunctions-of-a-data-engineering-team ● https://tinyurl.com/mapflat-reading-list ● http://www.mapflat.com/presentations/ 49
  • 51. www.mapflat.com class FraudCandidate(SparkSubmitTask): date = DateParameter() def requires(self): return [OrderDaily(date=self.date), self._credibility()] def _credibility(self): max_age_days = 3 for lookback in range(max_age_days): candidate = CredibilityScore(date - timedelta(days=lookback)) if candidate.output().exists(): return candidate return CredibilityScore(date - timedelta(days=max_age_days)) Dynamic DAGs CredibilityScore FraudCandidate Order 51
  • 52. www.mapflat.com Luigi or Airflow? Luigi & Kubernetes: ● 1 hour to production 52 Airflow & Kubernetes: ● AIRFLOW-1314, created 2017-06-16 ○ 9/12 sub-tasks resolved ○ 6 PRs ○ 28 commits ○ ~30 files touched ○ ~2000 lines of code ● Airflow monitoring has value, however myjob.yaml: kind: CronJob spec: schedule: "17 * * * *" jobTemplate: spec: containers: - name: "MyJob" image: myregistry/mypipeline:latest command: ["luigi"] args: ["--module", "mypipeline", "RangeDaily", "--of", "MyJob", "--days-back", "14"]
  • 53. www.mapflat.com Counters ● User-defined ● Technical from framework ○ Execution time ○ Memory consumption ○ Data volumes ○ ... 53 case class Order(item: ItemId, userId: UserId) case class User(id: UserId, country: String) val orders = read(orderPath) val users = read(userPath) val orderNoUserCounter = longAccumulator("order-no-user") val joined: C[(Order, Option[User])] = orders .groupBy(_.userId) .leftJoin(users.groupBy(_.id)) .values val orderWithUser: C[(Order, User)] = joined .flatMap( orderUser match case (order, Some(user)) => Some((order, user)) case (order, None) => { orderNoUserCounter.add(1) None }) SQL: Nope