
www.scling.com
Data engineering in 10 years
Lars Albertsson, Founder, Scling
2022-11-09

Prediction of future?
Opinion + belief
Functional languages (Scala, Kotlin, …) are better
suited for data processing than Python. I believe that
they will be dominant in the future.

How to predict the future?
● Promises
● Extrapolation
○ Leading to tipping points
● Patterns
○ Similar contexts ahead in the journey
● The future is unevenly distributed
○ Some are already there

Vintage digital disruption - MRP
● Material requirements planning (MRP)
○ What materials are needed for manufacturing (this month)
○ Computerised in the 80s
○ Expensive manual monthly → automatically overnight
● MRP hype
○ People → software
○ … that is executed each month
● Cf. adoption today
○ Cloud
○ Agile
○ Data
○ ML

Technology adoption
Eliyahu M. Goldratt on adopting new technology:
"Technology can bring benefits if, and only if, it diminishes a limitation."
● What is the power of the technology?
● What limitation does it diminish?
● What rules helped us accommodate the limitation?
● What rules should we use now?
Future = new technology - old rules + new rules
Primary cause of waste in data value creation

New rules?
● Cf. steam factory → electricity
○ Without new rules → backlash
● Scoped out
○ Covered yesterday

What is the power of data engineering?
● Feasible to store all (raw) data
● Cheap (re)computations
● Build more complex data processing flows
● Share data across teams with minimal operational risk
● Fast experiment iteration and feedback with minimal operational risk
(Scoping out data science and machine learning.)

Efficiency gap, data cost & value
● Data processing produces datasets
○ Each dataset has business value
● Proxy value/cost metric: datasets / day
○ S-M traditional: < 10
○ Bank, telecom, media: 100-1000
2014: 6500 datasets / day
2016: 20000 datasets / day
2018: 100000+ datasets / day, 25% of staff use BigQuery
2021: 500B events collected / day
2016: 1 600 000 000 datasets / day
[Figure: value vs. effort; labels: Financial, reporting; Insights, data-fed features; Disruptive value of data, machine learning]

Data agility
Latency?
● Siloed: 6+ months
∆ Cultural work
● Autonomous: 1 month
∆ Technical work
● Coordinated: days
Data lake

Enabling innovation
"The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
"Discover Weekly wasn't a great strategic plan and 100 engineers. It was 3 engineers that decided to build something."
"I would have killed it. All of a sudden, they shipped it. It’s one of the most loved product features that we have." - Daniel Ek, CEO
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/

Manual, mechanised, industrialised

IT craft to factory
● Security → DevSecOps
● Waterfall → Agile
● Application delivery → Containers
● Traditional operations → DevOps
● Traditional QA → CI/CD
● Infrastructure → Infrastructure as code

Data factories
● Security → DevSecOps
● Waterfall → Agile
● Application delivery → Containers
● Traditional operations → DevOps
● Traditional QA → CI/CD
● Infrastructure → Infrastructure as code
● DB-oriented architecture → Data factories, data pipelines, DataOps

Manual, mechanised, industrialised
[Figure: data artifacts produced; labels: 100x, 100x; image: Spotify's pipelines ~2013]

Crafted artifacts: data models
● Data (warehouse) models are carefully crafted
○ Built with hand-crafted SQL
○ Primitive automation
○ Reproducible?
● Require careful modelling to avoid trouble
○ E.g. slowly changing dimensions
○ Data vault, star schemas, satellites, …
● Pets, not cattle

Artisanal vs industrialised data modelling
Artisanal:
● Create single shared model artifact
● Used for many use cases
● Innovate fast model → use case
Industrial:
● Create model for each use case
● Reuse code that produces model
● Each model may be unique
● Innovate fast raw → model → use case
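The industrial approach above can be sketched in Scala as follows. This is a minimal illustration, not a prescribed design: the `RawOrder` type, its fields, and the two use cases are assumptions invented for the example, and plain in-memory collections stand in for a real processing framework.

```scala
// Hypothetical raw record; the type and field names are assumptions.
final case class RawOrder(userId: String, country: String, amount: BigDecimal)

object UseCaseModels {
  // Reused code: one generic aggregation that every model builds on.
  def totalBy[K](raw: Seq[RawOrder])(key: RawOrder => K): Map[K, BigDecimal] =
    raw.groupMapReduce(key)(_.amount)(_ + _)

  // Model for use case 1 (reporting): revenue keyed by country.
  def revenueByCountry(raw: Seq[RawOrder]): Map[String, BigDecimal] =
    totalBy(raw)(_.country)

  // Model for use case 2 (customer analytics): spend keyed by user.
  def spendByUser(raw: Seq[RawOrder]): Map[String, BigDecimal] =
    totalBy(raw)(_.userId)
}
```

Each model is recomputed from raw data, so a new use case adds a small function instead of changing a shared model artifact.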

Premature modelling is waste
● Power: Recompute model quickly
● Lifted limitation: Expensive to compute model
● Old rule: Careful manual modelling work
● New rules: Guard rails preventing model iteration from breaking downstream
○ Code QA = testing
○ Code + data QA = monitoring
Yes, on purpose!
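One way to read the two guard rails is sketched below in Scala. Everything here is an assumption made for illustration: the `PageView` and `ModelRun` types, the rule that views without a user id are dropped, and the drop-ratio threshold. A test checks the code against fixed input before deployment, while a monitoring check runs against the output of every production run.

```scala
// Hypothetical record type; names are assumptions for illustration.
final case class PageView(userId: String, url: String)

// Output of one model computation, plus a data quality counter.
final case class ModelRun(viewsPerUser: Map[String, Int], droppedViews: Int)

object GuardRails {
  // The model code that is free to change between iterations.
  def computeModel(views: Seq[PageView]): ModelRun = {
    val (valid, invalid) = views.partition(_.userId.nonEmpty)
    ModelRun(valid.groupMapReduce(_.userId)(_ => 1)(_ + _), invalid.size)
  }

  // Code QA = testing: fixed input, expected output, run before deployment.
  def codeTestPasses: Boolean = {
    val input = Seq(PageView("a", "/x"), PageView("a", "/y"), PageView("", "/z"))
    computeModel(input) == ModelRun(Map("a" -> 2), droppedViews = 1)
  }

  // Code + data QA = monitoring: evaluated on every production run.
  def withinDropBudget(run: ModelRun, maxDropRatio: Double): Boolean = {
    val total = run.viewsPerUser.values.sum + run.droppedViews
    total == 0 || run.droppedViews.toDouble / total <= maxDropRatio
  }
}
```

With both checks in place, a model iteration that breaks the logic fails the test, and one that meets unexpected production data trips the monitoring check before downstream consumers are affected.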

Artisanal vs industrialised knowledge graphs
Artisanal:
● Create single shared graph
● Used for many use cases
● Innovate fast graph → use case
Industrial:
● Create graph for each use case
● Reuse code that produces graph
● Each graph may be unique
● Innovate fast raw → graph → use case

Artisanal vs industrialised machine learning models
Google MLOps maturity model:
● MLOps level 0: Manual process
● MLOps level 1: ML pipeline automation
● MLOps level 2: CI/CD pipeline automation
https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Road towards industrialisation
● LAMP stack age - manual analytics
● Data warehouse (DW) age - mechanised analytics
● Hadoop age - industrialised analytics, data-fed features, machine learning
Significant change in workflows
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations

Simplifying use of new technology
DW
Enterprise big data failures
"Modern data stack" -
traditional workflows, new technology
Low-code, no-code

We have seen this before
Difficult adoption
4GL, UML, low-code, no-code
Software engineering education

Data engineering in the future
DW
~10 year capability gap
"data factory engineering"
Enterprise big data failures
"Modern data stack" -
traditional workflows, new technology
4GL / UML phase of data engineering
Data engineering education

Future of low-code & no-code
Low-code web creation works.
Low-code application development does not.
Low-code data?

Future of low-code & no-code
Low-code web creation:
● Static content (mostly)
● Low complexity
● Simple QA
Low-code application development:
● User defines content
● Medium complexity
● QA depends on user behaviour
Low-code data:
● Inbound data + user defines content
● High complexity
● QA depends on user + data

SQL for data processing
● SQL used in 3 distinct contexts
○ Interactive exploration
○ Backend data record retrieval
○ ETL data processing?
Important data language features:
● Can express (complex) business logic
● Composability
● Reusability
● Testability
● Seamless integration with external logic
● Tools to guide towards good path
○ Type system
○ Inspection tools
● IDE experience
● Debuggability
● Data quality measurement support
● Data quality improvement support
● Learning curve
https://threadreaderapp.com/thread/1353832649664692225.html

SQL inadequate for mature applications
● SQL from scratch - things seem ok
● Porting a mature application
○ Cannot reasonably express logic
○ ~5x slower (Hive 1.x)
○ Give up quality metrics
● Data quality measurements
● Data quality improvement
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)

// read, C and longAccumulator are assumed to come from the processing framework.
val orders = read(orderPath)
val users = read(userPath)

// Data quality measurement: counts orders without a matching user.
val orderNoUserCounter = longAccumulator("order-no-user")

val joined: C[(Order, Option[User])] = orders
  .groupBy(_.userId)
  .leftJoin(users.groupBy(_.id))
  .values

// Data quality improvement: keep only orders with a known user.
val orderWithUser: C[(Order, User)] = joined
  .flatMap {
    case (order, Some(user)) => Some((order, user))
    case (order, None) =>
      orderNoUserCounter.add(1)
      None
  }

Technology adoption & modern data stack
● New power:
Build more complex data processing flows
● Old limitation:
Brain capability to understand full flow
● Rules to mitigate limitation:
Declarative & low code languages
● New rules:
Software engineering / DevOps

Data-centric innovation
● Need data from teams
○ willing?
○ backlog?
○ collected?
○ useful?
○ quality?
○ extraction?
○ data governance?
○ history?

Big data - a collaboration paradigm
[Figure: Data platform; labels: Stream storage?, Data lake, Data democratised]

Technology adoption & data lake collaboration
● New powers:
Share data across teams with minimal operational risk
Fast experiment iteration and feedback with minimal operational risk
● Old limitations:
Operational risk. Governance risk. Political.
● Rules to mitigate limitation:
Data isolated.
Internal API = technical contract
● New rules:
DataOps - holistic QA
New governance mechanisms

Data products / contracts = old rules, new context
[Figure: Data platform; labels: Stream storage?, Data lake, Data contract, Data product]

Left is up
Winston W Royce: "Managing the development of large software systems"

Extreme programming

Vintage team contracts and products
● Rational Unified Process
● Strong separation between
teams / developers
● Contracts at handoff points
● Maximum number of handoffs
in a value stream

MLOps
[Figure label: DATA SCIENCE]

Which methodologies fade or prevail?
● Perpendicular to value stream
○ Barriers between people & teams
○ Extra non-value adding work
○ More handoffs
○ Homogeneous competence
● Waterfall
● RUP
● Data products / data mesh
● Data contracts
● Aligned along value stream
○ Few handoffs from raw to value
○ Enabled teams
○ Remove waste (in lean terms)
○ Heterogeneous competence
● Extreme programming / TDD
● Agile
● Big data
● DevOps
● DataOps

Risk management by shifting left
● Manual governance
● Automated process
● DevOps:
○ Automated quality risk management
○ Quick feedback up in value stream
○ Left shifted QA risk management has
improved both speed and quality
● DataOps:
○ Contracts are automated tests
○ Inter-system protocols are
implementation details
○ New rule: Holistic QA
○ New governance
● DevSecOps
○ Security team approval
○ One-off vulnerability scans
○ Automated security rule validation
○ Feedback on change in vulnerabilities
● GovernanceOps?
○ Manual approval
○ Automated governance rule validation?
● ComplianceOps?
○ Manual one-off audits
○ Automated compliance inspections?
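"Contracts are automated tests" could look roughly like the Scala sketch below. It is an illustration under stated assumptions, not an established contract format: the `Payment` record, the three rules, and the allowed currency set are all hypothetical. The expectations between producing and consuming teams are executable rules that run against a dataset, for instance in CI or as a pipeline step.

```scala
// Hypothetical record shared between two teams; fields are assumptions.
final case class Payment(id: String, amountCents: Long, currency: String)

object PaymentContract {
  // A rule is a human-readable name plus a predicate over the dataset.
  type Rule = (String, Seq[Payment] => Boolean)

  private val allowedCurrencies = Set("SEK", "EUR", "USD")

  val rules: Seq[Rule] = Seq(
    ("ids are unique", (ps: Seq[Payment]) => ps.map(_.id).distinct.size == ps.size),
    ("amounts are positive", (ps: Seq[Payment]) => ps.forall(_.amountCents > 0)),
    ("currencies are known", (ps: Seq[Payment]) => ps.forall(p => allowedCurrencies(p.currency)))
  )

  // Returns the names of violated rules; empty means the contract holds.
  def violations(payments: Seq[Payment]): Seq[String] =
    rules.collect { case (name, holds) if !holds(payments) => name }
}
```

A pipeline step could fail the build or raise an alert when `violations` is non-empty, which gives quick feedback early in the value stream instead of relying on a manually negotiated document.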

DevSecOps
[Figure label: SECURITY]

ComplianceOps
[Figure label: COMPLIANCE]

Wrapup
● The future is faster
○ Patterns from other disciplines
○ How do leaders work?
○ Rules that hold us back today
● Look at software engineering evolution
○ Industrialised process eliminates
big design up front
○ Enabled, high code components
○ Stream-aligned teams
○ Shift left continues
● Change is difficult, takes years
○ Agile transformations
○ DevOps transformations
● Current methods ineffective
○ Organically grow competence
○ Buy stuff
○ Consultants
● Belief: new collaboration methods

Scling - data-factory-as-a-service
Data value through collaboration
[Figure labels: Customer; Data factory; Data platform & lake; data; domain expertise; Value from data!; Rapid data innovation; Learning by doing, in collaboration]

Tech has massive impact on society
● Product?
● Supplier?
● Employer?
● Cloud?
Make an active choice whether to have an impact!
