
Data engineering in 10 years.pdf


If we could only predict the future of the software industry, we could make better investments and decisions. We could waste fewer resources on technology and processes that we know will not last, or at least make a conscious choice when we pick solutions with a limited lifetime. It turns out that for data engineering, we can predict the future, because it has already happened. Not in our workplace, but at a few leading companies that are blazing ahead. It has also already happened in the neighbouring field of software engineering, which is two decades ahead of data engineering in process maturity. In this presentation, we will glimpse into the future of data engineering. Data engineering has gone from legacy data warehouses with stored procedures, to big data with Hadoop and data lakes, and on to a new form of modern data warehouses and low-code tools, also known as "the modern data stack". Where does it go from here? We will look at the points where data leaders differ from the crowd and combine them with observations on how software engineering has evolved, to see that both point towards a new, more industrialised form of data engineering: "data factory engineering".


  1. Data engineering in 10 years
     Lars Albertsson, Founder, Scling (www.scling.com), 2022-11-09
  2. Prediction of future? Opinion + belief
     Functional languages (Scala, Kotlin, …) are better suited for data processing than Python. I believe that they will be dominant in the future.
  4. How to predict the future?
     ● Promises
     ● Extrapolation
       ○ Leading to tipping points
     ● Patterns
       ○ Similar contexts ahead in the journey
     ● Future is unevenly divided
       ○ Some are already there
  5. Vintage digital disruption - MRP
     ● Material requirements planning
       ○ What materials are needed for manufacturing (this month)
       ○ Computerised in the 80s
       ○ Expensive manual monthly → automatically overnight
     ● MRP hype
       ○ People → software
       ○ … that is executed each month
     ● Cf. adoption today
       ○ Cloud
       ○ Agile
       ○ Data
       ○ ML
  7. Technology adoption
     Eliyahu M. Goldratt on adopting new technology: "Technology can bring benefits if, and only if, it diminishes a limitation."
     ● What is the power of the technology?
     ● What limitation does it diminish?
     ● What rules helped us accommodate the limitation?
     ● What rules should we use now?
     Future = new technology - old rules + new rules
     Callout: Primary cause of waste in data value creation
  8. New rules?
     ● Cf. steam factory → electricity
       ○ Without new rules → backlash
     ● Scoped out
       ○ Covered yesterday
  9. What is the power of data engineering?
     ● Feasible to store all (raw) data
     ● Cheap (re)computations
     ● Build more complex data processing flows
     ● Share data across teams with minimal operational risk
     ● Fast experiment iteration and feedback with minimal operational risk
     (Scoping out data science and machine learning.)
  10. Efficiency gap, data cost & value
     ● Data processing produces datasets
       ○ Each dataset has business value
     ● Proxy value/cost metric: datasets / day
       ○ S-M traditional: < 10
       ○ Bank, telecom, media: 100-1000
     Figure annotations:
     ● 2014: 6500 datasets / day
     ● 2016: 20000 datasets / day
     ● 2018: 100000+ datasets / day, 25% of staff use BigQuery
     ● 2021: 500B events collected / day
     ● 2016: 1 600 000 000 datasets / day
     Figure labels (axes: effort, value): Financial, reporting; Insights, data-fed features; Disruptive value of data, machine learning
  11. Data agility
     ● Siloed: 6+ months
     ● Autonomous: 1 month
     ● Coordinated: days
     Figure labels: Cultural work; Technical work; Data lake; Latency?
  12. Enabling innovation
     "The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
     "Discover Weekly wasn't a great strategic plan and 100 engineers. It was 3 engineers that decided to build something."
     "I would have killed it. All of a sudden, they shipped it. It’s one of the most loved product features that we have." - Daniel Ek, CEO
     Sources:
     ● https://youtu.be/A259Yo8hBRs
     ● https://youtu.be/ZcmJxli8WS8
     ● https://musically.com/2018/08/08/daniel-ek-would-have-killed-discover-weekly-before-launch/
  13. Manual, mechanised, industrialised
  14. IT craft to factory
     ● Security → DevSecOps
     ● Waterfall → Agile
     ● Application delivery → Containers
     ● Traditional operations → DevOps
     ● Traditional QA → CI/CD
     ● Infrastructure → Infrastructure as code
  15. Data factories
     ● Security → DevSecOps
     ● Waterfall → Agile
     ● Application delivery → Containers
     ● Traditional operations → DevOps
     ● Traditional QA → CI/CD
     ● Infrastructure → Infrastructure as code
     ● DB-oriented architecture → Data factories, data pipelines, DataOps
  16. Manual, mechanised, industrialised
     Figure labels: Data artifacts produced; 100x; 100x; Spotify's pipelines ~2013
  17. Crafted artifacts: data models
     ● Data (warehouse) models are carefully crafted
       ○ Built with hand-crafted SQL
       ○ Primitive automation
       ○ Reproducible?
     ● Require careful modelling to avoid trouble
       ○ E.g. slowly changing dimensions
       ○ Data vault, star schemas, satellites, …
     ● Pets, not cattle
  18. Artisanal vs industrialised data modelling
     Artisanal:
     ● Create single shared model artifact
     ● Used for many use cases
     ● Innovate fast model → use case
     Industrial:
     ● Create model for each use case
     ● Reuse code that produces model
     ● Each model may be unique
     ● Innovate fast raw → model → use case
  19. Premature modelling is waste
     ● Power: Recompute model quickly
     ● Lifted limitation: Expensive to compute model
     ● Old rule: Careful manual modelling work
     ● New rules: Guard rails preventing model iteration from breaking downstream
       ○ Code QA = testing
       ○ Code + data QA = monitoring
     Callout: Yes, on purpose!
  20. Artisanal vs industrialised knowledge graphs
     Artisanal:
     ● Create single shared graph
     ● Used for many use cases
     ● Innovate fast graph → use case
     Industrial:
     ● Create graph for each use case
     ● Reuse code that produces graph
     ● Each graph may be unique
     ● Innovate fast raw → graph → use case
  21. Artisanal vs industrialised machine learning models
     Google MLOps maturity model:
     ● MLOps level 0: Manual process
     ● MLOps level 1: ML pipeline automation
     ● MLOps level 2: CI/CD pipeline automation
     https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
  22. Road towards industrialisation
     ● LAMP stack age - manual analytics
     ● Data warehouse (DW) age - mechanised analytics
     ● Hadoop age - industrialised analytics, data-fed features, machine learning
     Significant change in workflows
     Early Hadoop:
     ● Weak indexing
     ● No transactions
     ● Weak security
     ● Batch transformations
  23. Simplifying use of new technology
     Figure labels: DW; Enterprise big data failures; "Modern data stack" - traditional workflows, new technology; Low-code, no-code
  24. We have seen this before
     Figure labels: Difficult adoption; 4GL, UML, low-code, no-code; Software engineering education
  25. Data engineering in the future
     Figure labels: DW; ~10 year capability gap; "data factory engineering"; Enterprise big data failures; "Modern data stack" - traditional workflows, new technology; 4GL / UML phase of data engineering; Data engineering education
  26. Future of low-code & no-code
     Low-code web creation works. Low-code application development does not. Low-code data?
  27. Future of low-code & no-code
     ● Static content (mostly); low complexity; simple QA
     ● User defines content; medium complexity; QA depends on user behaviour
     ● Inbound data + user defines content; high complexity; QA depends on user + data
  29. SQL for data processing
     ● SQL used in 3 distinct contexts
       ○ Interactive exploration
       ○ Backend data record retrieval
       ○ ETL data processing?
     Important data language features:
     ● Can express (complex) business logic
     ● Composability
     ● Reusability
     ● Testability
     ● Seamless integration with external logic
     ● Tools to guide towards good path
       ○ Type system
       ○ Inspection tools
     ● IDE experience
     ● Debuggability
     ● Data quality measurement support
     ● Data quality improvement support
     ● Learning curve
     https://threadreaderapp.com/thread/1353832649664692225.html
  30. SQL inadequate for mature applications
     ● SQL from scratch - things seem ok
     ● Porting a mature application
       ○ Cannot reasonably express logic
       ○ ~5x slower (Hive 1.x)
       ○ Give up quality metrics
     ● Data quality measurements
     ● Data quality improvement

         case class Order(item: ItemId, userId: UserId)
         case class User(id: UserId, country: String)

         val orders = read(orderPath)
         val users = read(userPath)
         // Counts orders that refer to a user missing from the user dataset.
         val orderNoUserCounter = longAccumulator("order-no-user")

         // Left join: every order is kept, the matching user is optional.
         val joined: C[(Order, Option[User])] = orders
           .groupBy(_.userId)
           .leftJoin(users.groupBy(_.id))
           .values

         // Keep matched pairs; count and drop orders without a user.
         val orderWithUser: C[(Order, User)] = joined
           .flatMap {
             case (order, Some(user)) => Some((order, user))
             case (order, None) =>
               orderNoUserCounter.add(1)
               None
           }
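The slide 30 code depends on framework helpers (`read`, the collection type `C`, `longAccumulator`) that the transcript does not define. As a minimal sketch of the same idea, assuming plain in-memory Scala collections instead of a distributed framework, and simplified `String` / `Long` identifiers in place of `ItemId` / `UserId`, the left join with a data quality counter could look like this. The names `JoinResult`, `OrderUserJoin` and `joinOrders` are hypothetical and do not come from the deck.

```scala
// Simplified stand-ins for the slide's types (assumption: plain String/Long ids).
final case class Order(item: String, userId: Long)
final case class User(id: Long, country: String)

// Joined records plus the data quality measurement (hypothetical result type).
final case class JoinResult(matched: List[(Order, User)], ordersWithoutUser: Long)

object OrderUserJoin {
  def joinOrders(orders: List[Order], users: List[User]): JoinResult = {
    val userById: Map[Long, User] = users.map(u => u.id -> u).toMap

    // Left join: every order is kept, paired with an optional user.
    val joined: List[(Order, Option[User])] =
      orders.map(order => (order, userById.get(order.userId)))

    // Keep only the orders that found a user.
    val matched = joined.collect { case (order, Some(user)) => (order, user) }

    // Measure how many orders were dropped, instead of losing them silently.
    val missing = joined.count { case (_, user) => user.isEmpty }

    JoinResult(matched, missing.toLong)
  }
}
```

The point of the sketch is that the count of unmatched orders is produced as part of the same pass as the join, which is the kind of data quality measurement the slide says is given up when such logic is ported to SQL.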
  31. Technology adoption & modern data stack
     ● New power: Build more complex data processing flows
     ● Old limitation: Brain capability to understand full flow
     ● Rules to mitigate limitation: Declarative & low code languages
     ● New rules: Software engineering / DevOps
  32. Data-centric innovation
     ● Need data from teams
       ○ willing?
       ○ backlog?
       ○ collected?
       ○ useful?
       ○ quality?
       ○ extraction?
       ○ data governance?
       ○ history?
  33. Big data - a collaboration paradigm
     Figure labels: Data platform; Stream storage?; Data lake; Data democratised
  34. Technology adoption & data lake collaboration
     ● New powers:
       ○ Share data across teams with minimal operational risk
       ○ Fast experiment iteration and feedback with minimal operational risk
     ● Old limitations: Operational risk. Governance risk. Political.
     ● Rules to mitigate limitation: Data isolated. Internal API = technical contract
     ● New rules:
       ○ DataOps - holistic QA
       ○ New governance mechanisms
  35. Data products / contracts = old rules, new context
     Figure labels: Data platform; Stream storage?; Data lake; Data contract; Data product
  36. Left is up
     Winston W Royce: "Managing the development of large software systems"
  37. Extreme programming
  38. Agile
  39. DevOps
  40. Vintage team contracts and products
     ● Rational Unified Process
     ● Strong separation between teams / developers
     ● Contracts at handoff points
     ● Maximum number of handoffs in a value stream
  41. Big data
  42. DataOps
  43. MLOps
     Figure label: DATA SCIENCE
  44. Which methodologies fade or prevail?
     ● Perpendicular to value stream
       ○ Barriers between people & teams
       ○ Extra non-value adding work
       ○ More handoffs
       ○ Homogeneous competence
     ● Waterfall
     ● RUP
     ● Data products / data mesh
     ● Data contracts
     ● Aligned along value stream
       ○ Few handoffs from raw to value
       ○ Enabled teams
       ○ Remove waste (in lean terms)
       ○ Heterogeneous competence
     ● Extreme programming / TDD
     ● Agile
     ● Big data
     ● DevOps
     ● DataOps
  46. Risk management by shifting left
     ● Manual governance
     ● Automated process
     ● DevOps:
       ○ Automated quality risk management
       ○ Quick feedback up in value stream
       ○ Left shifted QA risk management has improved both speed and quality
     ● DataOps:
       ○ Contracts are automated tests
       ○ Inter-system protocols are implementation details
       ○ New rule: Holistic QA
       ○ New governance
     ● DevSecOps
       ○ Security team approval
       ○ One-off vulnerability scans
       ○ Automated security rule validation
       ○ Feedback on change in vulnerabilities
     ● GovernanceOps?
       ○ Manual approval
       ○ Automated governance rule validation?
     ● ComplianceOps?
       ○ Manual one-off audits
       ○ Automated compliance inspections?
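The DataOps bullet "Contracts are automated tests" can be illustrated with a small check that runs against records instead of a manually negotiated document. This is only a sketch under stated assumptions: the record type `PageView`, its fields, and the three rules are hypothetical examples, not taken from the deck, and a real pipeline would run such checks inside its own test or monitoring framework.

```scala
// Hypothetical record type; the fields are illustrative assumptions.
final case class PageView(userId: String, country: String, timestampMillis: Long)

object PageViewContract {
  // Each rule returns None when the record conforms, or a violation message.
  private val rules: List[PageView => Option[String]] = List(
    v => if (v.userId.nonEmpty) None else Some("userId is empty"),
    v =>
      if (v.country.matches("[A-Z]{2}")) None
      else Some(s"country '${v.country}' is not a two-letter code"),
    v => if (v.timestampMillis > 0) None else Some("timestamp is not positive")
  )

  // All violations for one record; an empty list means the contract holds.
  def violations(view: PageView): List[String] =
    rules.flatMap(rule => rule(view))
}
```

Because the contract is ordinary code, it can run in CI against test fixtures and in production against live data, which is one way to read the slide's "holistic QA".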
  47. DevSecOps
     Figure label: SECURITY
  48. ComplianceOps
     Figure label: COMPLIANCE
  50. Wrapup
     ● The future is faster
       ○ Patterns from other disciplines
       ○ How do leaders work?
       ○ Rules that hold us back today
     ● Look at software engineering evolution
       ○ Industrialised process eliminates big design up front
       ○ Enabled, high code components
       ○ Stream-aligned teams
       ○ Shift left continues
     ● Change is difficult, takes years
       ○ Agile transformations
       ○ DevOps transformations
     ● Current methods ineffective
       ○ Organically grow competence
       ○ Buy stuff
       ○ Consultants
     ● Belief: new collaboration methods
  51. Scling - data-factory-as-a-service
     Data value through collaboration
     Figure labels: Customer; Data factory; Data platform & lake; data domain expertise; Value from data!; Rapid data innovation; Learning by doing, in collaboration
  52. Tech has massive impact on society
     Product? Supplier? Employer? Cloud?
     Make an active choice whether to have an impact!
