Introduction to Data Engineering
(with Scala)
John Nestor 47 Degrees
www.47deg.com
June 27, 2016
Galvanize
47deg.com © Copyright 2015 47 Degrees
Outline
• Introduction
• Data Engineering Requirements
• Data Engineering Design Patterns
• Recommended Data Engineering Tools and Systems
• Final Thoughts
Introduction
Typical Data Engineering Systems
• Low latency response to HTTP or REST requests
• Database reads and writes
• Run ML models
• Produce event streams for later processing
• Near real time event processing
• Simple analytics and alerts
• Analysis of server information
• Logs and metrics
• Produce data for later analysis by data scientists
Big Data
• (Much) Too big to fit on a single machine
• Must have both
• distributed computation
• distributed data (bases)
• A distributed system has no single main memory
• Must pass data across servers
• Large number of distributed components means failure
is common
• Dealing with failure must be part of the fundamental
architecture
Fallacies of Distributed Computing
• Peter Deutsch, https://blogs.oracle.com/jag/resource/Fallacies.html
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn’t change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
Reactive Manifesto
• http://www.reactivemanifesto.org/
• Responsive - predictable latency
• Resilient - fault tolerant
• Elastic - (auto) scalability
• Message driven - basis of a distributed implementation
Data Engineering
Requirements
Scalability
• New systems are getting bigger all the time
• Hardware is getting cheaper
• Business requirements to stay competitive are
increasing
• Cloud computing permits easy expansion based on
instantaneous need
• No single server is ever big enough
• Scalability goal: performance increases (close to)
linearly with the number of servers
Availability
• Systems are increasingly expected to be available 24/7
with no downtime
• Any server can fail; others must be able to take over
• No downtime for maintenance. Software upgrades
occur without shutting the system down.
• Must avoid availability-killing features such as 2-phase
commit
• SLAs: number of nines
• The best most achieve is 3 nines (8.8 hours of downtime per year)
• Most strive for 6 nines (about 32 seconds per year)
• AWS S3 claims 9 nines (32 msec per year)
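The downtime figures for a given number of nines follow from simple arithmetic. The sketch below is illustrative only: the object name `Nines` and the assumption of a 365-day year are not from the talk.

```scala
object Nines {
  // Assumption: a non-leap year of 365 days.
  val SecondsPerYear: Double = 365.0 * 24 * 60 * 60

  /** Maximum downtime per year, in seconds, for an availability of
    * `n` nines (n = 3 means 99.9% available). */
  def downtimeSecondsPerYear(n: Int): Double =
    SecondsPerYear * math.pow(10.0, -n)
}
```

For example, `Nines.downtimeSecondsPerYear(3) / 3600` is about 8.76 hours, and `Nines.downtimeSecondsPerYear(9) * 1000` is about 31.5 msec.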
Durability
• Losing data is never acceptable
• Since any single point can fail, we must replicate data
• Replication to
• main memory
• different server
• server in different zone
• across geo-distributed data centers
• AWS S3 will lose at most one object out of 32K objects
every 10 million years
Latency and Bandwidth
• Latency - msec to process a single request
• More hops can increase latency
• Very fast network hardware can reduce latency
• The speed of light still sets the lower bound on latency
• Bandwidth - number of requests processed per sec
• More servers can increase bandwidth
• Latency Numbers Every Programmer Should Know
• main memory (0.0001 msec)
• different server (0.5 msec)
• across geo-distributed data centers (150 msec)
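As a rough check on the geo-distributed figure, the speed of light gives a floor on round-trip time. This is a back-of-the-envelope sketch: `LatencyBound` is a hypothetical name, and it uses the vacuum speed of light, so real links (fiber, routing, queuing) will be slower.

```scala
object LatencyBound {
  // Speed of light in vacuum, km per second.
  val SpeedOfLightKmPerSec: Double = 299792.458

  /** Lower bound, in msec, on a round trip between two sites
    * that are `distanceKm` apart (one way). */
  def minRoundTripMsec(distanceKm: Double): Double =
    2 * distanceKm / SpeedOfLightKmPerSec * 1000
}
```

For two data centers 10,000 km apart, `LatencyBound.minRoundTripMsec(10000)` is about 66.7 msec, so no hardware improvement can bring a round trip below that.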
Data Engineering
Design Patterns
Immutable Data
• Concurrent access to mutable data requires
synchronization. Immutable data does not.
• Data passed between servers will be immutable
• Immutable data plus functional programming results in
code that is easier to understand and test
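A minimal Scala sketch of the idea, with hypothetical names (`Reading`, `recalibrate`): a case class is immutable by default, and an "update" produces a new value rather than changing the old one.

```scala
object ImmutableExample {
  // Case class fields are vals, so a Reading cannot change after creation.
  final case class Reading(sensor: String, value: Double)

  // "Updating" returns a new Reading; the original is untouched,
  // so it can be shared between threads without synchronization.
  def recalibrate(r: Reading, offset: Double): Reading =
    r.copy(value = r.value + offset)
}
```

Because `recalibrate` is a pure function of its inputs, it can be tested with plain equality checks and no setup.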
Messaging (1 of 2)
• Message sent from A to B
• A gets ack from B
• A gets no ack from B
• message never got to B
• ack from B never got to A
• What kind?
• at most once (never resend)
• at least once (resend if no ack)
• exactly once (resend idempotently if no ack)
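The at-least-once case can be sketched as a resend loop. This is an illustration under stated assumptions: `AtLeastOnce.deliver` and its `send` callback (which returns true when an ack arrives) are hypothetical, and timeouts and backoff are left out.

```scala
object AtLeastOnce {
  /** Sends `msg` via `send`, which returns true when an ack came back.
    * Resends until acked or until `maxAttempts` sends have been made.
    * Returns the number of sends used, or None if no ack ever arrived. */
  @annotation.tailrec
  def deliver[A](msg: A, maxAttempts: Int, attempt: Int = 1)(
      send: A => Boolean): Option[Int] =
    if (attempt > maxAttempts) None
    else if (send(msg)) Some(attempt)
    else deliver(msg, maxAttempts, attempt + 1)(send)
}
```

Note that when only the ack is lost, B receives the same message more than once. That is why exactly-once behavior needs an idempotent receiver.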
Messaging (2 of 2)
• Idempotence
• Multiple sends have the same effect as a single send
• For example, "set X to 3" is idempotent; "add 2 to X" is not
• Attach GUID, destination must handle
• In-order delivery
• Waiting for an ack before sending next increases
latency
• Attach sequence number, destination must handle
• Batching multiple messages together can help
• Design so order does not matter
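One way the destination might handle an attached GUID is to remember which ids it has already applied. The names below (`IdempotentReceiver`, `Message`, `State`) are hypothetical, and a real receiver would also need to bound or expire the set of seen ids.

```scala
import java.util.UUID

object IdempotentReceiver {
  final case class Message(id: UUID, amount: Int)

  // Immutable receiver state: the running total plus the ids already applied.
  final case class State(total: Int = 0, seen: Set[UUID] = Set.empty)

  /** Applies a message at most once per id; a duplicate leaves the state unchanged. */
  def receive(state: State, msg: Message): State =
    if (state.seen.contains(msg.id)) state
    else State(state.total + msg.amount, state.seen + msg.id)
}
```

With this in place, even an "add 2" message becomes safe to resend, because the second copy is recognized and ignored.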
Persistent Data (1 of 3)
• CAP theorem (pick 2)
• Consistency (ACID)
• Availability
• Partition tolerance (closely tied to fault tolerance)
• Distributed consistency solutions: 2-phase commit is
“the anti-availability protocol” (Helland)
• For very large highly available systems, AP is the only
possible choice
Persistent Data (2 of 3)
• Detecting conflicts with Vector clocks
• Each server has own time
• Vector has one element for each server
• Forms a partial order
• Resolving conflicts (for example: 2 different phone numbers)
• Select the latest
• Ask someone
• Keep both
• CRDTs (generalization of keep both)
• conflict-free replicated data types
• merge must be commutative, associative, idempotent
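Both ideas can be sketched in a few lines. The code is illustrative, not from the talk: `VectorClocks` and `GCounter` are hypothetical names, and the grow-only counter is chosen here only as a well-known small CRDT example.

```scala
object VectorClocks {
  type Clock = Map[String, Long] // server id -> that server's logical time

  sealed trait Order
  case object Before extends Order
  case object After extends Order
  case object Equal extends Order
  case object Concurrent extends Order // neither happened first: a conflict

  /** Compares two clocks element by element; the result is a partial order. */
  def compare(a: Clock, b: Clock): Order = {
    val servers = a.keySet ++ b.keySet
    val aLeB = servers.forall(s => a.getOrElse(s, 0L) <= b.getOrElse(s, 0L))
    val bLeA = servers.forall(s => b.getOrElse(s, 0L) <= a.getOrElse(s, 0L))
    (aLeB, bLeA) match {
      case (true, true)   => Equal
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }
}

object GCounter {
  // Grow-only counter: each server increments only its own slot.
  type Counter = Map[String, Long]

  def increment(c: Counter, server: String): Counter =
    c.updated(server, c.getOrElse(server, 0L) + 1)

  // Merge takes the per-server maximum, which is commutative,
  // associative, and idempotent.
  def merge(a: Counter, b: Counter): Counter =
    (a.keySet ++ b.keySet)
      .map(s => s -> math.max(a.getOrElse(s, 0L), b.getOrElse(s, 0L)))
      .toMap

  def value(c: Counter): Long = c.values.sum
}
```

Two clocks that compare as `Concurrent` signal a conflict that must be resolved. With a CRDT such as the counter, replicas can merge in any order, any number of times, and still converge.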
Persistent Data (3 of 3)
• Log based stores
• Sequence of transformational steps
• Each step is immutable
• Log is append only (fast sequential write to disk)
• Database is a cache of some point in the log
• Log is primary
• Database can be deleted and recreated from log
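A small in-memory sketch of a log-based store, using hypothetical names (`LogStore`, `Put`, `Delete`): the log is the source of truth and the "database" is whatever results from replaying it.

```scala
object LogStore {
  // Each log entry is an immutable transformation step.
  sealed trait Event
  final case class Put(key: String, value: String) extends Event
  final case class Delete(key: String) extends Event

  // The log is append-only: appending returns a new, longer log.
  def append(log: Vector[Event], e: Event): Vector[Event] = log :+ e

  // The "database" is a view obtained by replaying the log from the start.
  def replay(log: Vector[Event]): Map[String, String] =
    log.foldLeft(Map.empty[String, String]) {
      case (db, Put(k, v)) => db.updated(k, v)
      case (db, Delete(k)) => db - k
    }
}
```

Replaying a prefix of the log, for example `LogStore.replay(log.take(n))`, reconstructs the database as it was at that point, which is what makes the database safe to delete and rebuild.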
Concurrency and Distribution
• Individual servers are getting ever more cores.
• Utilization is key
• Large data applications require multiple servers
• Connections between servers are frequent points of
failure
• Parallel data operations help: parallel collections, Spark
• Traditional synchronization (locks, monitors) is error
prone and very hard to get right.
• Message-based systems (Hoare’s CSP, Hewitt’s actors)
are a better solution and work well across servers.
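The message-based style can be illustrated without any actor library. The sketch below is a hand-rolled mailbox, not Akka and not the talk's code: `MailboxCounter` and its message types are hypothetical, and supervision and error handling are omitted.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.Promise

object MailboxCounter {
  sealed trait Msg
  case object Increment extends Msg
  final case class Get(reply: Promise[Int]) extends Msg
  case object Stop extends Msg

  final class Counter {
    private val mailbox = new LinkedBlockingQueue[Msg]()

    // Only this worker thread ever touches `count`, so no lock is needed.
    private val worker = new Thread(() => {
      var count = 0
      var running = true
      while (running) {
        mailbox.take() match {
          case Increment  => count += 1
          case Get(reply) => reply.success(count); ()
          case Stop       => running = false
        }
      }
    })
    worker.setDaemon(true)
    worker.start()

    /** Any thread may send; messages are processed one at a time, in order. */
    def send(m: Msg): Unit = mailbox.put(m)
  }
}
```

Many threads can call `send` concurrently without coordinating with each other, because the mutable state is confined to the single thread that drains the mailbox.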
Logging and Monitoring
• As systems involve more and more servers
• Detecting and locating failure is getting harder
• Understanding system performance and performance
tuning is getting harder
• We now produce massive amounts of logs and
monitoring data
• Making sense of this huge volume of data is hard
• For failures we need near real-time analysis
• Increasing need for data science solutions
Continuous Deployment (1 of 2)
• High availability means we can no longer shut down for
• Application code upgrades
• Operating system upgrades and patches
• Hardware maintenance
• Automatic server failover
• Rolling upgrades
• Backward compatibility
• Messages
• Database schemas
Continuous Deployment (2 of 2)
• Deployment of lots of small changes reduces the chance of
errors in any single deployment
• Requires comprehensive automation for testing and
deployment
• But errors still do occur
• Although we have good methods for testing individual
components, integration testing is still hard and error prone.
• Some approaches
• Roll back
• A/B testing
• Database checkpoints
Recommended Data
Engineering Tools and
Systems
Choices
• Open source preferred
• Personal favorites
• Widely used (best practices in leading companies)
Prefer Open Source
• “Free”
• Full source is available
• Community participation
• Can move very fast
• More responsive
• A plus if a commercial company provides
support
Programming Languages (1 of 3)
• Compiled versus interpreted
• Compiled: C, C++, Go
• Semi-compiled: Java, C#, Scala
• Interpreted: Python, Ruby, R
• Static versus dynamic type checking
• Static typing catches more errors at compile time
• Statically typed code is easier to understand and maintain
• Static typing requires more work when writing code
• Garbage collection: safety versus performance
Programming Languages (2 of 3)
• Choice of language does not matter
• I can write any algorithm in any language
• Let’s avoid pointless “language religion” wars
• Choice of language matters a lot
• Language can have a big impact on performance,
productivity and reliability
• Programming languages shape the way we think
Programming Languages (3 of 3)
• Scala
• Semi-compiled: compiled to JVM bytecode, then JIT-compiled at run time.
• Statically typed, but with the concise syntax of a dynamically typed language
• Garbage collected
• Runs on JVM. Full ecosystem of libraries and tools available.
• Key features
• Functional plus immutable data (major advance in program quality)
• Scala Futures and Akka Actors (major advance in easy to
understand, easy to get correct, and fault-tolerant distributed
computation)
• Main language for Spark
• Suitable for both data engineers and data scientists (better
cooperation)
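As a small taste of Scala Futures, the sketch below splits a computation into chunks that run concurrently on the default thread pool. The names (`ParallelSum`, `sumBlocking`) are hypothetical; `chunkSize` is assumed to be positive, and the blocking helper exists only to make the example easy to run.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSum {
  /** Splits `xs` into chunks, sums each chunk in its own Future,
    * then combines the partial sums. Requires chunkSize > 0. */
  def sum(xs: Vector[Long], chunkSize: Int): Future[Long] = {
    val partials: Vector[Future[Long]] =
      xs.grouped(chunkSize).map(chunk => Future(chunk.sum)).toVector
    Future.sequence(partials).map(_.sum)
  }

  // Blocking is for demonstration only; real code would keep composing Futures.
  def sumBlocking(xs: Vector[Long], chunkSize: Int): Long =
    Await.result(sum(xs, chunkSize), 10.seconds)
}
```

For example, `ParallelSum.sumBlocking((1L to 100L).toVector, 10)` returns 5050. The chunks hold immutable data, so no locks are involved.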
Messaging
• Kafka (written in Scala)
• Reliable buffer between producer and consumer
• Can replay
• Multiple producers and consumers
• Multiple topics
• Linearly scalable
• Kafka Streams
• Other
• Reactive streams
• Spark streaming
Databases
• Relational: Postgres (scaling can be a problem)
• Embedded: LevelDB, MapDB
• NoSQL: Cassandra, Couchbase
• Graph: Neo4j, Titan, DataStax Enterprise Graph
Analytics
• Hadoop (let it die!)
• Spark (Written in Scala, Scala API is best)
• Trend toward SQL
• Improved performance via query optimizer
• Widely understood (but poor?) programming model
• Somewhat abandoned functional programming
(RDDs)
• dataset transforms: experiment to combine functional
programming with support for query optimization
Data Center Infrastructure and Continuous Deployment
• GitHub, SBT, Artifactory, Jenkins
• Docker/Rkt, Etcd, CoreOS
• Mesos, Kubernetes
• Cloud: AWS, Google, Microsoft
Final Thoughts
• Scala is the best choice for both data engineers and
data scientists
• Spark is the best choice for data analysis
• Data will continue to grow in size and importance
• The number of servers we use will continue to grow
requiring better fault tolerance and better automation
• When data engineers and data scientists work closely
together both benefit and better results are achieved
• We need to break down traditional silos
• We need shared tools and technologies that work
well for both groups
Questions
36

Más contenido relacionado

Destacado

Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick GuideAsim Jalis
 
11 Hard to Ignore Data Analytics Quotes
11 Hard to Ignore Data Analytics Quotes11 Hard to Ignore Data Analytics Quotes
11 Hard to Ignore Data Analytics QuotesCloudlytics
 
Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineeringnathanmarz
 
자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.
자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.
자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.Yongho Ha
 
Big Data: The 6 Key Skills Every Business Needs
Big Data: The 6 Key Skills Every Business NeedsBig Data: The 6 Key Skills Every Business Needs
Big Data: The 6 Key Skills Every Business NeedsBernard Marr
 
Big Data: The 4 Layers Everyone Must Know
Big Data: The 4 Layers Everyone Must KnowBig Data: The 4 Layers Everyone Must Know
Big Data: The 4 Layers Everyone Must KnowBernard Marr
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBernard Marr
 

Destacado (10)

Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Big data road map
Big data road mapBig data road map
Big data road map
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
 
11 Hard to Ignore Data Analytics Quotes
11 Hard to Ignore Data Analytics Quotes11 Hard to Ignore Data Analytics Quotes
11 Hard to Ignore Data Analytics Quotes
 
Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineering
 
자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.
자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.
자습해도 모르겠던 딥러닝, 머리속에 인스톨 시켜드립니다.
 
Big Data: The 6 Key Skills Every Business Needs
Big Data: The 6 Key Skills Every Business NeedsBig Data: The 6 Key Skills Every Business Needs
Big Data: The 6 Key Skills Every Business Needs
 
Big Data: The 4 Layers Everyone Must Know
Big Data: The 4 Layers Everyone Must KnowBig Data: The 4 Layers Everyone Must Know
Big Data: The 4 Layers Everyone Must Know
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 

Más de John Nestor

LambdaFlow: Scala Functional Message Processing
LambdaFlow: Scala Functional Message Processing LambdaFlow: Scala Functional Message Processing
LambdaFlow: Scala Functional Message Processing John Nestor
 
Type Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsType Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsJohn Nestor
 
Logging in Scala
Logging in ScalaLogging in Scala
Logging in ScalaJohn Nestor
 
Messaging patterns
Messaging patternsMessaging patterns
Messaging patternsJohn Nestor
 
Experience Converting from Ruby to Scala
Experience Converting from Ruby to ScalaExperience Converting from Ruby to Scala
Experience Converting from Ruby to ScalaJohn Nestor
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataJohn Nestor
 
Scala Json Features and Performance
Scala Json Features and PerformanceScala Json Features and Performance
Scala Json Features and PerformanceJohn Nestor
 

Más de John Nestor (9)

LambdaFlow: Scala Functional Message Processing
LambdaFlow: Scala Functional Message Processing LambdaFlow: Scala Functional Message Processing
LambdaFlow: Scala Functional Message Processing
 
LambdaTest
LambdaTestLambdaTest
LambdaTest
 
Type Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsType Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset Transforms
 
Logging in Scala
Logging in ScalaLogging in Scala
Logging in Scala
 
Messaging patterns
Messaging patternsMessaging patterns
Messaging patterns
 
Experience Converting from Ruby to Scala
Experience Converting from Ruby to ScalaExperience Converting from Ruby to Scala
Experience Converting from Ruby to Scala
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
 
Scala Json Features and Performance
Scala Json Features and PerformanceScala Json Features and Performance
Scala Json Features and Performance
 
Neutronium
NeutroniumNeutronium
Neutronium
 

Último

Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 

Último (20)

Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 

Introduction to Data Engineering (with Scala)

  • 1. Introduction to Data Engineering (with Scala) John Nestor 47 Degrees www.47deg.com June 27, 2016 Galvanize 147deg.com
  • 2. 47deg.com © Copyright 2015 47 Degrees Outline • Introduction • Data Engineering Requirements • Data Engineering Design Patterns • Recommended Data Engineering Tools and Systems • Final Thoughts 2
  • 4. 47deg.com © Copyright 2015 47 Degrees Typical Data Engineering Systems • Low latency response to HTTP or REST requests • Database reads and writes • Run ML models • Produce event streams for later processing • Near real time event processing • Simple analytics and alerts • Analysis of server information • Logs and metrics • Produce data for later analysis by data scientists 4
  • 5. 47deg.com © Copyright 2015 47 Degrees Big Data • (Much) Too big to fit on a single machine • Must have both • distributed computation • distributed data (bases) • Distributed systems means no single main memory • Must pass data across servers • Large number of distributed components means failure is common • Dealing with failure must be part of the fundamental architecture 5
  • 6. 47deg.com © Copyright 2015 47 Degrees • https://blogs.oracle.com/jag/resource/Fallacies.html Peter Deutsch • The network is reliable • Latency is zero • Bandwidth is infinite • The network is secure • Topology doesn’t change • There is one administrator • Transport cost is zero • The network is homogeneous 6 Fallacies of Distributed Computing
  • 7. 47deg.com © Copyright 2015 47 Degrees Reactive Manifesto • http://www.reactivemanifesto.org/ • Responsive - predictable latency • Resilient - fault tolerant • Elastic - (auto) scalability • Message driven - basis of a distributed implementation 7
  • 9. 47deg.com © Copyright 2015 47 Degrees Scalability • New systems are getting bigger all the time • Hardware is getting cheaper • Business requirements to stay competitive are increasing • Cloud computing permits easy expansion based on instantaneous need • No single server is ever big enough • Scalability goal: performance increases (close to) linearly with the number of servers 9
  • 10. 47deg.com © Copyright 2015 47 Degrees Availability • Systems are increasingly expected to be available 24/7 with no downtime • Any server can fail, others must be able to take over • No downtime for maintenance. Software upgrades occur without shutting system down. • Must avoid availability killing features such a 2 phase commit • SLA’s # of nine’s • The best most achieve is 3 nines (8.8 hours per year) • Most strive for 6 nines (30 minutes per year) • AWS S3 claims 9 nines (32 msec per year) 10
  • 11. 47deg.com © Copyright 2015 47 Degrees Durability • Loosing data is never acceptable • Since any single point can fail, we must replicate data • Replication to • main memory • different server • server in different zone • across geo-distributed data centers • AWS S3 will loose at most one object out of 32K objects every 10 million years 11
  • 12. 47deg.com © Copyright 2015 47 Degrees Latency and Bandwidth • Latency - msec to process a single request • More hops can increase latency • Very fast network hardware can reduce latency • Speed of light is still the upper bound • Bandwidth - number of requests processed per sec • More servers can increase bandwidth • Latency Numbers Every Programmer Should Know • main memory (0.0001 msec) • different server (0.5 msec) • across geo-distributed data centers (150 msec) 12
  • 14. 47deg.com © Copyright 2015 47 Degrees Immutable Data • Concurrent access to mutable data requires synchronization. Immutable data does not. • Data passed between servers will be immutable • Immutable data plus functional programming results in code that is easier to understand and test 14
  • 15. 47deg.com © Copyright 2015 47 Degrees Messaging (1 of 2) • Message sent from A to B • A gets ack from B • A gets no ack from B • message never got to B • ack from B never got to A • What kind? • at most once (never resend) • at least once (resend if no ack) • exactly once (resend idempotently if no ack) 15
  • 16. 47deg.com © Copyright 2015 47 Degrees Messaging (2 of 2) • Idempotence • Multiple sends have same effect • set X to 3, NOT add 2 to X • Attach GUID, destination must handle • In order delivery • Waiting for an ack before sending next increases latency • Attach sequence number, destination must handle • Batching multiple messages together can help • Design so order does not matter 16
  • 17. 47deg.com © Copyright 2015 47 Degrees Persistent Data (1 of 3) • CAP theorem (pick 2) • Consistency (ACID) • Availability • Partition tolerance (closely tied to fault tolerance) • Distributed consistency solutions: 2-phase commit is “the anti-availability protocol” (Helland) • For very large highly available systems, AP is only possible choice 17
  • 18. 47deg.com © Copyright 2015 47 Degrees Persistent Data (2 of 3) • Detecting conflicts with Vector clocks • Each server has own time • Vector has one element for each server • Forms a partial order • Resolving conflicts (for example: 2 different phone numbers) • Select the latest • Ask someone • Keep both • CRDTs (generalization of keep both) • conflict free replicated data sets • merge must be commutative, associative, idempotent 18
  • 19. 47deg.com © Copyright 2015 47 Degrees Persistent Data (3 of 3) • Log based stores • Sequence of transformational steps • Each step is immutable • Log is append only (fast sequential write to disk) • Database is a cache of some point in the log • Log is primary • Database can be deleted and recreated from log 19
  • 20. 47deg.com © Copyright 2015 47 Degrees Concurrency and Distribution • Individual servers are getting ever more cores. • Utilization is key • Large data applications require multiple servers • Connections between servers are frequent points of failure • Parallel data operations help: parallel collections, Spark • Traditional synchronization (locks, monitors) are error prone and very hard to get right. • Message bases systems (Hoare’s CSP, Hewitt’s actors) are a better solution and work well across servers. 20
  • 21. 47deg.com © Copyright 2015 47 Degrees Logging and Monitoring • As systems involve more and more servers • Detecting and locating failure is getting harder • Understanding system performance and performance tuning is getting harder • We now produce massive amounts of logs and monitoring data • Making sense of this huge volume of data is hard • For failures we need near real-time analysis • Increasing need for data science solutions 21
  • 22. Continuous Deployment (1 of 2) • High availability means we can no longer shut down for upgrades to • Application code • Operating system upgrades and patches • Hardware maintenance • Automatic server failover • Rolling upgrades • Backward compatibility • Messages • Database schemas
  • 23. Continuous Deployment (2 of 2) • Deployment of lots of small changes reduces the chance of errors in any single deployment • Requires comprehensive automation for testing and deployment • But errors still do occur • Although we have good methods for testing individual components, integration testing is still hard and error-prone. • Some approaches • Roll back • A/B testing • Database checkpoints
  • 25. Choices • Open source preferred • Personal favorites • Widely used (best practices in leading companies)
  • 26. Prefer Open Source • “Free” • Full source is available • Community participation • Can move very fast • More responsive • A plus if a commercial company provides support
  • 27. Programming Languages (1 of 3) • Compiled versus interpreted • Compiled: C, C++, Go • Semi-compiled: Java, C#, Scala • Interpreted: Python, Ruby, R • Static versus dynamic type checking • Static typing catches more errors at compile time • Statically typed code is easier to understand and maintain • Static typing requires more work when writing code • Garbage collection: safety versus performance
  • 28. Programming Languages (2 of 3) • Choice of language does not matter • I can write any algorithm in any language • Let’s avoid pointless “language religion” wars • Choice of language matters a lot • Language can have a big impact on performance, productivity and reliability • Programming languages shape the way we think
  • 29. Programming Languages (3 of 3) • Scala • Semi-compiled: compiled to bytecode, then run with a JIT compiler • Statically typed, but with the concise syntax of a dynamically typed language • Garbage collected • Runs on JVM. Full ecosystem of libraries and tools available. • Key features • Functional plus immutable data (major advance in program quality) • Scala Futures and Akka Actors (major advance in easy to understand, easy to get correct, and fault-tolerant distributed computation) • Main language for Spark • Suitable for both data engineers and data scientists (better cooperation)
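A short sketch of the first two key features, using only the Scala standard library. The `Reading` type, the `fetch` function, and the temperature values are hypothetical stand-ins for a call to another server; Akka Actors are not shown because they need an external dependency.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object FuturesSketch {
  // Immutable record: an "update" produces a new value via copy.
  final case class Reading(sensor: String, celsius: Double)

  // Hypothetical remote fetch; fails for an empty sensor id.
  def fetch(sensor: String): Future[Reading] =
    if (sensor.isEmpty) Future.failed(new IllegalArgumentException("empty sensor id"))
    else Future(Reading(sensor, 20.0 + sensor.length))

  // Start all fetches concurrently, then combine the results
  // without any shared mutable state.
  def average(sensors: List[String]): Future[Double] =
    Future.sequence(sensors.map(fetch)).map { readings =>
      readings.map(_.celsius).sum / readings.size
    }

  // A failure is carried inside the Future; recover maps it to a fallback.
  def averageOrDefault(sensors: List[String], default: Double): Future[Double] =
    average(sensors).recover { case _: IllegalArgumentException => default }
}
```

A caller would typically keep composing the returned `Future` with `map` or `flatMap` and block with `Await.result` only at the edge of the program. Failure handling is part of the value being passed around rather than a separate control path, which is one reason this style suits distributed code.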
  • 30. Messaging • Kafka (written in Scala) • Reliable buffer between producer and consumer • Can replay • Multiple producers and consumers • Multiple topics • Linearly scalable • Kafka Streams • Other • Reactive streams • Spark streaming
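The buffering and replay properties can be illustrated with a toy in-memory model. This is not the Kafka client API: `Topic`, `Consumer`, `publish`, `poll`, and `seek` are names defined here for the sketch, only loosely echoing Kafka terminology, and a single partition is assumed. The key idea is that records stay in the log after being read and each consumer tracks its own offset.

```scala
object TopicSketch {
  // In-memory stand-in for one topic partition: an append-only record list.
  final class Topic[A] {
    private var records = Vector.empty[A]

    // Append a record and return the offset assigned to it.
    def publish(record: A): Long = synchronized {
      records = records :+ record
      records.length - 1L
    }

    // Read up to `max` records starting at offset `from`; nothing is removed.
    def read(from: Long, max: Int): Vector[A] = synchronized {
      records.slice(from.toInt, from.toInt + max)
    }
  }

  // Each consumer keeps its own position, so consumers are independent.
  final class Consumer[A](topic: Topic[A]) {
    private var offset = 0L

    // Fetch the next batch and advance this consumer's offset.
    def poll(max: Int): Vector[A] = {
      val batch = topic.read(offset, max)
      offset += batch.length
      batch
    }

    // Replay: rewind to an earlier offset and read the same records again.
    def seek(to: Long): Unit = { offset = to }
  }
}
```

A slow consumer simply falls behind in offset without blocking producers or other consumers, which is the buffering role described on the slide. Rewinding with `seek` re-delivers earlier records, for instance to reprocess data after a bug fix.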
  • 31. Databases • Relational: Postgres (scaling can be a problem) • Embedded: LevelDB, MapDB • NoSQL: Cassandra, Couchbase • Graph: Neo4j, Titan, DataStax Enterprise Graph
  • 32. Analytics • Hadoop (let it die!) • Spark (written in Scala, Scala API is best) • Trend toward SQL • Improved performance via query optimizer • Widely understood (but poor?) programming model • Somewhat abandoned functional programming (RDDs) • Dataset transforms: experiment to combine functional programming with support for query optimization
  • 33. Data Center Infrastructure and Continuous Deployment • GitHub, SBT, Artifactory, Jenkins • Docker/Rkt, Etcd, CoreOS • Mesos, Kubernetes • Cloud: AWS, Google, Microsoft
  • 35. Final Thoughts • Scala is the best choice for both data engineers and data scientists • Spark is the best choice for data analysis • Data will continue to grow in size and importance • The number of servers we use will continue to grow, requiring better fault tolerance and better automation • When data engineers and data scientists work closely together, both benefit and better results are achieved • We need to break down traditional silos • We need shared tools and technologies that work well for both groups