Introduction to Data Engineering (with Scala)

John Nestor 47 Degrees
www.47deg.com
June 27, 2016
Galvanize

47deg.com © Copyright 2015 47 Degrees
Outline
• Introduction
• Data Engineering Requirements
• Data Engineering Design Patterns
• Recommended Data Engineering Tools and Systems
• Final Thoughts

Typical Data Engineering Systems
• Low latency response to HTTP or REST requests
• Database reads and writes
• Run ML models
• Produce event streams for later processing
• Near real time event processing
• Simple analytics and alerts
• Analysis of server information
• Logs and metrics
• Produce data for later analysis by data scientists

Big Data
• (Much) Too big to fit on a single machine
• Must have both
• distributed computation
• distributed data (bases)
• A distributed system has no single main memory
• Must pass data across servers
• Large number of distributed components means failure
is common
• Dealing with failure must be part of the fundamental
architecture

Fallacies of Distributed Computing
• https://blogs.oracle.com/jag/resource/Fallacies.html (Peter Deutsch)
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn’t change
• There is one administrator
• Transport cost is zero
• The network is homogeneous

Reactive Manifesto
• http://www.reactivemanifesto.org/
• Responsive - predictable latency
• Resilient - fault tolerant
• Elastic - (auto) scalability
• Message driven - basis of a distributed implementation

Data Engineering Requirements

Scalability
• New systems are getting bigger all the time
• Hardware is getting cheaper
• Business requirements to stay competitive are
increasing
• Cloud computing permits easy expansion based on
instantaneous need
• No single server is ever big enough
• Scalability goal: performance increases (close to)
linearly with the number of servers

Availability
• Systems are increasingly expected to be available 24/7
with no downtime
• Any server can fail, others must be able to take over
• No downtime for maintenance. Software upgrades
occur without shutting system down.
• Must avoid availability-killing features such as two-phase commit
• SLAs are expressed as a number of nines
• The best most achieve is 3 nines (8.8 hours of downtime per year)
• Most strive for 6 nines (about 30 seconds per year)
• 9 nines would allow only 32 msec per year; AWS S3 Standard is designed for 99.99% availability (about 53 minutes per year)
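
The downtime budget for a given number of nines is simple arithmetic. The sketch below is an illustration, not part of the original talk; the object name `Nines` and the choice of a 365-day year are assumptions.

```scala
object Nines {
  // Seconds in a 365-day year (assumption: leap years are ignored).
  val SecondsPerYear: Double = 365.0 * 24 * 60 * 60

  /** Allowed downtime, in seconds per year, for an availability of `n` nines
    * (n = 3 means 99.9%, n = 6 means 99.9999%, and so on). */
  def downtimeSeconds(n: Int): Double = SecondsPerYear * math.pow(10.0, -n)
}
```

For example, `Nines.downtimeSeconds(3) / 3600` is roughly 8.76 hours, and each additional nine divides the budget by ten.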

Durability
• Losing data is never acceptable
• Since any single point can fail, we must replicate data
• Replication to
• main memory
• different server
• server in different zone
• across geo-distributed data centers
• AWS S3 is designed for 11 nines of durability: on average, one object lost out of 10,000 every 10 million years

Latency and Bandwidth
• Latency - msec to process a single request
• More hops can increase latency
• Very fast network hardware can reduce latency
• The speed of light still sets a lower bound on latency
• Bandwidth - number of requests processed per sec
• More servers can increase bandwidth
• Latency Numbers Every Programmer Should Know
• main memory (0.0001 msec)
• different server (0.5 msec)
• across geo-distributed data centers (150 msec)

Data Engineering Design Patterns

Immutable Data
• Concurrent access to mutable data requires
synchronization. Immutable data does not.
• Data passed between servers will be immutable
• Immutable data plus functional programming results in
code that is easier to understand and test
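
As a minimal sketch of the idea, assuming a hypothetical `Reading` message type (not from the talk): a Scala case class is immutable by default, and an "update" produces a new value instead of changing the old one.

```scala
object ImmutableExample {
  // Hypothetical message type: all fields are vals, so an instance
  // cannot change after it has been created.
  final case class Reading(sensorId: String, value: Double)

  // "Updating" returns a new Reading and leaves the original untouched,
  // so the original can be shared between threads or sent to another
  // server without any synchronization.
  def calibrate(r: Reading, offset: Double): Reading =
    r.copy(value = r.value + offset)
}
```

Because `calibrate` is a pure function of its inputs, a test only needs to compare the input and the returned value.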

Messaging (1 of 2)
• Message sent from A to B
• A gets ack from B
• A gets no ack from B
• message never got to B
• ack from B never got to A
• What kind?
• at most once (never resend)
• at least once (resend if no ack)
• exactly once (resend idempotently if no ack)
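
The at-least-once case can be sketched as a retry loop. This is an illustration under stated assumptions: the names `Message`, `sendWithRetry`, and `channel` are hypothetical, and the network is modeled as a function that returns true only when an ack came back.

```scala
import scala.annotation.tailrec

object AtLeastOnce {
  final case class Message(id: String, payload: String)

  /** Sends `msg` through `channel` until an ack arrives or `maxAttempts`
    * is exhausted. `channel` returns true when A received an ack; false
    * models either a lost message or a lost ack (A cannot tell which).
    * Returns the number of attempts used, or None if no ack ever arrived. */
  def sendWithRetry(msg: Message,
                    channel: Message => Boolean,
                    maxAttempts: Int): Option[Int] = {
    @tailrec
    def loop(attempt: Int): Option[Int] =
      if (attempt > maxAttempts) None
      else if (channel(msg)) Some(attempt)
      else loop(attempt + 1)
    loop(1)
  }
}
```

When only the ack is lost, B has already received the message, so every retry delivers a duplicate. That is why at-least-once delivery needs an idempotent receiver to behave like exactly-once.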

Messaging (2 of 2)
• Idempotence
• Multiple sends have same effect
• set X to 3, NOT add 2 to X
• Attach GUID, destination must handle
• In order delivery
• Waiting for an ack before sending next increases
latency
• Attach sequence number, destination must handle
• Batching multiple messages together can help
• Design so order does not matter
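
A receiver that handles both GUIDs and sequence numbers might look like the sketch below. All names (`Msg`, `State`, `receive`) are hypothetical, and the operation is deliberately the non-idempotent "add delta to a total" so that the deduplication is what makes resends safe.

```scala
import scala.annotation.tailrec

object Receiver {
  final case class Msg(guid: String, seq: Long, delta: Int)

  /** Immutable receiver state: running total, GUIDs already applied,
    * next expected sequence number, and messages that arrived early. */
  final case class State(
      total: Int = 0,
      seen: Set[String] = Set.empty,
      nextSeq: Long = 0L,
      pending: Map[Long, Msg] = Map.empty
  )

  /** Handles one incoming message: drops duplicates, buffers messages
    * that arrive out of order, then applies whatever is deliverable. */
  def receive(state: State, msg: Msg): State =
    if (state.seen.contains(msg.guid) || msg.seq < state.nextSeq) state
    else drain(state.copy(pending = state.pending + (msg.seq -> msg)))

  // Applies buffered messages in sequence order until a gap is reached.
  @tailrec
  private def drain(s: State): State =
    s.pending.get(s.nextSeq) match {
      case Some(m) =>
        drain(State(s.total + m.delta, s.seen + m.guid, s.nextSeq + 1, s.pending - s.nextSeq))
      case None => s
    }
}
```

Folding a list of messages through `receive` gives the same final state whether or not duplicates or reordered messages are present, as long as every sequence number eventually arrives.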

Persistent Data (1 of 3)
• CAP theorem (pick 2)
• Consistency (ACID)
• Availability
• Partition tolerance (closely tied to fault tolerance)
• Distributed consistency solutions: 2-phase commit is
“the anti-availability protocol” (Helland)
• For very large, highly available systems, AP is the only possible choice

Persistent Data (2 of 3)
• Detecting conflicts with vector clocks
• Each server has own time
• Vector has one element for each server
• Forms a partial order
• Resolving conflicts (for example: 2 different phone numbers)
• Select the latest
• Ask someone
• Keep both
• CRDTs (generalization of keep both)
• conflict-free replicated data types
• merge must be commutative, associative, idempotent
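
Both ideas fit in a few lines of Scala. This is a minimal sketch, not the talk's own code: clocks and counters are modeled as maps from a server id to a number, and the grow-only counter (G-Counter) is just one simple example of a CRDT.

```scala
object Clocks {
  // server id -> that server's logical time
  type VectorClock = Map[String, Long]

  sealed trait Order
  case object Before extends Order
  case object After extends Order
  case object Equal extends Order
  // Neither clock dominates the other: the updates conflict.
  case object Concurrent extends Order

  /** Compares two vector clocks; a missing entry counts as 0. */
  def compare(a: VectorClock, b: VectorClock): Order = {
    val servers = a.keySet ++ b.keySet
    val aLeB = servers.forall(s => a.getOrElse(s, 0L) <= b.getOrElse(s, 0L))
    val bLeA = servers.forall(s => b.getOrElse(s, 0L) <= a.getOrElse(s, 0L))
    (aLeB, bLeA) match {
      case (true, true)   => Equal
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }
  }

  // G-Counter CRDT: one count per server, merged by element-wise max.
  type GCounter = Map[String, Long]

  def merge(x: GCounter, y: GCounter): GCounter =
    (x.keySet ++ y.keySet)
      .map(s => s -> math.max(x.getOrElse(s, 0L), y.getOrElse(s, 0L)))
      .toMap

  def value(c: GCounter): Long = c.values.sum
}
```

The `Concurrent` result is what makes the order partial rather than total. Element-wise max is commutative, associative, and idempotent, so replicas can merge in any order, any number of times, and still converge.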

Persistent Data (3 of 3)
• Log-based stores
• Sequence of transformational steps
• Each step is immutable
• Log is append only (fast sequential write to disk)
• Database is a cache of some point in the log
• Log is primary
• Database can be deleted and recreated from log
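
The relationship between log and database can be sketched as a fold over the log. The event types and the in-memory `Map` standing in for the database are assumptions made for illustration.

```scala
object LogStore {
  // Hypothetical events for a key-value store; each is an immutable step.
  sealed trait Event
  final case class Put(key: String, value: String) extends Event
  final case class Delete(key: String) extends Event

  // The "database" is modeled as an immutable map.
  type Db = Map[String, String]

  /** Applies a single log entry to a database snapshot. */
  def applyEvent(db: Db, e: Event): Db = e match {
    case Put(k, v) => db + (k -> v)
    case Delete(k) => db - k
  }

  /** Rebuilds the database as of the first `upTo` entries of the log. */
  def replay(log: Vector[Event], upTo: Int): Db =
    log.take(upTo).foldLeft(Map.empty[String, String])(applyEvent)
}
```

Calling `replay` with a smaller `upTo` yields the database as it was at an earlier point in the log, which is the sense in which the database is only a cache.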

Concurrency and Distribution
• Individual servers are getting ever more cores.
• Utilization is key
• Large data applications require multiple servers
• Connections between servers are frequent points of
failure
• Parallel data operations help: parallel collections, Spark
• Traditional synchronization (locks, monitors) is error-prone and very hard to get right.
• Message-based systems (Hoare’s CSP, Hewitt’s actors) are a better solution and work well across servers.
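
As a small single-machine illustration of lock-free parallelism, the sketch below uses Scala's standard `Future` rather than actors or Spark (those need external libraries). The function name `sumOfSquares` and the chunking scheme are assumptions.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object ParallelSum {
  /** Splits `data` into chunks, sums the squares of each chunk in its own
    * Future, then combines the partial sums. No locks are needed because
    * each task only reads immutable data and returns a value. */
  def sumOfSquares(data: Vector[Long], chunkSize: Int): Future[Long] = {
    val partials: Vector[Future[Long]] =
      data.grouped(chunkSize).map(chunk => Future(chunk.map(x => x * x).sum)).toVector
    Future.sequence(partials).map(_.sum)
  }
}
```

The result is itself a `Future`, so a caller can compose it further or block on it with `scala.concurrent.Await.result` at the edge of the program.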

Logging and Monitoring
• As systems involve more and more servers
• Detecting and locating failure is getting harder
• Understanding system performance and performance
tuning is getting harder
• We now produce massive amounts of logs and
monitoring data
• Making sense of this huge volume of data is hard
• For failures we need near real-time analysis
• Increasing need for data science solutions

Continuous Deployment (1 of 2)
• High availability means we can no longer shut down for
upgrades to
• Application code
• Operating system upgrades and patches
• Hardware maintenance
• Automatic server failover
• Rolling upgrades
• Backward compatibility
• Messages
• Database schemas
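
During a rolling upgrade, old and new servers exchange messages at the same time, so a decoder has to accept both formats. The sketch below is one possible approach under stated assumptions: the `Order` message, its key-value encoding, and the `"USD"` default are all hypothetical.

```scala
object Compat {
  // Hypothetical message: `currency` was added in a later version.
  final case class Order(id: String, amount: Long, currency: String)

  /** Decodes a key-value encoded order.
    * - Old senders omit "currency", so a default is supplied.
    * - Unknown keys are ignored, so newer senders do not break this reader.
    * Returns None when a required field is missing or malformed. */
  def decode(fields: Map[String, String]): Option[Order] =
    for {
      id     <- fields.get("id")
      amount <- fields.get("amount").flatMap(_.toLongOption)
    } yield Order(id, amount, fields.getOrElse("currency", "USD"))
}
```

The same two rules, default what is missing and ignore what is unknown, apply to database schemas that must be read by two code versions at once.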

Continuous Deployment (2 of 2)
• Deployment of lots of small changes reduces the chance of
errors in any single deployment
• Requires comprehensive automation for testing and
deployment
• But errors still do occur
• Although we have good methods for testing individual
components, integration testing is still hard and error prone.
• Some approaches
• Roll back
• A/B testing
• Database checkpoints

Recommended Data Engineering Tools and Systems

Choices
• Open source preferred
• Personal favorites
• Widely used (best practices in leading companies)

Prefer Open Source
• “Free”
• Full source is available
• Community participation
• Can move very fast
• More responsive
• A plus if a commercial company provides support

Programming Languages (1 of 3)
• Compiled versus interpreted
• Compiled: C, C++, Go
• Semi-compiled: Java, C#, Scala
• Interpreted: Python, Ruby, R
• Static versus dynamic type checking
• Static typing catches more errors at compile time
• Statically typed code is easier to understand and maintain
• Static typing requires more work when writing code
• Garbage collection. Safety versus performance

Programming Languages (2 of 3)
• Choice of language does not matter
• I can write any algorithm in any language
• Let’s avoid pointless “language religion” wars
• Choice of language matters a lot
• Language can have a big impact on performance,
productivity and reliability
• Programming languages shape the way we think

Programming Languages (3 of 3)
• Scala
• Semi-compiled: compiled to JVM bytecode, then JIT-compiled at run time.
• Statically typed, but with the concise syntax of a dynamically typed language
• Garbage collected
• Runs on JVM. Full ecosystem of libraries and tools available.
• Key features
• Functional plus immutable data (major advance in program quality)
• Scala Futures and Akka Actors (major advance in easy to
understand, easy to get correct, and fault-tolerant distributed
computation)
• Main language for Spark
• Suitable for both data engineers and data scientists (better
cooperation)

Messaging
• Kafka (written in Scala)
• Reliable buffer between producer and consumer
• Can replay
• Multiple producers and consumers
• Multiple topics
• Linearly scalable
• Kafka Streams
• Other
• Reactive streams
• Spark streaming

Databases
• Relational: Postgres (scaling can be a problem)
• Embedded: LevelDB, MapDB
• NoSQL: Cassandra, Couchbase
• Graph: Neo4j, Titan, DataStax Enterprise Graph

Analytics
• Hadoop (let it die!)
• Spark (Written in Scala, Scala API is best)
• Trend toward SQL
• Improved performance via query optimizer
• Widely understood (but poor?) programming model
• Somewhat abandoned functional programming
(RDDs)
• Dataset transforms: an experiment to combine functional
programming with support for query optimization

Data Center Infrastructure and Continuous Deployment
• GitHub, SBT, Artifactory, Jenkins
• Docker/Rkt, Etcd, CoreOS
• Mesos, Kubernetes
• Cloud: AWS, Google, Microsoft

Final Thoughts
• Scala is the best choice for both data engineers and
data scientists
• Spark is the best choice for data analysis
• Data will continue to grow in size and importance
• The number of servers we use will continue to grow
requiring better fault tolerance and better automation
• When data engineers and data scientists work closely
together, both benefit and better results are achieved
• We need to break down traditional silos
• We need shared tools and technologies that work
well for both groups
