The world has changed, and one huge server won't do the job anymore. When you're dealing with vast amounts of data that keep growing, the ability to scale out is your saviour.
This lecture covers the basics of Apache Spark and distributed computing, and the development tools needed for a functional environment.
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward, Ofek Alumni
Has over 9 years of experience building various systems, in both near-real-time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-i...
2. About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
Co-Founder “Big Things” Big Data Community
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” – IAF
• Interested in almost every kind of technology – A True Geek
4. Agenda
What is Spark?
Spark Infrastructure and Basics
Spark Features and Suite
◦ Spark-Shell Live Demo
◦ Cassandra & Spark
Development with Spark
Conclusion
5. What is Spark?
Fast and Expressive Cluster Computing Engine, Compatible with Apache Hadoop
Efficient
◦ General execution graphs
◦ In-memory storage
Usable
◦ Rich APIs in Java, Scala, Python
◦ Interactive shell
6. What is Spark?
Apache Spark is a general-purpose cluster computing framework
Spark does computation in memory & on disk
Apache Spark has low-level and high-level APIs
7. Spark Philosophy
Make life easy and productive for data scientists
◦ Well documented, expressive APIs
◦ Powerful domain-specific libraries
◦ Easy integration with storage systems
◦ … and caching to avoid data movement
Predictable releases, stable APIs
◦ A stable release every 3 months
9. Spark Contributors
Highly active open source community (09/2015)
◦ https://github.com/apache/spark/
◦ https://www.openhub.net/p/apache-spark
10. About the Spark project
Spark was started at UC Berkeley, and the main contributor is Databricks.
Interactive shells for Spark in Scala and Python
◦ (spark-shell, pyspark)
Currently stable at version 1.6
16. Driver and Spark Context
The Spark Context is your “handle” to the Spark cluster.
The driver program contains the main method.
You use your Spark Context to access your cluster.
◦ It configures the connection to the cluster
◦ It lets you create RDDs
The variable named sc (for the Spark Context) is already defined in your driver in the Spark Shell.
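The two ways of getting a Spark Context described above can be sketched as follows (a minimal sketch; the app name and master URL are illustrative, not from the talk):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In a standalone driver program you build the context yourself;
// in spark-shell the variable `sc` is already defined for you.
val conf = new SparkConf()
  .setAppName("MyApp")        // hypothetical application name
  .setMaster("local[*]")      // local mode: use all cores on this machine
val sc = new SparkContext(conf)
```

In spark-shell you would skip all of this and just use `sc` directly.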
17. What’s an RDD?
Resilient Distributed Dataset
◦ Fault tolerant
◦ Parallel data structure
◦ Distributed across the nodes in the cluster
◦ Immutable!!!
◦ Can persist intermediate results in memory
◦ Transformations are operators and are lazily evaluated
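A minimal spark-shell sketch of the properties above, assuming `sc` is available:

```scala
// Create an RDD from a local collection; it is split across partitions.
val numbers = sc.parallelize(1 to 10)

// RDDs are immutable: each transformation returns a NEW RDD,
// and nothing executes yet (transformations are lazy).
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// An action triggers the actual distributed computation.
evens.collect()
```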
20. RDD Persistence and Partitioning
Users control which RDDs will be reused (in memory and disk storage)
◦ Persist, Cache, Unpersist
Users can ask for an RDD to be partitioned across machines
Only the lost partitions of an RDD need to be recomputed upon failure.
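The persist/cache/unpersist and partitioning controls above look roughly like this in Scala (a sketch; the input path and partition count are made up for illustration):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
val words = lines.flatMap(_.split(" ")).map((_, 1))

// Keep this RDD around for reuse; choose memory, disk, or both.
words.persist(StorageLevel.MEMORY_AND_DISK)
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)

// Ask for the pair RDD to be hash-partitioned across the cluster.
val partitioned = words.partitionBy(new HashPartitioner(8))

// Release the storage when the RDD is no longer needed.
words.unpersist()
```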
21. Spark Execution Engine
Spark uses lazy evaluation
◦ Runs the code only when it encounters an action operation
There is no need to design and write a single complex map-reduce job.
◦ In Spark we can write smaller, manageable operations
◦ Spark will group operations together
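Lazy evaluation in practice (a sketch; the log path and field layout are invented for the example):

```scala
// Nothing runs here: transformations only build up the execution graph.
val logs   = sc.textFile("hdfs:///logs/app.log")   // hypothetical path
val errors = logs.filter(_.contains("ERROR"))
val fields = errors.map(_.split(" ")(0))

// Only the action below triggers a job; Spark groups the chained
// transformations above into stages and runs them together.
val howMany = fields.count()
```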
22. Spark Execution Engine
Spark serializes your code and ships it to the executors
◦ You can choose your serialization method (Java serialization, Kryo)
In Java, functions are specified as objects that implement one of Spark’s Function interfaces.
◦ The same method of implementation can be used in Scala and Python as well.
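Switching the serializer to Kryo is a configuration change on the SparkConf (a sketch; `MyRecord` is a hypothetical application class, not something from the talk):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("KryoExample")
  // Use Kryo instead of default Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front makes Kryo output smaller and faster.
  .registerKryoClasses(Array(classOf[MyRecord]))
```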
23. Persistence Layers for Spark
Storage systems
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ HBase
File formats
◦ Text files (CSV, TSV, plain text)
◦ Sequence File
◦ AVRO
◦ Parquet
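Reading and writing a couple of the formats above from the RDD API (a sketch; all paths are hypothetical):

```scala
// Plain text: one record per line.
val text = sc.textFile("hdfs:///data/input.txt")
text.saveAsTextFile("hdfs:///out/text")

// Sequence files: key-value pairs of Hadoop Writable types.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("hdfs:///out/seq")
val back = sc.sequenceFile[String, Int]("hdfs:///out/seq")
```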
26. Spark Core Features
Distributed in-memory computation
Standalone and local capabilities
History server for the Spark UI
Resource management integration
Unified job submission tool
27. History Server
Can be run on all Spark deployments
◦ Standalone, YARN, Mesos
Integrates with both YARN and Mesos
In YARN / Mesos, run the history server as a daemon.
30. Cassandra & Spark
Cassandra cluster
◦ Bare metal vs. on the cloud
DSE – DataStax Enterprise
◦ Cassandra & Spark on each node
vs.
◦ Separate Cassandra and Spark clusters
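Reading a Cassandra table into an RDD (a sketch; it assumes the spark-cassandra-connector is on the classpath and configured with a contact point, and the keyspace/table names are invented):

```scala
import com.datastax.spark.connector._

// Each row comes back as a CassandraRow; the scan is distributed
// across the Spark executors, token range by token range.
val rows = sc.cassandraTable("my_keyspace", "my_table")
val count = rows.count()
```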
32. Where do I start from?!
Download Spark as a package
◦ Run it in “local” mode (no need for a real cluster)
◦ “spark-ec2” scripts to spin up a Standalone-mode cluster
◦ Amazon Elastic MapReduce (EMR)
YARN vs. Mesos vs. Standalone
33. Running Environments
Development – Testing – Production
◦ Don’t you need more?
◦ Be as flexible as you can
Cluster utilization
◦ Unified cluster for all environments
vs.
◦ Cluster per environment (cluster per data center)
Configuration
◦ Local files vs. distributed
34. Saving and Maintaining the Data
Local file system – not effective in a distributed environment
HDFS
◦ Might be very expensive
◦ Locality rules – Spark and the HDFS node on the same machine
S3
◦ High latency and pretty slow, but low costs
Cassandra
◦ Rigid data model
◦ Very fast; depending on the volume of the data, can be expensive
35. DevOps – Keep It Simple, Stupid
Linux
◦ Bash scripts
◦ Crontab
Automation via Jenkins
Continuous Deployment – with every Git push
◦ Dev → Testing (automatic) → Staging (automatic) → Production / Live (daily, manual)
36. Build Automation
Maven
◦ Sonatype Nexus artifact management
◦ Deployment and script-generation scripts
◦ Per-environment testing
◦ Data validation
◦ Scheduled tasks
40. Conclusion
Spark is a popular and very powerful distributed in-memory computation framework
Broadly used, with lots of contributors
A leading tool in the new world of petabytes of unexplored data