This talk presents a use case: implementing a data pipeline in the maritime domain at Windward using Spark applications.
The process converted a monolithic application into a fully distributed and scalable one.
We'll cover the tools and the process of taking an idea and developing Spark applications around it, and walk through the development of an application end to end, from DevOps to the way of thinking about application development, with use cases and the lessons learned at Windward Ltd. I hope the talk gives you some practical tools for "Spark"ing your way around.
2. About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” unit - IAF
3. Agenda
Windward Domain and Use Case
Spark – Short Introduction
Windward’s Spark Tool Box to Production
Q & A
4. What does Windward do?
Windward is a maritime data and analytics company, bringing unprecedented visibility to the maritime domain. Windward has built the world's first maritime data platform, the Windward Mind, which analyzes and organizes the world's maritime data.
5. Where does the data come from?
Maritime Databases
AIS (Automatic Identification System)
Port Agent Reports
Other Sources
8. Special in Windward’s Domain
Maritime Mind
Data Mining Scope
Anomaly Detection
◦ Single data point scope
◦ Going into detail
◦ Fraud detection
Market Trends
◦ Sample / total data scope
◦ Trends
◦ Data sampling problems
11. What is Apache Spark?
General-purpose cluster computing framework
Does computation in memory (but not only)
Also fast for heavy operations that run on disk
13. Resilient Distributed Datasets
Are fault tolerant
Parallel data structure
Are Immutable
Can persist intermediate results in memory
Transformations are operators and are lazily evaluated
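The immutability and lazy evaluation listed above can be illustrated with a toy in plain Python. This is not Spark's API: `LazyDataset` and its methods are hypothetical names made up for this sketch, and it runs on a single machine with no partitioning or fault tolerance. It only shows that a transformation returns a new object and records a step, while the work happens when an action (`collect`) is called.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark): immutable and lazy."""

    def __init__(self, source: Iterable[Any], lineage: Tuple[Callable, ...] = ()):
        self._source = tuple(source)  # never mutated after construction
        self._lineage = lineage       # recorded steps, not yet executed

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: returns a NEW dataset and only records the step.
        return LazyDataset(self._source, self._lineage + (lambda items: map(fn, items),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: again nothing is computed here.
        return LazyDataset(self._source, self._lineage + (lambda items: filter(pred, items),))

    def collect(self) -> List[Any]:
        # Action: replay the recorded lineage over the source data.
        items = iter(self._source)
        for step in self._lineage:
            items = step(items)
        return list(items)


calls = []


def double(x: int) -> int:
    calls.append(x)  # lets us observe when the function actually runs
    return x * 2


numbers = LazyDataset(range(5))
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = evens_doubled.collect()   # [0, 4, 8]
```

Because the lineage is kept alongside the source data, the result can be recomputed at any time by replaying the steps, which is the same idea that lets Spark rebuild lost data instead of replicating it.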
14. RDD Persistence and Partitioning
Users control which RDDs are kept for reuse (in memory and on disk)
◦ Persist, Cache, Unpersist
Users can define how an RDD is partitioned across machines
Only the lost partitions of an RDD need to be recomputed upon failure.
15. Persistence layers for Spark
Distributed system
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ HBase
File formats
◦ Text file
◦ Sequence File
◦ Avro
◦ Parquet
◦ Other Hadoop formats
16. History Server
Stand Alone Cluster
Integrates with both YARN and Mesos
In Spark Standalone, the history server is embedded in the master.
On YARN/Mesos, run the history server as a daemon.
20. Where do I start from?!
Download Spark as a package
◦ Run it in “local” mode (no need for a real cluster)
◦ “spark-ec2” scripts to spin up a Standalone mode cluster
◦ Amazon Elastic MapReduce (EMR)
YARN vs. Mesos vs. Standalone
21. Running Environments
Development – Testing – Production
◦ Don’t you need more?
◦ Be as flexible as you can
Cluster Utilization
◦ Unified Cluster for all environments
Vs.
◦ Cluster per Environment
(Cluster per Data Center)
Configuration
◦ Local Files vs. Distributed
22. Saving and Maintaining the Data
Local File System – Not effective in a distributed environment
HDFS
◦ Might be very Expensive
◦ Locality Rules – Spark + HDFS node + Same machine
S3
◦ High latency and pretty slow but low costs
Cassandra
◦ Rigid data model
◦ Very fast and depends on the Volume of the data can be
23. DevOps – Keep It Simple, Stupid
Linux
◦ Bash scripts
◦ Crontab
Automation via Jenkins
Continuous Deployment – with every Git push
Dev Testing
Live
Staging
Production
Daily | Manual | Automatic | Automatic
24. Build Automation
Maven
◦ Sonatype Nexus artifact management
◦ Deploy and Script generation scripts
◦ Per Environment Testing
◦ Data Validation
◦ Scheduled Tasks
25. Workflow Management
Oozie – Very hard to integrate with Spark
◦ XML configuration based and not that convenient
Azkaban (Haven’t tried it)
Chosen:
◦ Luigi
◦ Crontab + Jenkins (KISS again)
26. Testing
Unit
◦ JUnit tests that run on the Spark “Functions”
End to End
◦ Simulate the full execution of an application on a
single JVM (local mode) – Real input, Real output
Functional
◦ Stand alone application
◦ Running on the cluster
◦ Minimal coverage – Shows working data flow
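The unit-test level above works because the logic passed to Spark can be written as plain functions and tested without a cluster. The talk used JUnit; the sketch below shows the same idea with Python's `unittest`. The record format (`vessel_id,lat,lon`) and the function name `parse_position` are assumptions made up for illustration, not Windward's actual code.

```python
import unittest
from typing import Optional, Tuple


def parse_position(line: str) -> Optional[Tuple[str, float, float]]:
    """Hypothetical parser: 'vessel_id,lat,lon' -> tuple, or None if invalid."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    vessel_id, lat_text, lon_text = parts
    try:
        lat, lon = float(lat_text), float(lon_text)
    except ValueError:
        return None
    # Reject coordinates outside the valid latitude/longitude ranges.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return vessel_id, lat, lon


class ParsePositionTest(unittest.TestCase):
    def test_valid_line(self):
        self.assertEqual(parse_position("v1,32.08,34.78"), ("v1", 32.08, 34.78))

    def test_out_of_range_latitude(self):
        self.assertIsNone(parse_position("v1,123.0,34.78"))

    def test_malformed_line(self):
        self.assertIsNone(parse_position("not-a-record"))
```

Saved as a module, the tests run with `python -m unittest`. The same function could then be handed to a map step in the distributed job, so the end-to-end and functional tests only need to cover the data flow.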
28. Logging
Uses log4j by default (via slf4j)
How to log correctly:
◦ Separate logs for different applications
◦ Driver and Executors log to different locations
◦ Yarn logging also exists (Might find problems there too)
ELK Stack (Elasticsearch – Logstash – Kibana)
◦ Ship logs via Logstash shippers (intrusive) or a UDP socket appender (Log4j2)
◦ DO NOT use the regular TCP Log4j appender
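The UDP recommendation above concerns Log4j2 on the JVM. As an analogous sketch only, here is the same fire-and-forget idea with Python's standard `logging` module: each record is sent as one JSON datagram, so a slow or unreachable collector does not hold up the application the way a blocking TCP connection presumably could. `JsonUdpHandler` and `build_logger` are hypothetical names, and the JSON field names are an assumption about what the receiving side expects.

```python
import json
import logging
import logging.handlers


class JsonUdpHandler(logging.handlers.DatagramHandler):
    """Sends each log record as one JSON datagram (fire-and-forget)."""

    def makePickle(self, record: logging.LogRecord) -> bytes:
        # DatagramHandler normally sends a pickled record; send JSON instead
        # so a non-Python collector can read it.
        payload = {
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": record.created,
        }
        return json.dumps(payload).encode("utf-8")


def build_logger(app_name: str, host: str, port: int) -> logging.Logger:
    """Create a per-application logger that ships records over UDP."""
    logger = logging.getLogger(app_name)
    logger.setLevel(logging.INFO)
    logger.addHandler(JsonUdpHandler(host, port))
    return logger
```

Using one logger name per application keeps the logs of different applications separable on the collector side, in line with the first bullet of this slide. The trade-off of UDP is that datagrams can be dropped silently.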
29. Reporting and Monitoring
Graphite
◦ Online application metrics
Grafana
◦ Good Graphite visualization
Jenkins - Monitoring
◦ Scheduled tests
◦ Validate result set of the applications
◦ Hung or stuck applications
◦ Failed application
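Online application metrics reach Graphite through Carbon's plaintext protocol: one line per data point, made of a metric path, a value, and a Unix timestamp. A minimal sketch in Python follows. The helper names and the metric path in the usage note are made up for illustration, and `localhost:2003` assumes a default Carbon plaintext listener.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one plaintext-protocol line: path, value, timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send one metric line to a Carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))
```

For example, `send_metric(format_metric("pipeline.records_processed", 1500))` reports one data point that Grafana can then chart. A scheduled Jenkins job could also watch such a metric to flag an application that has stopped reporting.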
32. What have we talked about?
Windward Maritime Mind and Domain
Spark Framework
◦ Utilizing distributed computation framework
Spark to Production Tool Box
34. Thanks, Resources and Contact
Demi Ben-Ari
◦ LinkedIn
◦ Twitter: @demibenari
◦ Blog: http://progexc.blogspot.com/
◦ Email: demi.benari@gmail.com
◦ Windward Ltd.
◦ Big Things are Happening Here – Facebook group
◦ Meetup – Big Things