Spark in the Maritime Domain

About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. in Computer Science – Academic College Tel-Aviv Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” unit - IAF

Agenda
 Windward Domain and Use Case
 Spark – Short Introduction
 Windward’s Spark to Production Tool Box
 Q & A

What does Windward do?
Windward is a maritime data and analytics company,
bringing unprecedented visibility to the maritime
domain. Windward has built the world's first maritime
data platform, the Windward Mind, which analyzes and
organizes the world's maritime data.

Where does the data come from?
 Maritime Databases
 AIS (Automatic Identification System)
 Port Agent Reports
 Other Sources

VESSEL + AREA STORIES
Windward Mind

Special in Windward’s Domain
Maritime Mind
Data Mining Scope
 Anomaly Detection
◦ Single data point scope
◦ Going into detail
◦ Fraud detection
 Market Trends
◦ Sample / total data scope
◦ Trends
◦ Data sampling problems

Windward Data Flow
External Data Sources → Analytics Layers → Data Output

What is Apache Spark?
 General-purpose cluster computing framework
 Computes in memory (but not only in memory)
 Also fast for heavy operations that run on disk

Basic Terms
 Cluster
 Driver (Master)
 Executors
 Spark Context
 RDD – Resilient Distributed Dataset

Resilient Distributed Datasets
 Are fault tolerant
 Are parallel data structures
 Are immutable
 Can persist intermediate results in memory
 Transformations are operators and are lazily evaluated
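
The immutability and lazy evaluation points can be illustrated without a cluster. What follows is a toy sketch in plain Python, not Spark's API: `ToyRDD` and its methods are hypothetical names chosen only to mirror the ideas on this slide. Transformations record work and return a new object; nothing runs until an action (`collect`) is called.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source: Callable[[], Iterable]):
        # The dataset is described by a recipe, not by materialized data.
        self._source = source

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: returns a NEW ToyRDD; the original is unchanged.
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy, also returns a new object.
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    def collect(self) -> List:
        # Action: only now is the chain of transformations executed.
        return list(self._source())


calls = []  # records when the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD(lambda: range(5))
doubled = numbers.map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has been computed yet
result = doubled.collect()        # the action triggers the computation
```

In this sketch `numbers` can still be collected on its own after `doubled` is built, which is the immutability point: each transformation describes a new dataset instead of modifying the old one.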

RDD Persistence and Partitioning
 Users control which RDDs are reused (in memory and on disk)
◦ Persist, Cache, Unpersist
 Users can define how an RDD is partitioned across machines
 Only the lost partitions of an RDD need to be recomputed upon failure.
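
A toy model of the last point, again in plain Python and not Spark's API: `ToyPartitionedRDD`, `persist`, and `lose_partition` are hypothetical names for illustration. The object keeps its lineage (source partitions plus the transformation), caches computed partitions, and after a simulated failure rebuilds only the partition that went missing.

```python
from typing import Callable, Dict, List


class ToyPartitionedRDD:
    """Toy model of a cached, partitioned dataset (hypothetical, not Spark)."""

    def __init__(self, partitions: List[List[int]], fn: Callable[[int], int]):
        self._inputs = partitions      # lineage: the source partitions
        self._fn = fn                  # lineage: the transformation
        self._cache: Dict[int, List[int]] = {}
        self.computed: List[int] = []  # log of which partitions were computed

    def _compute(self, idx: int) -> List[int]:
        self.computed.append(idx)
        return [self._fn(x) for x in self._inputs[idx]]

    def persist(self) -> "ToyPartitionedRDD":
        # Materialize and keep every partition for reuse.
        for idx in range(len(self._inputs)):
            self._cache[idx] = self._compute(idx)
        return self

    def lose_partition(self, idx: int) -> None:
        # Simulate a machine failure that drops one cached partition.
        self._cache.pop(idx, None)

    def collect(self) -> List[int]:
        out: List[int] = []
        for idx in range(len(self._inputs)):
            if idx not in self._cache:
                # Rebuild only the missing partition from its lineage.
                self._cache[idx] = self._compute(idx)
            out.extend(self._cache[idx])
        return out


rdd = ToyPartitionedRDD([[1, 2], [3, 4], [5, 6]], lambda x: x * x).persist()
rdd.lose_partition(1)
values = rdd.collect()
```

The `computed` log shows each partition computed once during `persist`, then partition 1 alone computed a second time during `collect`.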

Persistence layers for Spark
 Distributed system
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ HBase
 File formats
◦ Text file
◦ SequenceFile
◦ Avro
◦ Parquet
◦ Other Hadoop formats

History Server
 Standalone cluster
 Integrates with both YARN and Mesos
 In Spark Standalone, the history server is embedded in the master.
 On YARN/Mesos, run the history server as a daemon.

Multi Language API Support
 Scala
 Java
 Python
 Clojure

Unified Tools Platform
 Spark SQL
 GraphX
 MLlib (Machine Learning)
 Spark Streaming
 Spark Core

Where do I start?!
 Download Spark as a package
◦ Run it in “local” mode (no need for a real cluster)
◦ Use the “spark-ec2” scripts to spin up a Standalone mode cluster
◦ Amazon Elastic MapReduce (EMR)
 YARN vs. Mesos vs. Standalone

Running Environments
 Development – Testing – Production
◦ Don’t you need more?
◦ Be as flexible as you can
 Cluster Utilization
◦ Unified cluster for all environments
◦ vs. cluster per environment (cluster per data center)
 Configuration
◦ Local Files vs. Distributed

Saving and Maintaining the Data
 Local File System – Not effective in a distributed environment
 HDFS
◦ Might be very expensive
◦ Locality rules – Spark and the HDFS node on the same machine
 S3
◦ High latency and fairly slow, but low cost
 Cassandra
◦ Rigid data model
◦ Very fast and depends on the Volume of the data can be

DevOps – Keep It Simple, Stupid
 Linux
◦ Bash scripts
◦ Crontab
 Automation via Jenkins
 Continuous Deployment – with every Git push
[Diagram: environments Dev, Testing, Live Staging, Production; deployment labels Daily, Manual, Automatic, Automatic]

Build Automation
 Maven
◦ Sonatype Nexus artifact management
 -
◦ Deploy and Script generation scripts
◦ Per Environment Testing
◦ Data Validation
◦ Scheduled Tasks

Workflow Management
 Oozie – Very hard to integrate with Spark
◦ XML configuration based and not that convenient
 Azkaban (Haven’t tried it)
 Chosen:
◦ Luigi
◦ Crontab + Jenkins (KISS again)

Testing
 Unit
◦ JUnit tests that run on the Spark “Functions”
 End to End
◦ Simulate the full execution of an application on a
single JVM (local mode) – Real input, Real output
 Functional
◦ Standalone application
◦ Running on the cluster
◦ Minimal coverage – Shows working data flow
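
The unit-testing level can be sketched as follows. The slide mentions JUnit; this sketch swaps in Python's `unittest` to show the same idea: keep the logic that would be passed to a Spark transformation in a plain function and test it without any cluster. The function `parse_position` and its `vessel_id,lat,lon` input format are hypothetical, invented only for this illustration.

```python
import unittest
from typing import Optional, Tuple


def parse_position(line: str) -> Optional[Tuple[str, float, float]]:
    """Parse 'vessel_id,lat,lon' (hypothetical format); None if malformed."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    try:
        lat, lon = float(parts[1]), float(parts[2])
    except ValueError:
        return None
    # Reject coordinates outside the valid latitude/longitude ranges.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return parts[0], lat, lon


class ParsePositionTest(unittest.TestCase):
    def test_valid_line(self):
        self.assertEqual(parse_position("v1,32.08,34.78"), ("v1", 32.08, 34.78))

    def test_malformed_line(self):
        self.assertIsNone(parse_position("v1,not-a-number,34.78"))

    def test_out_of_range_latitude(self):
        self.assertIsNone(parse_position("v1,95.0,34.78"))


def run_tests() -> unittest.TestResult:
    # Run the suite programmatically so it can be invoked from any script.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePositionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the function has no dependency on a Spark context, the same tests run in an IDE, in a build, or in a scheduled Jenkins job.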

Testing
[Diagram: Dev, Testing, Live Staging, Production]

Logging
 Uses log4j (via slf4j) by default
 How to log correctly:
◦ Separate logs for different applications
◦ Driver and Executors log to different locations
◦ YARN logging also exists (you might find problems there too)
 ELK Stack (Elasticsearch – Logstash – Kibana)
◦ Ship logs with Logstash shippers (intrusive) or a UDP Socket Appender (Log4j2)
◦ DO NOT use the regular TCP Log4J appender
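
The slide does not say why the TCP appender should be avoided. One common reason, stated here as an assumption, is that a blocking TCP connection can stall the application when the log receiver is unavailable, while UDP is fire-and-forget. The sketch below shows that idea with Python's `logging` module instead of Log4j2. `UdpJsonHandler`, `format_event`, and the JSON field names are hypothetical; the `app` and `role` fields reflect the advice to separate logs per application and per driver/executor.

```python
import json
import logging
import socket


def format_event(record: logging.LogRecord, app: str, role: str) -> bytes:
    """Serialize a log record as one JSON line (field names are assumptions)."""
    event = {
        "app": app,    # lets logs be separated per application
        "role": role,  # e.g. "driver" or "executor"
        "level": record.levelname,
        "logger": record.name,
        "message": record.getMessage(),
    }
    return (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")


class UdpJsonHandler(logging.Handler):
    """Fire-and-forget UDP log shipper (hypothetical helper)."""

    def __init__(self, host: str, port: int, app: str, role: str):
        super().__init__()
        self._addr = (host, port)
        self._app, self._role = app, role
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def emit(self, record: logging.LogRecord) -> None:
        try:
            payload = format_event(record, self._app, self._role)
            self._sock.sendto(payload, self._addr)
        except OSError:
            # Never let a logging failure take down the application.
            pass

    def close(self) -> None:
        self._sock.close()
        super().close()
```

A handler like this would be attached with `logging.getLogger(...).addHandler(...)`, pointing at whatever host and port the Logstash UDP input listens on; both are deployment-specific and not given in the slides.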

Reporting and Monitoring
 Graphite
◦ Online application metrics
 Grafana
◦ Good Graphite visualization
 Jenkins - Monitoring
◦ Scheduled tests
◦ Validate result set of the applications
◦ Hung or stuck applications
◦ Failed application
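
A minimal sketch of how an application might report a metric to Graphite and how a scheduled check might flag a hung job. Graphite's plaintext protocol takes one `path value timestamp` line per metric, by default on carbon port 2003. The metric path, the helper names, and the staleness rule are assumptions made for illustration, not part of the original tool box.

```python
import socket
import time
from typing import Optional


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one line of Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(host: str, port: int, path: str, value: float) -> None:
    """Send a single metric to a carbon listener (host/port are deployment-specific)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))


def is_stale(last_update_ts: float, now_ts: float, max_age_s: float) -> bool:
    """Hung/stuck check: True if the job has not reported within max_age_s seconds."""
    return (now_ts - last_update_ts) > max_age_s
```

A scheduled Jenkins job could call something like `is_stale` against the timestamp of an application's latest output and fail the build when it returns `True`, which surfaces stuck applications alongside failed ones.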

Reporting and Monitoring
 Grafana + Graphite - Example

Summary
[Diagram: Cluster; Dev, Testing, Live Staging, Production Env; ELK]

What have we talked about?
 Windward Maritime Mind and Domain
 Spark Framework
◦ Utilizing a distributed computation framework
 Spark to Production Tool Box

Questions?
Any thoughts or ponderings? – Just Ask

Thanks, Resources and Contact
 Demi Ben-Ari
◦ LinkedIn
◦ Twitter: @demibenari
◦ Blog: http://progexc.blogspot.com/
◦ Email: demi.benari@gmail.com
◦ Windward Ltd.
◦ Big Things are Happening Here – Facebook group
◦ Meetup – Big Things
