A vast portion of the data we process is time series data, and once you start working with distributed systems you start tackling many scale and performance problems. Many questions arise:
How do we handle missing data?
Should the system handle both serving and backend processing, or should they be separated?
Which solution will be cheaper? Which gives the best performance for the money?
In this talk we will tell the tale of all of the transformations we’ve made to our data model @Windward, show some of the problems we’ve handled, and review the multiple data persistence layers we use: S3, MongoDB, Apache Cassandra and MySQL.
And I’ll try my best NOT to answer the question “Which one of them is the best?”
Sharing our pain and lessons learned is promised!
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward.
I have over 9 years of experience in building various systems, both near real-time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
I’m a software development groupie, interested in tackling cutting-edge technologies.
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark
1. S3, Cassandra or Outer Space? Dumping Time Series Data using Spark
Demi Ben-Ari
Sr. Software Engineer @ Windward
17.02.2016
2. About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
In the past:
Software Team Leader & Senior Java Software Engineer,
Missile Defense and Alert System – “Ofek” unit – IAF
3. Agenda
Dataflow and Environment
What’s our time series data like?
Where did we start from?
Problems and our Decisions
Conclusion
4. Data Flow Diagram
External Data Source → Data Pipeline (Raw → Parsed) → Entity Resolution Process → Analytics Layers (building insights on top of the entities: Anomaly Detection, Trends) → Data Output Layer
6. Basic Terms
Idempotence is the property of certain operations in mathematics and computer science that can be applied multiple times without changing the result beyond the initial application.
Same input => Same output
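A minimal sketch of what idempotence means for a data pipeline, using a hypothetical writeSlice helper (not actual Windward code): re-running the same job over the same time slice upserts by key instead of appending, so the stored result is identical no matter how many times the job runs.

```scala
// Hypothetical illustration: an idempotent write keyed by (entityId, timeSlice).
// Re-running the job for the same slice replaces the same keys, so the
// stored state is the same after one run or after many.
case class Position(entityId: String, timeSlice: Long, lat: Double, lon: Double)

object IdempotentWrite {
  // In-memory stand-in for a persistence layer (an S3 prefix / Cassandra partition).
  private val store = scala.collection.mutable.Map.empty[(String, Long), Position]

  def writeSlice(records: Seq[Position]): Unit =
    records.foreach(p => store.update((p.entityId, p.timeSlice), p)) // upsert, not append

  def main(args: Array[String]): Unit = {
    val slice = Seq(Position("vessel-1", 201412010000L, 32.1, 34.8))
    writeSlice(slice)
    writeSlice(slice)      // same input => same output
    println(store.size)    // 1 – no duplicates after a re-run
  }
}
```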
7. Basic Terms
Missing parts in time series data:
◦ Data arriving from the satellites might be delayed because of bad transmission
◦ Data vendors delaying the data stream
◦ Calculation in layers may cause “holes” in the data, since the data layers are calculated by time slices
9. The Problem – Receiving Data
Beginning state: no data, and the timeline begins (T = 0).
The diagram shows three entity levels on the timeline: Level 1 Entity, Level 2 Entity, Level 3 Entity.
10. The Problem – Receiving Data
T = 10: Level 1 entities’ data arrives and gets stored, within the computation sliding window.
11. The Problem – Receiving Data
T = 10: Level 2 entities are created on top of Level 1’s data (decreased amount of data), and Level 3 entities are created on top of Level 2’s data (decreased amount of data), all within the computation sliding window.
12. The Problem – Receiving Data
T = 20: a Level 1 entity’s data arrives late. Because of the sliding window’s back size, Level 2 and 3 entities would not be created properly and there would be “holes” in the data.
13. Solution to the Problem
Creating dependent micro-services forming a data pipeline
◦ Mainly Apache Spark applications
◦ Services are only dependent on the data – not on the previous service’s run
Forming a structure and scheduling of a “back sliding window” (see the sketch below)
◦ Know your data and its relevance through time
◦ Don’t try to foresee the future – it might bias the results
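A minimal sketch of the back-sliding-window idea, assuming S3 prefixes of the form s3n://some-bucket/entity1/&lt;yyyyMMddHHmm&gt;/ as shown later in the deck (the bucket, entity name and window size are illustrative): every scheduled run re-reads and recomputes the last N time slices, so late-arriving Level 1 data is folded in on a later run without depending on the previous service’s run.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical back-sliding-window job: recompute the last `backSlices`
// 10-minute slices on every run, so late data is eventually picked up.
object BackSlidingWindowJob {
  val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmm")

  def slicePaths(now: LocalDateTime, backSlices: Int): Seq[String] =
    (0 until backSlices).map { i =>
      val slice = now.minusMinutes(10L * i)
      s"s3n://some-bucket/entity1/${slice.format(fmt)}/"   // assumed layout
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("back-sliding-window"))
    // Re-read every slice in the back window; the job depends only on the data.
    val window = sc.union(slicePaths(LocalDateTime.now(), backSlices = 6).map(sc.textFile(_)))
    // ... recompute Level 2 / Level 3 entities from the raw slices here ...
    println(s"records in back window: ${window.count()}")
    sc.stop()
  }
}
```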
14. How did we start?
Spark Standalone – via the ec2 scripts
◦ Around 5 nodes (r3.xlarge instances)
◦ Didn’t want to keep a persistent HDFS – it costs a lot
◦ 100 GB (per day) => ~150 TB for 4 years
◦ Cost per server per year (r3.xlarge):
On demand: ~$2,900
Reserved: ~$1,750
Know your costs: http://www.ec2instances.info/
15. Decision
Working with S3 as the persistence layer
◦ Pay extra for:
Put ($0.005 per 1,000 requests)
Get ($0.004 per 10,000 requests)
◦ 150 TB => ~$210 for 4 years of data
Same format as HDFS (CSV files) – see the sketch below
◦ s3n://some-bucket/entity1/201412010000/part-00000
◦ s3n://some-bucket/entity1/201412010000/part-00001
◦ ……
20. The Problem
Batch jobs
◦ Should run for 5-10 minutes in total
◦ Actually run for ~40 minutes
Why?
◦ ~20 minutes to write with the Java Mongo driver – async (Unacknowledged; see the sketch below)
◦ ~20 minutes to sync the journal
◦ Total: ~40 minutes of the DB being unavailable
◦ No batch process response and no UI serving
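For context, a minimal sketch (not the actual Windward code) of what an Unacknowledged write with the legacy Java Mongo driver looks like from Scala: the client returns immediately without waiting for the server, but the server still pays the full cost of applying every write and syncing the journal, which is where the ~40 minutes went. The host, database and collection names are made up.

```scala
import com.mongodb.{MongoClient, BasicDBObject, WriteConcern}

// Hypothetical illustration of a fire-and-forget (Unacknowledged) insert.
object UnacknowledgedWriteExample {
  def main(args: Array[String]): Unit = {
    val client = new MongoClient("localhost")                     // assumed host
    val coll   = client.getDB("analytics").getCollection("entities")

    val doc = new BasicDBObject("entityId", "vessel-1")
      .append("timeSlice", 201412010000L)
      .append("score", 0.87)

    coll.insert(doc, WriteConcern.UNACKNOWLEDGED)                  // returns immediately
    client.close()
  }
}
```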
21. Alternative Solutions
Sharded MongoDB (with replica sets)
◦ Pros:
Increases throughput by the number of shards
Increases the availability of the DB
◦ Cons:
Very hard to manage DevOps-wise (for a small team of developers)
High cost of servers – because each shard needs 3 replicas
23. Our DevOps – After that solution
We had no DevOps guy at that time at all
24. The Solution
Migration to Apache Cassandra
Easily create a Cassandra cluster using the DataStax Community AMI on AWS
◦ First easy step – using the spark-cassandra-connector (an easy bootstrap move to Spark + Cassandra; see the sketch below)
◦ Creating a monitoring dashboard for Cassandra
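A minimal sketch of that first step with the spark-cassandra-connector; the connection host, keyspace, table and columns are illustrative assumptions, not the actual Windward schema.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds saveToCassandra / cassandraTable

// Hypothetical example of writing a time slice to Cassandra with the connector.
object CassandraWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-write")
      .set("spark.cassandra.connection.host", "10.0.0.10")   // assumed seed node

    val sc = new SparkContext(conf)

    val slice = sc.parallelize(Seq(
      ("vessel-1", 201412010000L, 32.1, 34.8),
      ("vessel-2", 201412010000L, 31.9, 34.6)
    ))

    // Writes tuples to analytics.positions(entity_id, time_slice, lat, lon).
    slice.saveToCassandra("analytics", "positions",
      SomeColumns("entity_id", "time_slice", "lat", "lon"))

    sc.stop()
  }
}
```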
26. Result
Performance improvement
◦ Batch write parts of the job run in 3 minutes instead of ~40 minutes in MongoDB
Took 2 weeks to go from “zero to hero” and ramp up a running solution that works without glitches
28. Transferring the Heaviest Process
Micro-service that runs every 10 minutes
Writes 30 GB to Cassandra per iteration
◦ (Replication factor 3 => 90 GB)
At first it took us 18 minutes to do all of the writes
◦ Not acceptable in a 10-minute process
30. Transferring the Heaviest Process
Solutions
◦ We chose the i2.xlarge
◦ Optimization of the cluster
◦ Changing the JDK to Java 8
Changing the GC algorithm to G1
◦ Tuning the operating system
ulimit, removing the swap
◦ Write time went down to ~5 minutes (for 30 GB, RF=3) – sounds good, right? I don’t think so
32. The Solution
Taking the same data model that we held in Cassandra (all of the raw data per 10 minutes) and putting it on S3
◦ Write time went down from ~5 minutes to 1.5 minutes
Added another process, not dependent on the main one, that runs every 15 minutes (see the sketch below)
◦ Reads from S3, downscales the amount and writes it to Cassandra for serving
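A minimal sketch of that split, under the same illustrative S3 paths and Cassandra schema as the earlier examples: the heavy job dumps the raw 10-minute slice to S3, and an independent job reads it back, downsamples, and writes only the reduced data to Cassandra for serving.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Hypothetical two-stage pattern: raw dumps go to cheap S3 storage,
// a separate job downsamples and writes only the serving data to Cassandra.
object RawToS3ThenServe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("raw-to-s3-then-serve"))

    // Stage 1 (every 10 minutes): dump the full raw slice to S3.
    val raw = sc.parallelize(Seq("vessel-1,201412010000,32.1,34.8"))
    raw.saveAsTextFile("s3n://some-bucket/raw/201412010000/")

    // Stage 2 (independent, every 15 minutes): read back, keep one record
    // per entity per slice, and write the reduced set to Cassandra.
    val downscaled = sc.textFile("s3n://some-bucket/raw/201412010000/")
      .map(_.split(','))
      .map(f => ((f(0), f(1).toLong), (f(2).toDouble, f(3).toDouble)))
      .reduceByKey((a, _) => a)                       // keep a single sample per key
      .map { case ((id, slice), (lat, lon)) => (id, slice, lat, lon) }

    downscaled.saveToCassandra("analytics", "positions",
      SomeColumns("entity_id", "time_slice", "lat", "lon"))

    sc.stop()
  }
}
```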
33. Conclusion
Always give an estimate of your data
◦ Frequency
◦ Volume
◦ Arrangement of the previous phase
There is no “best” persistence layer
◦ There is the right one for the job
◦ Don’t overload an existing solution