S3, Cassandra or Outer Space? Dumping Time Series Data using Spark
Demi Ben-Ari
Sr. Software Engineer @ Windward
17.02.2016
About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” unit - IAF
Agenda
• Dataflow and Environment
• What’s our time series data like?
• Where did we start from?
• Problems and our Decisions
• Conclusion
Data Flow Diagram
[Diagram labels: External Data Source; Data Pipeline; Parsed; Raw; Analytics Layers; Entity Resolution Process; Building insights on top of the entities; Data Output Layer; Anomaly Detection; Trends]
Environment Description
[Diagram labels: Cluster; Dev; Testing; Live; Staging; Production; Env; OB1K RESTful Java Services]
Basic Terms
• Idempotence is the property of certain operations in mathematics and computer science that can be applied multiple times without changing the result beyond the initial application.
• Same input => Same output
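A minimal sketch of what idempotence can look like for a pipeline write. The in-memory store and the names `write_slice` and `append_records` are illustrative assumptions, not code from the talk. Keying each write by entity and time slice means a retried job overwrites the same slot instead of adding duplicates.

```python
# Hypothetical in-memory stand-in for a persistence layer, keyed by
# (entity, time slice). All names here are illustrative.

def write_slice(store, entity, time_slice, records):
    """Idempotent write: re-running with the same input leaves the same state."""
    store[(entity, time_slice)] = sorted(records)
    return store


def append_records(log, records):
    """Non-idempotent write, for contrast: every re-run adds duplicates."""
    log.extend(records)
    return log


store = {}
write_slice(store, "entity1", "201412010000", [3, 1, 2])
once = dict(store)
write_slice(store, "entity1", "201412010000", [3, 1, 2])  # retry of the same job
twice = dict(store)

log = []
append_records(log, [3, 1, 2])
append_records(log, [3, 1, 2])  # the same retry now doubles the data
```

After the retry, `once` and `twice` are equal, while `log` holds every record twice.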
Basic Terms
• Missing Parts in Time Series Data
  ◦ Data arriving from the satellites
    • May be delayed because of bad transmission
  ◦ Data vendors delaying the data stream
  ◦ Calculation in layers may cause holes in the data
• Calculating the data layers by time slices
So what’s the problem?
The Problem - Receiving Data
T = 0: Beginning state, no data, and the timeline begins.
[Diagram rows: Level 3 Entity; Level 2 Entity; Level 1 Entity]
The Problem - Receiving Data
T = 10: Level 1 entities' data arrives and gets stored.
[Diagram rows: Level 3 Entity; Level 2 Entity; Level 1 Entity. Marker: computation sliding window size]
The Problem - Receiving Data
T = 10: Level 2 entities are created on top of Level 1’s data (decreased amount of data). Level 3 entities are created on top of Level 2’s data (decreased amount of data).
[Diagram rows: Level 3 Entity; Level 2 Entity; Level 1 Entity. Marker: computation sliding window size]
The Problem - Receiving Data
T = 20: Level 1 entity's data arrives late. Because of the sliding window’s back size, Level 2 and Level 3 entities would not be created properly and there would be “holes” in the data.
[Diagram rows: Level 3 Entity; Level 2 Entity; Level 1 Entity. Marker: computation sliding window size]
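The scenario on these slides can be reproduced with a toy simulation. This is only a sketch under assumptions: the records, the window size of 10 time units, and the function name `records_seen` are made up for illustration. Each Level 1 record carries the time it describes and the time it actually arrived.

```python
# Toy simulation of the "holes" scenario. Each Level 1 record is a pair
# (event_time, arrival_time); all values are invented for illustration.

def records_seen(level1, run_time, back_window):
    """Level 1 records that a Level 2 job run at `run_time` can use.

    The job only looks back `back_window` time units and only sees
    records that have already arrived by `run_time`.
    """
    return [
        (event_time, arrival_time)
        for event_time, arrival_time in level1
        if arrival_time <= run_time
        and run_time - back_window <= event_time < run_time
    ]


level1 = [
    (2, 5),    # on time: describes T=2, arrived at T=5
    (7, 18),   # late: describes T=7, arrived only at T=18
    (12, 15),  # on time
]

seen_at_10 = records_seen(level1, run_time=10, back_window=10)
seen_at_20 = records_seen(level1, run_time=20, back_window=10)

# Records that no run ever picked up: these become the "holes".
missed = [r for r in level1 if r not in seen_at_10 + seen_at_20]
```

The late record had not arrived by the run at T = 10 and is already outside the window of the run at T = 20, so it ends up in `missed` and never reaches Level 2.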
Solution to the Problem
• Creating dependent microservices that form a data pipeline
  ◦ Mainly Apache Spark applications
  ◦ Services depend only on the data, not on the previous service’s run
• Forming a structure and a schedule for the “Back Sliding Window”
  ◦ Know your data and its relevance through time
  ◦ Don’t try to foresee the future: it might bias the results
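One way to read the “Back Sliding Window” idea is sketched below. It is an assumption about the mechanics, not the talk's implementation. Every run rebuilds all slices inside a back window that is larger than the expected delay, and each rebuilt slice overwrites the earlier result. The helper names, the sample records, and the window of 20 time units are hypothetical.

```python
# Sketch of a back sliding window: each run recomputes the last few slices
# from whatever data has arrived so far. Records are (event_time, arrival_time).

def slices_to_recompute(run_time, back_window, slice_size=10):
    """Slice start times that a run at `run_time` should rebuild, oldest first."""
    first = max(0, run_time - back_window)
    return list(range(first, run_time, slice_size))


def build_slice(level1, slice_start, run_time, slice_size=10):
    """Rebuild one slice using only data that has arrived by `run_time`."""
    return sorted(
        event_time
        for event_time, arrival_time in level1
        if arrival_time <= run_time
        and slice_start <= event_time < slice_start + slice_size
    )


level1 = [(2, 5), (7, 18), (12, 15)]  # the record for T=7 arrives late, at T=18
output = {}
for run_time in (10, 20):
    for slice_start in slices_to_recompute(run_time, back_window=20):
        # Keyed overwrite: recomputing a slice replaces the earlier result,
        # so re-running is safe as long as the computation is idempotent.
        output[slice_start] = build_slice(level1, slice_start, run_time)
```

The run at T = 20 rebuilds slice 0 and picks up the late record, and no run reads data that arrived after its own run time.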
How did we start?
• Spark Standalone, via the EC2 scripts
  ◦ Around 5 nodes (r3.xlarge instances)
  ◦ Didn’t want to keep a persistent HDFS: it costs a lot
  ◦ 100 GB per day => ~150 TB for 4 years
  ◦ Cost per server per year (r3.xlarge):
    • On demand: ~$2,900
    • Reserved: ~$1,750
• Know your costs: http://www.ec2instances.info/
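The estimate above can be checked with a few lines of arithmetic. The sketch uses only the figures from the slide and makes two assumptions: decimal units (1 TB = 1000 GB) and 365-day years.

```python
# Back-of-the-envelope check of the volume and server cost figures.
GB_PER_DAY = 100           # from the slide
YEARS = 4                  # from the slide
DAYS_PER_YEAR = 365        # assumption: leap days ignored

total_gb = GB_PER_DAY * DAYS_PER_YEAR * YEARS
total_tb = total_gb / 1000  # assumption: decimal units, 1 TB = 1000 GB

ON_DEMAND_PER_YEAR = 2900  # USD per r3.xlarge, approximate figure from the slide
RESERVED_PER_YEAR = 1750   # USD per r3.xlarge, approximate figure from the slide
NODES = 5                  # "around 5 nodes"

on_demand_cluster = ON_DEMAND_PER_YEAR * NODES
reserved_cluster = RESERVED_PER_YEAR * NODES
reserved_saving = 1 - RESERVED_PER_YEAR / ON_DEMAND_PER_YEAR

print(f"{total_tb:.0f} TB over {YEARS} years")
print(f"cluster per year: ${on_demand_cluster} on demand, ${reserved_cluster} reserved")
print(f"reserved saves about {reserved_saving:.0%}")
```

The volume comes out at about 146 TB, which the slide rounds to ~150 TB.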
Decision
• Working with S3 as the persistence layer
  ◦ Pay extra for requests:
    • Put ($0.005 per 1,000 requests)
    • Get ($0.004 per 10,000 requests)
  ◦ 150 TB => ~$210 for 4 years of data
• Same format as HDFS (CSV files)
  ◦ s3n://some-bucket/entity1/201412010000/part-00000
  ◦ s3n://some-bucket/entity1/201412010000/part-00001
  ◦ ……
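The path layout above can be generated from an entity name and a time slice. In this sketch the helper names are hypothetical, and reading the middle path segment as the slice start in `yyyyMMddHHmm` form is an assumption inferred from the example paths.

```python
from datetime import datetime


def slice_prefix(bucket, entity, slice_start):
    """Key prefix for one entity and one time slice (timestamp format assumed)."""
    return f"s3n://{bucket}/{entity}/{slice_start:%Y%m%d%H%M}/"


def part_path(bucket, entity, slice_start, part):
    """Path of a single part file, following the part-NNNNN naming."""
    return f"{slice_prefix(bucket, entity, slice_start)}part-{part:05d}"


prefix = slice_prefix("some-bucket", "entity1", datetime(2014, 12, 1, 0, 0))
path = part_path("some-bucket", "entity1", datetime(2014, 12, 1, 0, 0), 1)
```

Because the prefix is a pure function of the entity and the slice, a re-run of the same slice targets the same location, which fits the idempotence requirement from the Basic Terms slide.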
What about the serving?
MongoDB for Serving
[Diagram labels: Spark Cluster; Master; Worker 1, Worker 2, …, Worker N; MongoDB Replica Set; Write; Read]
Spark Slave - Server Specs
• Instance type: r3.xlarge
• CPUs: 4
• RAM: 30.5 GB
• Storage: ephemeral
• Number of instances: 10+
MongoDB - Server Specs
• MongoDB version: 2.6.1
• Instance type: m3.xlarge (AWS)
• CPUs: 4
• RAM: 15 GB
• Storage: EBS
• DB size: ~500 GB
• Collection indexes: 5 (4 compound)
The Problem
• Batch jobs
  ◦ Should run for 5-10 minutes in total
  ◦ Actually run for ~40 minutes
• Why?
  ◦ ~20 minutes to write with the Java MongoDB driver, asynchronously (unacknowledged writes)
  ◦ ~20 minutes to sync the journal
  ◦ Total: ~40 minutes of the DB being unavailable
  ◦ No batch process response and no UI serving during that time
Alternative Solutions
• Sharded MongoDB (with replica sets)
  ◦ Pros:
    • Increases throughput by the number of shards
    • Increases the availability of the DB
  ◦ Cons:
    • Very hard to manage DevOps-wise (for a small team of developers)
    • High server cost, because each shard needs 3 replicas
Workflow with MongoDB
[Diagram labels: Worker 1, Worker 2, …, Worker N; Spark Cluster; Master; Write; Read; Master]
Our DevOps – After that solution
We had no DevOps guy at that time at all
The Solution
• Migration to Apache Cassandra
• Easily create a Cassandra cluster using the DataStax Community AMI on AWS
  ◦ First easy step: using the spark-cassandra-connector
    • (Easy bootstrap move to Spark with Cassandra)
  ◦ Creating a monitoring dashboard for Cassandra
Workflow with Cassandra
[Diagram labels: Worker 1, Worker 2, …, Worker N; Cassandra Cluster; Spark Cluster; Write; Read]
Result
• Performance improvement
  ◦ The batch write parts of the job run in 3 minutes instead of ~40 minutes in MongoDB
• Took 2 weeks to go from “Zero to Hero” and ramp up a running solution that works without glitches
So what’s the problem?
Transferring the Heaviest Process
• A microservice that runs every 10 minutes
• Writes 30 GB to Cassandra per iteration
  ◦ (Replication factor 3 => 90 GB)
• At first it took us 18 minutes to do all of the writes
  ◦ Not acceptable in a 10-minute process
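The numbers on this slide translate into a rough throughput requirement. The sketch below assumes decimal units and treats the 90 GB as a cluster-wide total after replication, ignoring compaction and commit log overhead, so the rates are only indicative.

```python
# Rough write-rate arithmetic for the heaviest process.
PAYLOAD_GB = 30           # written per iteration (from the slide)
REPLICATION_FACTOR = 3    # from the slide
INTERVAL_MIN = 10         # the service runs every 10 minutes
OBSERVED_WRITE_MIN = 18   # initial write time (from the slide)

replicated_gb = PAYLOAD_GB * REPLICATION_FACTOR


def mb_per_second(gigabytes, minutes):
    """Average rate, assuming decimal units (1 GB = 1000 MB)."""
    return gigabytes * 1000 / (minutes * 60)


observed_rate = mb_per_second(replicated_gb, OBSERVED_WRITE_MIN)
required_rate = mb_per_second(replicated_gb, INTERVAL_MIN)

# Every iteration overruns its slot by this many minutes, so runs pile up.
overrun_per_run_min = OBSERVED_WRITE_MIN - INTERVAL_MIN
```

At 18 minutes the cluster absorbs roughly 83 MB/s in total, while merely finishing within the 10-minute interval would need about 150 MB/s.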
Cluster on OpsCenter - Before
Transferring the Heaviest Process
• Solutions
  ◦ We chose the i2.xlarge instance type
  ◦ Optimization of the cluster
  ◦ Changing the JDK to Java 8
    • Changing the GC algorithm to G1
  ◦ Tuning the operating system
    • ulimit, removing the swap
  ◦ Write time went down to ~5 minutes (for 30 GB, RF=3). Sounds good, right? I don’t think so
CloudWatch After Tuning
The Solution
• Took the same data model that we held in Cassandra (all of the raw data per 10 minutes) and put it on S3
  ◦ Write time went down from ~5 minutes to 1.5 minutes
• Added another process, not dependent on the main one, that runs every 15 minutes
  ◦ Reads from S3, downscales the amount of data, and writes it to Cassandra for serving
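The decoupled second process can be pictured as a small reduction step between the raw store and the serving store. The talk does not say how the data is downscaled, so the rule used here (keep the latest record per entity per 15-minute bucket) is purely an assumption, as are the record layout and the names.

```python
# Sketch of a downscaling step: raw records in, fewer serving rows out.
# Records are (entity_id, minute, value) tuples invented for illustration.

def downscale(records, bucket_minutes=15):
    """Keep the latest record per entity per time bucket (assumed rule)."""
    latest = {}
    for entity_id, minute, value in records:
        key = (entity_id, minute // bucket_minutes)
        if key not in latest or minute >= latest[key][1]:
            latest[key] = (entity_id, minute, value)
    return sorted(latest.values())


# Stand-in for the raw data that the main process wrote to S3.
raw_from_s3 = [
    ("entity-a", 1, 10.0),
    ("entity-a", 9, 11.5),
    ("entity-a", 16, 12.0),
    ("entity-b", 3, 7.0),
]

# Stand-in for the rows that would be written to Cassandra for serving.
serving_rows = downscale(raw_from_s3)
```

Since this step only reads what is already on S3, a slow or failed serving write does not hold up the main 10-minute process.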
Conclusion
• Always estimate your data
  ◦ Frequency
  ◦ Volume
  ◦ Arrangement of the previous phase
• There is no “best” persistence layer
  ◦ There is the right one for the job
  ◦ Don’t overload an existing solution
Questions?
Thanks,
Resources and Contact
• Demi Ben-Ari
  ◦ LinkedIn
  ◦ Twitter: @demibenari
  ◦ Blog: http://progexc.blogspot.com/
  ◦ Email: demi.benari@gmail.com
  ◦ “Big Things” Community
    • Meetup, YouTube, Facebook, Twitter

S3 cassandra or outer space? dumping time series data using spark

  • 1. S3, Cassandra or Outer Space? Dumping Time Series Data using Spark Demi Ben-Ari Sr. Software Engineer @ Windward 17.02.2016
  • 2. About me Demi Ben-Ari Senior Software Engineer at Windward Ltd. BS’c Computer Science – Academic College Tel-Aviv Yaffo In the Past: Software Team Leader & Senior Java Software Engineer, Missile defense and Alert System - “Ofek” unit - IAF
  • 3. Agenda  Dataflow and Environment  What’s our time series data like?  Where we started from?  Problems and our Decisions  Conclusion
  • 4. Data Flow Diagram Externa l Data Source Analytics Layers Data Pipeline Parsed Raw Entity Resolution Process Building insights on top of the entities Data Output Layer Anomaly Detection Trends
  • 6. Basic Terms  Idempotence is the property of certain operations in mathematics and computer science, that can be applied multiple times without changing the result beyond the initial application.  Same input => Same output
  • 7. Basic Terms  Missing Parts in Time Series Data ◦ Data arriving from the satellites  Might be causing delays because of bad transmission ◦ Data vendors delaying the data stream ◦ Calculation in Layers may cause Holes in the Data  Calculating the Data layers by time slices
  • 8. So what’s the problem?
  • 9. The Problem - Receiving DATA Beginning state, no data, and the time line begins T = 0 Level 3 Entity Level 2 Entity Level 1 Entity
  • 10. The Problem - Receiving DATA T = 10 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Level 1 entities data arrives and gets stored
  • 11. The Problem - Receiving DATA T = 10 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Level 3 entities are created on top of Level 2’s Data (Decreased amount of data) Level 2 entities are created on top of Level 1’s Data (Decreased amount of data)
  • 12. The Problem - Receiving DATA T = 20 Level 3 Entity Level 2 Entity Level 1 Entity Computation sliding window size Because of the sliding window’s back size, level 2 and 3 entities would not be created properly and there would be “Holes” in the Data Level 1 entity's data arriving late
  • 13. Solution to the Problem • Creating Dependent Micro services forming a data pipeline ◦ Mainly Apache Spark applications ◦ Services are only dependent on the Data - not the previous service’s run • Forming a structure and scheduling of “Back Sliding Window” ◦ Know your data and its relevance through time ◦ Don’t try to foresee the future – it might Bias the results
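One way to read the “Back Sliding Window” idea is that each run recomputes the last few closed time slices, so that late-arriving Level 1 data can still fill holes, and never touches the slice that is still open. The sketch below is an assumption about that scheduling, not the talk's implementation; `slices_to_recompute`, `slice_minutes` and `back_slices` are hypothetical names, and the slice size is assumed to divide a day evenly.

```python
from datetime import datetime, timedelta
from typing import List


def slices_to_recompute(now: datetime, slice_minutes: int, back_slices: int) -> List[datetime]:
    """Return start times of the last `back_slices` closed slices, oldest first."""
    # Align `now` down to the start of the current (still open) slice.
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    minutes_since_midnight = now.hour * 60 + now.minute
    open_slice_start = midnight + timedelta(
        minutes=(minutes_since_midnight // slice_minutes) * slice_minutes
    )
    # Only look backwards: the open slice and anything later is excluded.
    return [
        open_slice_start - timedelta(minutes=slice_minutes * i)
        for i in range(back_slices, 0, -1)
    ]


window = slices_to_recompute(datetime(2014, 12, 1, 0, 27), slice_minutes=10, back_slices=3)
```

Because each slice is recomputed from the stored data alone, this only works safely when the writes for a slice are idempotent.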
  • 14. How we started? • Spark Standalone – via ec2 scripts ◦ Around 5 nodes (r3.xlarge instances) ◦ Didn’t want to keep a persistent HDFS – Costs a lot ◦ 100 GB (per day) => ~150 TB for 4 years ◦ Cost for server per year (r3.xlarge): • On demand: ~$2,900 • Reserved: ~$1,750 • Know your costs: http://www.ec2instances.info/
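The volume and server figures on this slide can be checked with a few lines of arithmetic. The sketch assumes 365-day years and decimal units (1 TB = 1000 GB); the function names are illustrative.

```python
def total_volume_tb(gb_per_day: float, years: int, days_per_year: int = 365) -> float:
    """Total stored volume in TB, assuming 1 TB = 1000 GB."""
    return gb_per_day * days_per_year * years / 1000


def yearly_cluster_cost(nodes: int, cost_per_node: float) -> float:
    """Yearly compute cost for a cluster of identical nodes."""
    return nodes * cost_per_node


volume_tb = total_volume_tb(100, 4)            # 146.0, which the slide rounds to ~150 TB
on_demand = yearly_cluster_cost(5, 2900)       # 5 on-demand r3.xlarge nodes
reserved = yearly_cluster_cost(5, 1750)        # 5 reserved r3.xlarge nodes
```

These numbers cover compute only; keeping ~150 TB on a persistent HDFS would add storage nodes on top, which is the cost the slide is trying to avoid.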
  • 15. Decision • Working with S3 as the persistence layer ◦ Pay extra for • Put ($0.005 per 1,000 requests) • Get ($0.004 per 10,000 requests) ◦ 150TB => ~$210 for 4 years of Data • Same format as HDFS (CSV files) ◦ s3n://some-bucket/entity1/201412010000/part-00000 ◦ s3n://some-bucket/entity1/201412010000/part-00001 ◦ ……
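The path layout shown above can be generated from a slice start time. This is a sketch under the assumption that the timestamp directory is formatted as year, month, day, hour and minute (inferred from `201412010000`); `slice_path` is a hypothetical helper.

```python
from datetime import datetime


def slice_path(bucket: str, entity: str, slice_start: datetime, part: int) -> str:
    """Build an s3n path of the form bucket/entity/yyyyMMddHHmm/part-NNNNN."""
    # Assumption: the directory name is the slice start formatted as %Y%m%d%H%M.
    return f"s3n://{bucket}/{entity}/{slice_start:%Y%m%d%H%M}/part-{part:05d}"


first_part = slice_path("some-bucket", "entity1", datetime(2014, 12, 1, 0, 0), 0)
```

Keying each directory by entity and time slice means a re-run of a slice overwrites the same location, which fits the idempotence requirement from the Basic Terms slide.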
  • 16. What about the serving?
  • 17. MongoDB for Serving [Diagram: Spark Cluster (Master, Worker 1 … Worker N) writing to and reading from a MongoDB Replica Set]
  • 18. Spark Slave - Server Specs • Instance Type: r3.xlarge • CPUs: 4 • RAM: 30.5GB • Storage: ephemeral • Amount: 10+
  • 19. MongoDB - Server Specs • MongoDB version: 2.6.1 • Instance Type: m3.xlarge (AWS) • CPUs: 4 • RAM: 15GB • Storage: EBS • DB Size: ~500GB • Collection Indexes: 5 (4 compound)
  • 20. The Problem • Batch jobs ◦ Should run for 5-10 minutes in total ◦ Actual - runs for ~40 minutes • Why? ◦ ~20 minutes to write with the Java mongo driver – Async (Unacknowledged) ◦ ~20 minutes to sync the journal ◦ Total: ~40 minutes of the DB being unavailable ◦ No batch process response and no UI serving
  • 21. Alternative Solutions • Sharded MongoDB (With replica sets) ◦ Pros: • Increases Throughput by the amount of shards • Increases the availability of the DB ◦ Cons: • Very hard to manage DevOps wise (for a small team of developers) • High cost of servers – because each shard needs 3 replicas
  • 22. Workflow with MongoDB [Diagram: Spark Cluster (Master, Worker 1 … Worker N); Write and Read arrows; Master]
  • 23. Our DevOps – After that solution We had no DevOps guy at that time at all :)
  • 24. The Solution • Migration to Apache Cassandra • Easily create a Cassandra cluster using the DataStax Community AMI on AWS ◦ First easy step – Using the spark-cassandra-connector • (Easy bootstrap move to Spark → Cassandra) ◦ Creating a monitoring dashboard for Cassandra
  • 25. Workflow with Cassandra [Diagram: Spark Cluster (Worker 1 … Worker N) writing to and reading from a Cassandra Cluster]
  • 26. Result • Performance improvement ◦ Batch write parts of the job run in 3 minutes instead of ~40 minutes in MongoDB • Took 2 weeks to go from “Zero to Hero”, and to ramp up a running solution that works without glitches
  • 27. So what’s the problem?
  • 28. Transferring the Heaviest Process • Micro service that runs every 10 minutes • Writes 30GB to Cassandra per iteration ◦ (Replication factor 3 => 90GB) • At first it took us 18 minutes to do all of the writes ◦ Not acceptable in a 10 minute process
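The gap between the observed and the required write rate follows directly from the numbers on this slide. A small sketch of that arithmetic (the function name is illustrative):

```python
def cluster_write_rate_gb_per_min(payload_gb: float, replication_factor: int, minutes: float) -> float:
    """Cluster-wide write rate needed to persist every replica within `minutes`."""
    return payload_gb * replication_factor / minutes


observed = cluster_write_rate_gb_per_min(30, 3, 18)   # what 18 minutes per iteration implies
required = cluster_write_rate_gb_per_min(30, 3, 10)   # what a 10 minute cycle demands
```

The cluster was sustaining 5 GB per minute across all replicas, while finishing inside the 10 minute cycle needs 9 GB per minute, and in practice more, since the job also has to read and compute within the same window.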
  • 30. Transferring the Heaviest Process • Solutions ◦ We chose the i2.xlarge ◦ Optimization of the Cluster ◦ Changing the JDK to Java 8 • Changing the GC algorithm to G1 ◦ Tuning the operating system • ulimit, removing the swap ◦ Write time went down to ~5 minutes (For 30GB RF=3) – Sounds good right? I don’t think so
  • 32. The Solution • Taking the same Data Model that we held in Cassandra (All of the Raw data per 10 minutes) and putting it on S3 ◦ Write time went down from ~5 minutes to 1.5 minutes • Added another process, not dependent on the main one, that runs every 15 minutes ◦ Reads from S3, downscales the amount of data and writes it to Cassandra for serving
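The talk does not say how the data is downscaled before it is written to Cassandra. As one possible illustration only, the sketch below keeps the latest record per entity per time bucket; the record layout `(entity_id, epoch_seconds, value)` and the function `downscale` are assumptions, not the author's method.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (entity_id, epoch_seconds, value).
Record = Tuple[str, int, float]


def downscale(records: Iterable[Record], bucket_seconds: int) -> List[Record]:
    """Keep only the latest record per entity within each time bucket."""
    latest: Dict[Tuple[str, int], Record] = {}
    for rec in records:
        entity_id, ts, _value = rec
        key = (entity_id, ts // bucket_seconds)
        # A later timestamp in the same bucket replaces the earlier one.
        if key not in latest or ts >= latest[key][1]:
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: (r[0], r[1]))


raw = [("a", 0, 1.0), ("a", 30, 2.0), ("a", 70, 3.0), ("b", 10, 4.0)]
serving_rows = downscale(raw, bucket_seconds=60)
```

Whatever reduction is used, running it as a separate process that depends only on the data in S3 keeps the heavy raw write path out of the serving database.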
  • 33. Conclusion • Always give an estimate to your data ◦ Frequency ◦ Volume ◦ Arrangement of the previous phase • There is no “Best” persistence layer ◦ There is the right one for the job ◦ Don’t overload an existing solution
  • 35. Thanks, Resources and Contact • Demi Ben-Ari ◦ LinkedIn ◦ Twitter: @demibenari ◦ Blog: http://progexc.blogspot.com/ ◦ Email: demi.benari@gmail.com ◦ “Big Things” Community • Meetup, YouTube, Facebook, Twitter