SMACK Architectures
Building data processing platforms with
Spark, Mesos, Akka, Cassandra and Kafka
Anton Kirillov Big Data AW Meetup
Sep 2015
Who is this guy?
● Scala programmer
● Focused on distributed systems
● Building data platforms with SMACK/Hadoop
● Ph.D. in Computer Science
● Big Data engineer/consultant at Big Data AB
● Currently at Ooyala Stockholm (Videoplaza AB)
● Working with startups
Roadmap
● SMACK stack overview
● Storage layer layout
● Fixing NoSQL limitations
● Cluster resource management
● Reliable scheduling and execution
● Data ingestion options
● Preparing for failures
SMACK Stack
● Spark - fast and general engine for distributed, large-scale data
processing
● Mesos - cluster resource management system that provides efficient
resource isolation and sharing across distributed applications
● Akka - a toolkit and runtime for building highly concurrent, distributed,
and resilient message-driven applications on the JVM
● Cassandra - distributed, highly available database designed to handle
large amounts of data across multiple datacenters
● Kafka - a high-throughput, low-latency distributed messaging system
designed for handling real-time data feeds
Storage Layer: Cassandra
● optimized for heavy write loads
● tunable trade-off between consistency and availability (the C and A of CAP)
● linearly scalable
● cross-datacenter replication (XDCR) support
● easy cluster resizing and inter-DC data migration
Cassandra Data Model
● nested sorted map
● should be optimized for read queries
● data is distributed across nodes by partition key
CREATE TABLE campaign(
  id uuid,
  year int,
  month int,
  day int,
  views bigint,
  clicks bigint,
  PRIMARY KEY (id, year, month, day)
);
INSERT INTO campaign(id, year, month, day, views, clicks)
  VALUES(40b08953-a…, 2015, 9, 10, 1000, 42);
SELECT views, clicks FROM campaign
  WHERE id=40b08953-a… AND year=2015 AND month>8;
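The "nested sorted map" view of the campaign table can be sketched with plain Scala collections. This is only an in-memory analogy, not how Cassandra stores data. The names NestedSortedMapSketch, Counters, insert and rowsAfterMonth are hypothetical, and the partition key is a String rather than a uuid to keep the sketch short.

```scala
import scala.collection.immutable.TreeMap

object NestedSortedMapSketch {
  // Clustering key (year, month, day); Int tuples are ordered lexicographically.
  type ClusteringKey = (Int, Int, Int)
  final case class Counters(views: Long, clicks: Long)

  // Outer map: partition key -> partition. Inner map: rows sorted by clustering key.
  type Table = Map[String, TreeMap[ClusteringKey, Counters]]

  // Analogue of INSERT: writing the same key again simply overwrites the row.
  def insert(table: Table, id: String, key: ClusteringKey, value: Counters): Table = {
    val partition = table.getOrElse(id, TreeMap.empty[ClusteringKey, Counters])
    table.updated(id, partition.updated(key, value))
  }

  // Analogue of: WHERE id = ? AND year = ? AND month > ?
  // One partition lookup followed by a contiguous range scan over sorted rows.
  def rowsAfterMonth(table: Table, id: String, year: Int, month: Int): List[(ClusteringKey, Counters)] =
    table.get(id).toList.flatMap { partition =>
      partition
        .range((year, month + 1, Int.MinValue), (year + 1, Int.MinValue, Int.MinValue))
        .toList
    }
}
```

The query touches a single partition and reads a contiguous slice of it, which is why the table layout is chosen to match the read query.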
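Spark/Cassandra Example
● calculate total views per campaign for a given month, for all campaigns
CREATE TABLE event(
  id uuid,
  ad_id uuid,
  campaign uuid,
  ts bigint,
  type text,
  PRIMARY KEY(id)
);
import java.util.UUID
import org.apache.spark.SparkContext
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
val sc = new SparkContext(conf)
case class Event(id: UUID, ad_id: UUID, campaign: UUID, ts: Long, `type`: String)
// checkMonth is a user-defined predicate on the event timestamp (not shown on the slide)
sc.cassandraTable[Event]("keyspace", "event")
  .filter(e => e.`type` == "view" && checkMonth(e.ts)) // keep views from the given month
  .map(e => (e.campaign, 1))
  .reduceByKey(_ + _) // total views per campaign
  .collect()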
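The same filter/map/reduce-by-key shape can be tried on local Scala collections without a cluster. This is a sketch under stated assumptions: it uses no Spark or Cassandra APIs, the two-argument checkMonth below is a hypothetical stand-in for the helper the slide leaves undefined, and ts is assumed to be epoch milliseconds interpreted in UTC.

```scala
import java.time.{Instant, YearMonth, ZoneOffset}

object ViewsPerCampaignSketch {
  // Simplified event: String ids instead of UUIDs, ad_id omitted.
  final case class Event(id: String, campaign: String, ts: Long, `type`: String)

  // Hypothetical helper: true when the timestamp falls in the given month (UTC).
  def checkMonth(ts: Long, month: YearMonth): Boolean =
    YearMonth.from(Instant.ofEpochMilli(ts).atZone(ZoneOffset.UTC)) == month

  // Local analogue of filter -> map -> reduceByKey(_ + _).
  def viewsPerCampaign(events: Seq[Event], month: YearMonth): Map[String, Int] =
    events
      .filter(e => e.`type` == "view" && checkMonth(e.ts, month))
      .map(e => (e.campaign, 1))
      .groupBy(_._1)
      .map { case (campaign, ones) => campaign -> ones.map(_._2).sum }
}
```

Because the event table is keyed by id only, the Spark version has to scan every event. That full scan is the cost the later slides on pre-aggregation try to avoid.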
Naive Lambda example with Spark SQL
case class CampaignReport(id: String, views: Long, clicks: Long)
sql("""SELECT campaign.id as id, campaign.views as views,
              campaign.clicks as clicks, event.type as type
       FROM campaign
       JOIN event ON campaign.id = event.campaign
    """).rdd
  .groupBy(row => row.getAs[String]("id"))
  .map { case (id, rows) =>
    // stored totals are repeated on every joined row, so read them from the first one
    val views = rows.head.getAs[Long]("views")
    val clicks = rows.head.getAs[Long]("clicks")
    // count raw events per type; a campaign may have no events of a given type
    val res = rows.groupBy(row => row.getAs[String]("type")).mapValues(_.size)
    CampaignReport(id, views = views + res.getOrElse("view", 0),
      clicks = clicks + res.getOrElse("click", 0))
  }.saveToCassandra("keyspace", "campaign_report")
Let’s take a step back: Spark Basics
● RDD operations (transformations and actions) form a DAG
● the DAG is split into stages of tasks, which are then submitted to the cluster manager
● each stage combines tasks that don’t require shuffling/repartitioning
● tasks run on workers, and the results are returned to the client
Architecture of Spark/Cassandra Clusters
Separate Write & Analytics:
● clusters can be scaled independently
● data is replicated by Cassandra asynchronously
● Analytics has different Read/Write load patterns
● Analytics contains additional data and processing results
● Spark resource impact is limited to only one DC
To take full advantage of the Spark-C* connector's data locality awareness,
Spark workers should be collocated with Cassandra nodes
Spark Applications Deployment Revisited
Cluster Manager:
● Spark Standalone
● YARN
● Mesos
Managing Cluster Resources: Mesos
● heterogeneous workloads
● full cluster utilization
● static vs. dynamic resource allocation
● fault tolerance and disaster recovery
● single resource view at datacenter level
image source: http://www.slideshare.net/caniszczyk/apache-mesos-at-twitter-texas-linuxfest-2014
Mesos Architecture Overview
● leader election and service discovery via ZooKeeper
● slaves publish available resources to master
● master sends resource offers to frameworks
● scheduler replies with tasks and resources needed per task
● master sends tasks to slaves
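The offer cycle described above can be illustrated with a small, purely in-memory model of the framework scheduler's side. This is a sketch under stated assumptions and does not use the Mesos API: Offer, Task, Launch and schedule are hypothetical names, and resources are reduced to CPUs and memory.

```scala
object ResourceOfferSketch {
  final case class Offer(slave: String, cpus: Double, memMb: Int)
  final case class Task(name: String, cpus: Double, memMb: Int)
  final case class Launch(task: Task, slave: String)

  // Given the offers forwarded by the master, place each pending task on the
  // first offer with enough resources. Returns the launches to send back and
  // the tasks that must wait for a later round of offers.
  def schedule(offers: List[Offer], pending: List[Task]): (List[Launch], List[Task]) = {
    val start = (List.empty[Launch], List.empty[Task], offers)
    val (launches, unplaced, _) = pending.foldLeft(start) {
      case ((placed, waiting, remaining), task) =>
        remaining.indexWhere(o => o.cpus >= task.cpus && o.memMb >= task.memMb) match {
          case -1 => (placed, waiting :+ task, remaining)
          case i =>
            val offer = remaining(i)
            // shrink the offer by what the task consumes
            val shrunk = offer.copy(cpus = offer.cpus - task.cpus, memMb = offer.memMb - task.memMb)
            (placed :+ Launch(task, offer.slave), waiting, remaining.updated(i, shrunk))
        }
    }
    (launches, unplaced)
  }
}
```

The point of the model is the division of labour: the master only advertises resources, while each framework decides which tasks to run on which offer.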
Bringing Spark, Mesos and Cassandra Together
Deployment example
● Mesos Masters and ZooKeepers collocated
● Mesos Slaves and Cassandra nodes collocated to enforce better data locality for Spark
● Spark binaries deployed to all worker nodes and spark-env is configured
● Spark Executor JAR uploaded to S3
Invocation example
spark-submit --class io.datastrophic.SparkJob /etc/jobs/spark-jobs.jar
Marathon
● execution of long-running tasks
● HA mode with ZooKeeper
● Docker executor
● REST API
Chronos
● distributed cron
● HA mode with ZooKeeper
● supports graphs of jobs
● sensitive to network failures
More Mesos frameworks
● Hadoop
● Cassandra
● Kafka
● Myriad: YARN on Mesos
● Storm
● Samza
Data ingestion: endpoints to consume the data
Endpoint requirements:
● high throughput
● resiliency
● easy scalability
● back pressure
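Back pressure, the last requirement above, can be illustrated with a bounded buffer that rejects new messages when it is full instead of growing without limit. This is a minimal sketch built on the JDK's ArrayBlockingQueue, not on Akka or akka-streams. BoundedEndpoint, Accepted and Rejected are hypothetical names.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackPressureSketch {
  sealed trait Ack
  case object Accepted extends Ack
  case object Rejected extends Ack // the caller should slow down or retry later

  final class BoundedEndpoint[A](capacity: Int) {
    private val buffer = new ArrayBlockingQueue[A](capacity)

    // Non-blocking: offer returns false when the queue is full,
    // which is surfaced to the producer as Rejected.
    def receive(message: A): Ack =
      if (buffer.offer(message)) Accepted else Rejected

    // The downstream consumer takes one message if present, freeing capacity.
    def drainOne(): Option[A] = Option(buffer.poll())
  }
}
```

An HTTP endpoint built this way could map Rejected to an error status so that clients back off. That mapping is a design choice for the sketch, not something the slides prescribe.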
Akka features
● actor model implementation for the JVM
● message-based and asynchronous
● no shared mutable state
● easy scalability from one process to a cluster of machines
● actor hierarchies with parental supervision
● more than a concurrency framework:
  ○ akka-http
  ○ akka-streams
  ○ akka-persistence
import akka.actor.{Actor, ActorLogging, Props}
import scala.util.{Failure, Success, Try}
// Json.parse(...).as[Event] is the Play JSON API; an implicit Reads[Event] is assumed to be in scope
class JsonParserActor extends Actor with ActorLogging {
  def receive = {
    case s: String => Try(Json.parse(s).as[Event]) match {
      case Failure(ex) => log.error(ex, "failed to parse event")
      case Success(event) => sender() ! event // reply with the parsed event
    }
  }
}
class HttpActor extends Actor {
  def receive = {
    case req: HttpRequest =>
      // child actors are created through the actor's own context
      context.actorOf(Props[JsonParserActor]) ! req.body
    case e: Event =>
      context.actorOf(Props[CassandraWriterActor]) ! e
  }
}
Writing to Cassandra with Akka
import akka.actor.{Actor, ActorLogging}
import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}
import scala.util.{Failure, Success, Try}
class CassandraWriterActor extends Actor with ActorLogging {
  //for demo purposes, session initialized here
  val session = Cluster.builder()
    .addContactPoint("cassandra.host")
    .build()
    .connect()
  override def receive: Receive = {
    case event: Event =>
      val statement = new SimpleStatement(event.createQuery)
        .setConsistencyLevel(ConsistencyLevel.QUORUM)
      Try(session.execute(statement)) match {
        case Failure(ex) => log.error(ex, "write to Cassandra failed") //error handling code
        case Success(_) => sender() ! WriteSuccessfull
      }
  }
}
Cassandra meets Batch Processing
● writing raw data (events) to Cassandra with Akka is easy
● but the computation time of aggregations/rollups grows with the amount of data
● Cassandra is designed for fast serving rather than batch processing, so pre-aggregation of incoming data is needed
● actors are not well suited to aggregation because of their stateless design model
● micro-batches partially solve the problem
● reliable storage for raw data is still needed
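Pre-aggregation over micro-batches can be sketched with plain Scala collections: each batch of raw events is collapsed into one rollup per campaign, and only the rollups are merged into running totals. This is an illustration under stated assumptions, not the author's implementation. Event, Rollup, aggregate, merge and run are hypothetical names, and batching is by a fixed element count rather than by time.

```scala
object PreAggregationSketch {
  final case class Event(campaign: String, `type`: String)
  final case class Rollup(views: Long, clicks: Long) {
    def +(other: Rollup): Rollup = Rollup(views + other.views, clicks + other.clicks)
  }

  // Collapse one micro-batch into a single rollup per campaign.
  def aggregate(batch: Seq[Event]): Map[String, Rollup] =
    batch.groupBy(_.campaign).map { case (campaign, events) =>
      campaign -> Rollup(events.count(_.`type` == "view"), events.count(_.`type` == "click"))
    }

  // Merge a batch rollup into the running totals: one update per campaign,
  // however many raw events the batch contained.
  def merge(totals: Map[String, Rollup], delta: Map[String, Rollup]): Map[String, Rollup] =
    delta.foldLeft(totals) { case (acc, (campaign, rollup)) =>
      acc.updated(campaign, acc.get(campaign).fold(rollup)(_ + rollup))
    }

  // Split the input into fixed-size batches and fold the batch rollups together.
  def run(events: Seq[Event], batchSize: Int): Map[String, Rollup] =
    events.grouped(batchSize).map(aggregate).foldLeft(Map.empty[String, Rollup])(merge)
}
```

The serving store then receives a few counter updates per batch instead of one write per event, while the raw events still need a reliable home of their own, as the last bullet notes.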
Kafka: distributed commit log
● pre-aggregation of incoming data
● consumers read data in batches
● a comparable managed service is available on AWS as Kinesis
Publishing to Kafka with Akka Http
// kafka.producer.{Producer, ProducerConfig, KeyedMessage} is the Scala producer API that KeyedMessage belongs to
val producerConfig = new ProducerConfig(KafkaConfig())
lazy val producer = new Producer[String, String](producerConfig)
val topic = "raw_events"
val routes: Route = {
  post {
    decodeRequest {
      entity(as[String]) { str =>
        JsonParser.parse(str).validate[Event] match {
          case _: JsSuccess[_] =>
            producer.send(new KeyedMessage[String, String](topic, str))
            complete(StatusCodes.OK)
          case e: JsError =>
            complete(StatusCodes.BadRequest -> JsError.toFlatJson(e).toString())
        }
      }
    }
  }
}
object AkkaHttpMicroservice extends App with Service {
  // config is assumed to be the application configuration provided by Service
  Http().bindAndHandle(routes, config.getString("http.interface"), config.getInt("http.port"))
}
Spark Streaming
● variety of data sources
● at-least-once semantics
● exactly-once semantics available with Kafka Direct and idempotent storage
Spark Streaming: Kinesis example
val ssc = new StreamingContext(conf, Seconds(10))
val kinesisStream = KinesisUtils.createStream(ssc, appName, streamName,
  endpointURL, regionName, InitialPositionInStream.LATEST,
  Duration(checkpointInterval), StorageLevel.MEMORY_ONLY)
//transforming given stream to Event and saving to C*
kinesisStream.map(JsonUtils.byteArrayToEvent)
  .saveToCassandra(keyspace, table)
ssc.start()
ssc.awaitTermination()
Designing for Failure: Backups and Patching
● be prepared for failures and broken data
● design backup and patching strategies up front
● idempotence should be enforced
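Idempotence here means that replaying the same data, for example when restoring a backup or reprocessing after a failure, leaves the store in the same state. A minimal in-memory sketch follows, assuming writes are upserts keyed by event id, as with the event table's PRIMARY KEY(id). IdempotentWriteSketch, write and viewCount are hypothetical names.

```scala
object IdempotentWriteSketch {
  final case class Event(id: String, campaign: String, `type`: String)

  // Upsert keyed by event id: replaying an event overwrites the identical row
  // instead of adding a second one.
  def write(store: Map[String, Event], events: Seq[Event]): Map[String, Event] =
    events.foldLeft(store)((acc, e) => acc.updated(e.id, e))

  // Derived value computed from the store; unaffected by replays.
  def viewCount(store: Map[String, Event], campaign: String): Int =
    store.values.count(e => e.campaign == campaign && e.`type` == "view")
}
```

By contrast, incrementing a counter for every incoming event is not idempotent: a replayed batch would be counted twice.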
Restoring backup from S3
val sc = new SparkContext(conf)
sc.textFile("s3n://bucket/2015/*/*.gz")
  .map(s => Try(JsonUtils.stringToEvent(s)))
  .filter(_.isSuccess).map(_.get) // skip records that fail to parse
  .saveToCassandra(config.keyspace, config.table)
The big picture
So what is SMACK?
● a concise toolbox for a wide variety of data processing scenarios
● battle-tested and widely used software with large communities
● easy scalability and replication of data while preserving low latencies
● unified cluster management for heterogeneous loads
● a single platform for any kind of application
● an implementation platform for different architecture designs
● really short time-to-market (e.g. for MVP verification)
Questions
@antonkirillov datastrophic.io

Data processing platforms architectures with Spark, Mesos, Akka, Cassandra and Kafka

  • 1. SMACK Architectures Building data processing platforms with Spark, Mesos, Akka, Cassandra and Kafka Anton Kirillov Big Data AW Meetup Sep 2015
  • 2. Who is this guy? ● Scala programmer ● Focused on distributed systems ● Building data platforms with SMACK/Hadoop ● Ph.D. in Computer Science ● Big Data engineer/consultant at Big Data AB ● Currently at Ooyala Stockholm (Videoplaza AB) ● Working with startups 2
  • 3. Roadmap ● SMACK stack overview ● Storage layer layout ● Fixing NoSQL limitations ● Cluster resource management ● Reliable scheduling and execution ● Data ingestion options ● Preparing for failures 3
  • 4. SMACK Stack ● Spark - fast and general engine for distributed, large-scale data processing ● Mesos - cluster resource management system that provides efficient resource isolation and sharing across distributed applications ● Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM ● Cassandra - distributed, highly available database designed to handle large amounts of data across multiple datacenters ● Kafka - a high-throughput, low-latency distributed messaging system designed for handling real-time data feeds 4
  • 5. Storage Layer: Cassandra ● optimized for heavy write loads ● configurable CA (CAP) ● linearly scalable ● XDCR support ● easy cluster resizing and inter-DC data migration 5
  • 6. Cassandra Data Model ● nested sorted map ● should be optimized for read queries ● data is distributed across nodes by partition key CREATE TABLE campaign( id uuid, year int, month int, day int, views bigint, clicks bigint, PRIMARY KEY (id, year, month, day) ); INSERT INTO campaign(id, year, month, day, views, clicks) VALUES(40b08953-a…,2015, 9, 10, 1000, 42); SELECT views, clicks FROM campaign WHERE id=40b08953-a… and year=2015 and month>8; 6
  • 7. Spark/Cassandra Example ● calculate total views per campaign for given month for all campaigns CREATE TABLE event( id uuid, ad_id uuid, campaign uuid, ts bigint, type text, PRIMARY KEY(id) ); val sc = new SparkContext(conf) case class Event(id: UUID, ad_id: UUID, campaign: UUID, ts: Long, `type`: String) sc.cassandraTable[Event]("keyspace", "event") .filter(e => e.`type` == "view" && checkMonth(e.ts)) .map(e => (e.campaign, 1)) .reduceByKey(_ + _) .collect() 7
  • 8. Naive Lambda example with Spark SQL case class CampaignReport(id: String, views: Long, clicks: Long) sql("""SELECT campaign.id as id, campaign.views as views, campaign.clicks as clicks, event.type as type FROM campaign JOIN event ON campaign.id = event.campaign """).rdd .groupBy(row => row.getAs[String]("id")) .map{ case (id, rows) => val views = rows.head.getAs[Long]("views") val clicks = rows.head.getAs[Long]("clicks") val res = rows.groupBy(row => row.getAs[String]("type")).mapValues(_.size) CampaignReport(id, views = views + res("view"), clicks = clicks + res("click")) }.saveToCassandra(“keyspace”, “campaign_report”) 8
  • 9. Let’s take a step back: Spark Basics ● RDD operations(transformations and actions) form DAG ● DAG is split into stages of tasks which are then submitted to cluster manager ● stages combine tasks which don’t require shuffling/repartitioning ● tasks run on workers and results then return to client 9
  • 10. Architecture of Spark/Cassandra Clusters. Separate Write & Analytics: ● clusters can be scaled independently ● data is replicated by Cassandra asynchronously ● Analytics has different read/write load patterns ● Analytics contains additional data and processing results ● Spark resource impact is limited to only one DC. To take full advantage of the Spark-Cassandra (C*) connector's data locality awareness, Spark workers should be collocated with Cassandra nodes.
  • 11. Spark Applications Deployment Revisited. Cluster Manager: ● Spark Standalone ● YARN ● Mesos
  • 12. Managing Cluster Resources: Mesos ● heterogeneous workloads ● full cluster utilization ● static vs. dynamic resource allocation ● fault tolerance and disaster recovery ● single resource view at datacenter level (image source: http://www.slideshare.net/caniszczyk/apache-mesos-at-twitter-texas-linuxfest-2014)
  • 13. Mesos Architecture Overview ● leader election and service discovery via ZooKeeper ● slaves publish available resources to the master ● the master sends resource offers to frameworks ● a framework's scheduler replies with tasks and the resources needed per task ● the master sends tasks to slaves
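The offer cycle described on this slide can be sketched as a toy model. It is not the Mesos API; Resources, Offer, TaskSpec, Launch and schedule are hypothetical names. Offers stand for what the master forwards from the slaves, and schedule plays the role of a framework scheduler that answers with the tasks it wants launched, placing each pending task greedily on the first offer that still has room.

```scala
object OfferCycle {
  final case class Resources(cpus: Double, memMb: Int)
  final case class Offer(slave: String, available: Resources)
  final case class TaskSpec(name: String, needs: Resources)
  final case class Launch(task: String, slave: String)

  private def fits(needs: Resources, available: Resources): Boolean =
    needs.cpus <= available.cpus && needs.memMb <= available.memMb

  // Framework scheduler: greedily place pending tasks onto the offered resources.
  def schedule(offers: List[Offer], pending: List[TaskSpec]): List[Launch] = {
    val (_, launches) = pending.foldLeft((offers, List.empty[Launch])) {
      case ((remaining, acc), task) =>
        remaining.indexWhere(o => fits(task.needs, o.available)) match {
          case -1 => (remaining, acc) // nothing fits; the task waits for the next offer round
          case i =>
            val o = remaining(i)
            val left = Resources(o.available.cpus - task.needs.cpus,
                                 o.available.memMb - task.needs.memMb)
            (remaining.updated(i, o.copy(available = left)), Launch(task.name, o.slave) :: acc)
        }
    }
    launches.reverse
  }
}
```

Tasks that do not fit any current offer are simply left out of the reply; in this model they would be reconsidered when the next round of offers arrives.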
  • 14. Bringing Spark, Mesos and Cassandra Together. Deployment example: ● Mesos Masters and ZooKeepers collocated ● Mesos Slaves and Cassandra nodes collocated to enforce better data locality for Spark ● Spark binaries deployed to all worker nodes and spark-env is configured ● Spark Executor JAR uploaded to S3. Invocation example:
      spark-submit --class io.datastrophic.SparkJob /etc/jobs/spark-jobs.jar
  • 15. Marathon ● long running tasks execution ● HA mode with ZooKeeper ● Docker executor ● REST API
  • 16. Chronos ● distributed cron ● HA mode with ZooKeeper ● supports graphs of jobs ● sensitive to network failures
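A "graph of jobs" means a job may declare parent jobs that must finish before it runs. The ordering problem this creates can be sketched as a topological sort. This is an illustration only, not Chronos code; JobGraph and executionOrder are hypothetical names, and the input maps each job to the set of its parents.

```scala
object JobGraph {
  // jobs: job name -> names of the parent jobs that must finish first.
  // Returns None when no valid order exists (a cycle or an unknown parent).
  def executionOrder(jobs: Map[String, Set[String]]): Option[List[String]] = {
    @annotation.tailrec
    def loop(remaining: Map[String, Set[String]], done: List[String]): Option[List[String]] =
      if (remaining.isEmpty) Some(done.reverse)
      else {
        val finished = done.toSet
        // jobs whose parents have all completed; sorted for a deterministic order
        val ready = remaining.collect {
          case (job, parents) if parents.forall(p => finished.contains(p)) => job
        }.toList.sorted
        if (ready.isEmpty) None
        else loop(remaining -- ready, ready.reverse ::: done)
      }
    loop(jobs, Nil)
  }
}
```

For example, a report job that depends on an aggregate job, which in turn depends on an ingest job, is ordered ingest, aggregate, report.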
  • 17. More Mesos frameworks ● Hadoop ● Cassandra ● Kafka ● Myriad: YARN on Mesos ● Storm ● Samza
  • 18. Data ingestion: endpoints to consume the data. Endpoint requirements: ● high throughput ● resiliency ● easy scalability ● back pressure
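Back pressure in its simplest form is a bounded buffer that refuses new input when downstream is too slow, so that producers learn they have to slow down. A minimal sketch using only the JDK's ArrayBlockingQueue; BoundedEndpoint, Endpoint, Accepted and Overloaded are hypothetical names, and a real endpoint would typically map Overloaded to something like an HTTP 503 response.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BoundedEndpoint {
  sealed trait Response
  case object Accepted extends Response
  case object Overloaded extends Response

  final class Endpoint(capacity: Int) {
    private val buffer = new ArrayBlockingQueue[String](capacity)

    // offer() never blocks: it returns false when the buffer is full
    def receive(payload: String): Response =
      if (buffer.offer(payload)) Accepted else Overloaded

    // The downstream consumer takes one message, freeing capacity; None when empty.
    def drainOne(): Option[String] = Option(buffer.poll())
  }
}
```

The design choice here is to reject rather than block: a blocked request thread still consumes resources, while an explicit rejection lets the client retry or shed load.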
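  • 19. Akka features ● actor model implementation for the JVM ● message-based and asynchronous ● no shared mutable state ● easy scalability from one process to a cluster of machines ● actor hierarchies with parental supervision ● not only a concurrency framework: ○ akka-http ○ akka-streams ○ akka-persistence
      import akka.actor.{Actor, ActorLogging, Props}
      import play.api.libs.json.Json
      import scala.util.{Failure, Success, Try}

      // Event and HttpRequest are application types assumed to be in scope,
      // together with an implicit JSON Reads[Event]
      class JsonParserActor extends Actor with ActorLogging {
        def receive = {
          case s: String => Try(Json.parse(s).as[Event]) match {
            case Failure(ex) => log.error(ex, "failed to parse event")
            case Success(event) => sender() ! event
          }
        }
      }

      class HttpActor extends Actor {
        def receive = {
          // context.actorOf creates child actors supervised by this actor
          case req: HttpRequest => context.actorOf(Props[JsonParserActor]) ! req.body
          case e: Event => context.actorOf(Props[CassandraWriterActor]) ! e
        }
      }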
  • 20. Writing to Cassandra with Akka
      import akka.actor.{Actor, ActorLogging}
      import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}
      import scala.util.{Failure, Success, Try}

      class CassandraWriterActor extends Actor with ActorLogging {
        // for demo purposes, session initialized here
        val session = Cluster.builder()
          .addContactPoint("cassandra.host")
          .build()
          .connect()

        override def receive: Receive = {
          case event: Event =>
            val statement = new SimpleStatement(event.createQuery)
              .setConsistencyLevel(ConsistencyLevel.QUORUM)
            // session.execute blocks until the write completes or fails
            Try(session.execute(statement)) match {
              case Failure(ex) => log.error(ex, "write failed") // error handling code
              case Success(_) => sender() ! WriteSuccessfull
            }
        }
      }
  • 21. Cassandra meets Batch Processing ● writing raw data (events) to Cassandra with Akka is easy ● but the computation time of aggregations/rollups grows with the amount of data ● Cassandra is designed for fast serving, not batch processing, so pre-aggregation of incoming data is needed ● actors are not well suited to performing aggregation because of their stateless design model ● micro-batches partially solve the problem ● reliable storage for raw data is still needed
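Pre-aggregation over micro-batches can be sketched as follows: each small batch of raw events is collapsed into one rollup per (campaign, day) before anything is written, and the rollups are then merged into running totals. This is a minimal sketch with hypothetical names (PreAggregation, Rollup, aggregate, merge); the event fields are simplified to strings, and unknown event types are ignored by assumption.

```scala
object PreAggregation {
  final case class Event(campaign: String, day: String, eventType: String)
  final case class Rollup(views: Long, clicks: Long) {
    def +(other: Rollup): Rollup = Rollup(views + other.views, clicks + other.clicks)
  }
  type Key = (String, String) // (campaign, day)

  private def bump(acc: Map[Key, Rollup], key: Key, delta: Rollup): Map[Key, Rollup] =
    acc.updated(key, acc.getOrElse(key, Rollup(0L, 0L)) + delta)

  // Collapses one micro-batch into a single Rollup per (campaign, day).
  def aggregate(batch: Seq[Event]): Map[Key, Rollup] =
    batch.foldLeft(Map.empty[Key, Rollup]) { (acc, e) =>
      e.eventType match {
        case "view"  => bump(acc, (e.campaign, e.day), Rollup(1L, 0L))
        case "click" => bump(acc, (e.campaign, e.day), Rollup(0L, 1L))
        case _       => acc // assumption: other event types are not rolled up
      }
    }

  // Merges the rollups of one batch into the running totals.
  def merge(totals: Map[Key, Rollup], batch: Map[Key, Rollup]): Map[Key, Rollup] =
    batch.foldLeft(totals) { case (acc, (key, rollup)) => bump(acc, key, rollup) }
}
```

The number of writes then scales with the number of distinct keys per batch rather than with the number of raw events, which is the point of pre-aggregating.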
  • 22. Kafka: distributed commit log ● pre-aggregation of incoming data ● consumers read data in batches ● a comparable managed service on AWS is Kinesis
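  • 23. Publishing to Kafka with Akka Http
      import akka.http.scaladsl.Http
      import akka.http.scaladsl.model.StatusCodes
      import akka.http.scaladsl.server.Directives._
      import akka.http.scaladsl.server.Route
      import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
      import play.api.libs.json.{JsError, JsSuccess, Json}

      // assumes the old Scala producer API; KafkaConfig() is assumed to return java.util.Properties
      val producerConfig = new ProducerConfig(KafkaConfig())
      lazy val producer = new Producer[String, String](producerConfig)
      val topic = "raw_events"

      val routes: Route = {
        post {
          decodeRequest {
            entity(as[String]) { str =>
              Json.parse(str).validate[Event] match {
                case _: JsSuccess[_] =>
                  // publish the raw JSON string once it has been validated
                  producer.send(new KeyedMessage[String, String](topic, str))
                  complete(StatusCodes.OK)
                case e: JsError =>
                  complete(StatusCodes.BadRequest -> JsError.toFlatJson(e).toString())
              }
            }
          }
        }
      }

      // config here is the application configuration provided by the Service trait
      object AkkaHttpMicroservice extends App with Service {
        Http().bindAndHandle(routes, config.getString("http.interface"), config.getInt("http.port"))
      }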
  • 24. Spark Streaming ● variety of data sources ● at-least-once semantics ● exactly-once semantics available with Kafka Direct and idempotent storage
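Why idempotent storage matters under at-least-once delivery can be shown with a small sketch: when a message is redelivered, a keyed upsert leaves the store unchanged, while a counter increment double-counts. The names IdempotentSink, upsert and increment are hypothetical, and an in-memory Map stands in for the database.

```scala
object IdempotentSink {
  final case class Event(id: String, campaign: String)

  // Idempotent: a keyed upsert; writing the same event twice leaves a single row.
  def upsert(store: Map[String, Event], e: Event): Map[String, Event] =
    store.updated(e.id, e)

  // Not idempotent: an increment counts a redelivered event twice.
  def increment(counts: Map[String, Long], e: Event): Map[String, Long] =
    counts.updated(e.campaign, counts.getOrElse(e.campaign, 0L) + 1L)
}
```

Folding a stream in which one event is delivered twice through upsert gives the same state as delivering it once, which is what lets at-least-once delivery behave like exactly-once at the storage level.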
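  • 25. Spark Streaming: Kinesis example
      import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
      import com.datastax.spark.connector._
      import com.datastax.spark.connector.streaming._
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}
      import org.apache.spark.streaming.kinesis.KinesisUtils

      val ssc = new StreamingContext(conf, Seconds(10))
      val kinesisStream = KinesisUtils.createStream(ssc, appName, streamName,
        endpointURL, regionName, InitialPositionInStream.LATEST,
        Duration(checkpointInterval), StorageLevel.MEMORY_ONLY)

      // transforming the given stream to Event and saving to C*
      kinesisStream.map(JsonUtils.byteArrayToEvent)
        .saveToCassandra(keyspace, table)

      ssc.start()
      ssc.awaitTermination()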
  • 26. Designing for Failure: Backups and Patching ● be prepared for failures and broken data ● design backup and patching strategies upfront ● idempotence should be enforced
  • 27. Restoring backup from S3
      val sc = new SparkContext(conf)
      sc.textFile(s"s3n://bucket/2015/*/*.gz")
        .map(s => Try(JsonUtils.stringToEvent(s)))
        .filter(_.isSuccess).map(_.get) // keep only the lines that parsed successfully
        .saveToCassandra(config.keyspace, config.table)
  • 29. So what is SMACK? ● a concise toolbox for a wide variety of data processing scenarios ● battle-tested and widely used software with large communities ● easy scalability and replication of data while preserving low latencies ● unified cluster management for heterogeneous loads ● a single platform for any kind of application ● an implementation platform for different architecture designs ● really short time-to-market (e.g. for MVP verification)