SlideShare a Scribd company logo
1 of 74
Download to read offline
From 0 to Cassandra on AWS in 30 days
Tsunami alerting with Cassandra
Andrei Arion - Geoffrey Bérard - Charles Herriau
@BeastieFurets
1
1 Context
2 The project, step by step
3 Feedbacks
4 Conclusion
2
3
LesFurets.com
•  1st independant insurance aggregator
•  A unique place to compare and buy hundreds of insurance products
•  Key figures:
–  2 500 000 quotes/year
–  31% market share on car comparison (january 2015)
–  more than 50 insurers
4
LesFurets.com
5
@BeastieFurets
6
4 Teams | 25 Engineers | Lean Kanban | Daily delivery
LesFurets.com
7
Bigdata@Telcom ParisTech
8
Context
•  Master Level course
•  Big Data Analysis and Management
–  Non relational Databases
–  30 hours
•  Lectures: 25 %
•  Hand’s on: 75 %
9http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
Context
•  30 students:
–  various backgrounds: students and professionals
–  various skill levels
–  various expertise area: devops, data engineer, marketing
–  SQL and Hadoop exposure
10http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
Main topics:
Project goals
–  Choose the right technology
–  Implement from scratch a full stack solution
–  Optimize data model for a particular use case
–  Deploy on cloud
–  Discuss tradeoffs
One month to deliver
11http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
The project,
Step by step
12
Let’s all become students
13
14
Labor day
1
15
Discovering the subject
16
Discovering the subject
An earthquake occurs in the Sea of Japan. A tsunami is likely to hit the coast.
Notify all the population within a 500km range as soon as possible.
Constraints:
•  Team work (3 members)
•  Choose a technology seen in the module
•  Deploy on AWS (300€ budget)
•  The closest data center is lost
•  Pre-load the data
2
17
Discovering Data
18
How to get the data ?
3
•  Generated logs from telecom providers
•  One month of position tracking
•  1 Mb, 1 Gb, 10 Gb, 100 Gb files
•  Available on Amazon S3
•  10 biggest cities
•  100 network cells/city
•  1.000.000 different phone numbers
19
Data distribution
4
20
Data format
Timestamp Cell ID Coordinates Phone Number
4
2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924
Full dataset (100 GB):
•  240 000 logs per hour / city
•  ~ 2 000 000 000 rows
21
Data distribution
5
22
Data distribution
5
2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924
2015-01-04 17:10:52,928;Yok_85;35.494120;139.424249;121737
2015-01-04 17:10:52,423;Tok_14;35.683104;139.755020;731737
2015-01-04 17:10:53,923;Osa_61;34.232793;135.906451;861343
2015-01-04 17:10:53,153;Kyo_06;34.980933;135.777283;431737
...
2015-01-04 17:10:55,928;Yok_99;35.030989;140.126021;829924
23
Data distribution
5
Choosing the stack
24
25
Storage Technology short list
•  Efficient for logs
•  Fault Tolerance
–  Geolocation
–  Easy transform logs into documents
6
26
Install a local Cassandra cluster
7
27
Data Modeling
28
Naive model
Osa_19
2015-05-19 15:25:57,369 2015-05-21 15:00:57,551
456859 012564
CREATE TABLE phones1 (
cell text,
instant timestamp,
phoneNumber text,
PRIMARY KEY (cell, instant)
);
9
29
Avoid overwrites
Osa_19
2015-05-19 15:25:57,369 2015-05-21 15:00:57,551
456859, 659842 012564
9
CREATE TABLE phones2 (
cell text,
instant timestamp,
phoneNumbers Set<text>,
PRIMARY KEY (cell, instant)
);
30
Compound partition key
Osa_19 : 2015-05-19 15:25:57,369
456859 659842
- -
10
CREATE TABLE phones3 (
cell text,
instant timestamp,
phoneNumber text,
PRIMARY KEY ((cell, instant), phoneNumber)
);
34
Hourly bucketing
Osa_19 : 2015-05-19 15:00
456859 659842
- -
10
CREATE TABLE phones4 (
cell text,
instant timestamp,
phoneNumber text,
PRIMARY KEY ((cell, instant), phoneNumber)
);
CREATE TABLE phones5 (
cell text,
instant timestamp,
numbers text,
PRIMARY KEY ((cell, instant))
);
32
Query first
Osa_19, 2015-05-19 15:00
456859,659842
-
11
33
Importing Data
34
Import data into Cassandra:
13
100GB
2 billions rows
?
35
Import data into Cassandra
13
36
Import data into Cassandra: Copy
13
•  Pre-­‐processing:	
  
–  project	
  only	
  the	
  needed	
  informa9on	
  
–  cleanup	
  date	
  format	
  (ISO-8601 vs RFC3339)
37
Import data into Cassandra: Copy
14
100 hours*200 hours*
Pre-process COPY
100GB
2 billions rows
sed 's/,/./g' | awk -F',' '{ printf ""%s",
"%s","%s"",$2,$1,$5;print ""}'
•  Naive	
  parallelized	
  COPY	
  	
  
–  launch	
  one	
  copy	
  per	
  node	
  
–  launch	
  mul9ple	
  COPY	
  threads	
  /	
  node	
  
38
Import data into Cassandra: parallelize Copy
14
pre-processing,
splitting
parallelized copy
100GB
2 billions rows
39
Import data into Cassandra: Copy
15
 
•  What	
  really	
  happens	
  during	
  the	
  import?	
  
•  IMPORT	
  =	
  WRITE 	
  	
  
•  What	
  really	
  happens	
  during	
  a	
  WRITE	
  
	
  
40
How to speed up Cassandra import?
16
41
Cassandra Write Path Documentation
17
•  How	
  to	
  speed	
  up	
  Cassandra	
  import	
  =>	
  SSTableLoader	
  	
  
–  generate	
  SSTables	
  using	
  the	
  SSTableWriter	
  API	
  
–  stream	
  the	
  SSTables	
  to	
  your	
  Cassandra	
  cluster	
  using	
  sstableloader	
  
42
SSTableLoader
17
SSTableWriter sstableloader
100GB
2 billions rows
300 hours* 2 hours*
•  S3	
  is	
  distributed!	
  
–  op9mized	
  for	
  high-­‐throughput	
  when	
  used	
  in	
  parallel	
  
–  very	
  high	
  throughput	
  to/from	
  EC2	
  	
  
43
Import data into Cassandra from Amazon S3
17
©2011 http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
•  S3	
  is	
  a	
  distributed	
  file	
  system	
  
–  op9mized	
  for	
  high-­‐throughput	
  when	
  used	
  in	
  parallel	
  
•  Use	
  Spark	
  as	
  ETL	
  
44
Import data into Cassandra from Amazon S3
18
S3
100GB
2 billions rows
2 billions
inserts
•  use	
  Spark	
  to	
  read/pre-­‐process/write	
  to	
  Cassandra	
  in	
  parallel	
  
•  map	
  
	
  
45
Using Spark to parallelize the insert
18
S3
100GB
2 billions rows
2 billions
inserts
•  instead	
  of	
  impor9ng	
  all	
  the	
  lines	
  from	
  the	
  csv,	
  use	
  Spark	
  to	
  reduce	
  the	
  lines	
  
•  map+reduce	
  
	
  
46
Cheating / Modeling before inserting
18
S3
100GB
2 billions rows
2 billions
inserts
•  instead	
  of	
  impor9ng	
  all	
  the	
  lines	
  from	
  the	
  csv,	
  use	
  Spark	
  to	
  reduce	
  the	
  lines	
  
•  map+reduce	
  
	
  
47
Cheating / Modeling before inserting
19
S3
100GB
2 billions rows
1 million
inserts
48
Using AWS
49
Choosing the right AWS Instance type
22
50
Shopping on Amazon
•  What can 300E get you: xHours of y Instance Types + Z EBS
22
Good starting point: 6 analytics nodes, M3.large or M3.xlarge, SSD
Finally : not so difficult to choose
51
23
Capacity planning
52
•  Avantages
–  install an open source stack
–  full control
–  latest version
•  Inconvenients
–  devops skills required
–  lot of configuration to do
–  version compatibility
–  time / budget consuming
Install from scratch
24
•  DataStax Enterprise AMI : 1 analytics node shared between Cassandra and Spark
53
Start small
25
•  Requirement:
–  Preload data
•  Goal:
–  Minimize cluster uptime
•  Strategies:
–  use EBS -> persistent storage (expensive, slow)
–  load the dataset on a node and turn off the others (free, time consuming)
–  do everything “the night before” (free, risky)
54
Optimizing budget
26
55
Preparing for D-Day
•  Preload the data into Cassandra on the AWS cluster
•  Input: earthquake date and location
•  Kill the closest node
•  Start simulation
–  monitor in realtime the alerting performance
•  Tuning (model, import)
Lather, rinse, repeat...
56
Preparing for the D Day
28
•  Alternatives:
–  stop/kill the Cassandra process
–  turn off the gossip
–  AWS instance shutdown
•  The effect is the same
57
How to stop a specific node of the cluster ?
29
58
Pre-loading data
29
59
Waiting our turn
30
Conclusion
60
61
62
Data Model
No model
Use Input File Model
Query First
Optimized Query First
Smart Cheaters
0 1 2 3 4
63
Modelling in SQL vs CQL
●  SQL => CQL : state of mind shifting
○  3NF vs denormalization
○  model first vs query first
●  SQL and CQL are syntactically very similar that can be confusing
○  no Joins, range queries on partition keys...
64
Modeling data: cqlsh vs cli
65
Importing Data
0 Mo
10 Mo
1,000 Mo
100,000 Mo
66
Development
Notification System
Visualisation System
Real Time Notification
Killing node
Manage Aftershoks
0 1 2 3 4 5 6 7 8 9 10
Yes No
67
Architecture
70%
10%
20%
Cassandra
MongoDB
Both
68
Amazon Web Services
80%
20%
Yes
No
69
Quality
100%
Yes
No
70
What is the main problem ?
Modeling Data
Importing Data
Architecture
Development
Deployment
Quality
=> Multidisciplinary Approach
71
Multidisciplinary approach
Team 1 2 3 4 5 6 7 8 9 10
Modeling Data
√	
   √	
   √	
   √	
   √	
   √	
   √	
   √	
  
Importing Data
√	
   √	
   √	
  
Architecture
√	
   √	
   √	
   √	
   √	
   √	
  
Development
√	
   √	
   √	
   √	
   √	
   √	
  
Deployment
√	
   √	
   √	
   √	
   √	
   √	
   √	
   √	
  
Quality
√	
   √	
   √	
   √	
  
Grade 8 10 12 13 13 14 14.5 16 16.5 17
72
Team Skills Balance
0
1
2
3
4
5
Data Modeling
Data import
ArchitectureDevelopment
Deployment
17/20
14/20
08/20
Conclusion
•  Easy to learn and easy to teach
•  Easy to deploy on AWS
•  Performance monitor for tuning the models
•  Passing from SQL vs CQL can be challenging
•  Fun to implement on a concrete use-case with time/budget constraints
73
74
Thank you
Andrei Arion - aar@lesfurets.com
Geoffrey Bérard - gbe@lesfurets.com
Charles Herriau - che@lesfurets.com
@BeastieFurets

More Related Content

What's hot

Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumbergerinside-BigData.com
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Julia Angell
 
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendscodecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendsDataStax Academy
 
Measuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS InstancesMeasuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS InstancesScyllaDB
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Databricks
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingDoiT International
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAwaretwilmes
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series WorldMapR Technologies
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkDatabricks
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data AppTrieu Nguyen
 
Understanding Storage I/O Under Load
Understanding Storage I/O Under LoadUnderstanding Storage I/O Under Load
Understanding Storage I/O Under LoadScyllaDB
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetupamarsri
 
Counters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleCounters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleEric Lubow
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexesDaniel Lemire
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kitehuguk
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaminghuguk
 

What's hot (20)

Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
 
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendscodecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
 
Measuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS InstancesMeasuring Database Performance on Bare Metal AWS Instances
Measuring Database Performance on Bare Metal AWS Instances
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
 
Understanding Storage I/O Under Load
Understanding Storage I/O Under LoadUnderstanding Storage I/O Under Load
Understanding Storage I/O Under Load
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Counters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary TaleCounters At Scale - A Cautionary Tale
Counters At Scale - A Cautionary Tale
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kite
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaming
 

Similar to LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting System with Cassandra (Français)

[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Crate.io
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Dataconomy Media
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerAndrew Yongjoon Kong
 
Adam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStackAdam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStackShapeBlue
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...DataStax Academy
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsJulien Anguenot
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot modePrakash Chockalingam
 
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...RightScale
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaScyllaDB
 
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...Coburn Watson
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 

Similar to LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting System with Cassandra (Français) (20)

[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
 
Collecting 600M events/day
Collecting 600M events/dayCollecting 600M events/day
Collecting 600M events/day
 
Dice presents-feb2014
Dice presents-feb2014Dice presents-feb2014
Dice presents-feb2014
 
Self-Service Supercomputing
Self-Service SupercomputingSelf-Service Supercomputing
Self-Service Supercomputing
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
 
Adam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStackAdam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStack
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot mode
 
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...[RightScale Webinar] Architecting Databases in the cloud:  How RightScale Doe...
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe...
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
 
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in t...
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Recently uploaded

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Recently uploaded (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting System with Cassandra (Français)

  • 1. From 0 to Cassandra on AWS in 30 days Tsunami alerting with Cassandra Andrei Arion - Geoffrey Bérard - Charles Herriau @BeastieFurets 1
  • 2. 1 Context 2 The project, step by step 3 Feedbacks 4 Conclusion 2
  • 3. 3
  • 4. LesFurets.com •  1st independant insurance aggregator •  A unique place to compare and buy hundreds of insurance products •  Key figures: –  2 500 000 quotes/year –  31% market share on car comparison (january 2015) –  more than 50 insurers 4
  • 6. @BeastieFurets 6 4 Teams | 25 Engineers | Lean Kanban | Daily delivery
  • 9. Context •  Master Level course •  Big Data Analysis and Management –  Non relational Databases –  30 hours •  Lectures: 25 % •  Hand’s on: 75 % 9http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
  • 10. Context •  30 students: –  various backgrounds: students and professionals –  various skill levels –  various expertise area: devops, data engineer, marketing –  SQL and Hadoop exposure 10http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html Main topics:
  • 11. Project goals –  Choose the right technology –  Implement from scratch a full stack solution –  Optimize data model for a particular use case –  Deploy on cloud –  Discuss tradeoffs One month to deliver 11http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html
  • 13. Let’s all become students 13
  • 16. 16 Discovering the subject An earthquake occurs in the Sea of Japan. A tsunami is likely to hit the coast. Notify all the population within a 500km range as soon as possible. Constraints: •  Team work (3 members) •  Choose a technology seen in the module •  Deploy on AWS (300€ budget) •  The closest data center is lost •  Pre-load the data 2
  • 18. 18 How to get the data ? 3 •  Generated logs from telecom providers •  One month of position tracking •  1 Mb, 1 Gb, 10 Gb, 100 Gb files •  Available on Amazon S3
  • 19. •  10 biggest cities •  100 network cells/city •  1.000.000 different phone numbers 19 Data distribution 4
  • 20. 20 Data format Timestamp Cell ID Coordinates Phone Number 4 2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924
  • 21. Full dataset (100 GB): •  240 000 logs per hour / city •  ~ 2 000 000 000 rows 21 Data distribution 5
  • 22. 22 Data distribution 5 2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924 2015-01-04 17:10:52,928;Yok_85;35.494120;139.424249;121737 2015-01-04 17:10:52,423;Tok_14;35.683104;139.755020;731737 2015-01-04 17:10:53,923;Osa_61;34.232793;135.906451;861343 2015-01-04 17:10:53,153;Kyo_06;34.980933;135.777283;431737 ... 2015-01-04 17:10:55,928;Yok_99;35.030989;140.126021;829924
  • 25. 25 Storage Technology short list •  Efficient for logs •  Fault Tolerance –  Geolocation –  Easy transform logs into documents 6
  • 26. 26 Install a local Cassandra cluster 7
  • 28. 28 Naive model Osa_19 2015-05-19 15:25:57,369 2015-05-21 15:00:57,551 456859 012564 CREATE TABLE phones1 ( cell text, instant timestamp, phoneNumber text, PRIMARY KEY (cell, instant) ); 9
  • 29. 29 Avoid overwrites Osa_19 2015-05-19 15:25:57,369 2015-05-21 15:00:57,551 456859, 659842 012564 9 CREATE TABLE phones2 ( cell text, instant timestamp, phoneNumbers Set<text>, PRIMARY KEY (cell, instant) );
  • 30. 30 Compound partition key Osa_19 : 2015-05-19 15:25:57,369 456859 659842 - - 10 CREATE TABLE phones3 ( cell text, instant timestamp, phoneNumber text, PRIMARY KEY ((cell, instant), phoneNumber) );
  • 31. 34 Hourly bucketing Osa_19 : 2015-05-19 15:00 456859 659842 - - 10 CREATE TABLE phones4 ( cell text, instant timestamp, phoneNumber text, PRIMARY KEY ((cell, instant), phoneNumber) );
  • 32. CREATE TABLE phones5 ( cell text, instant timestamp, numbers text, PRIMARY KEY ((cell, instant)) ); 32 Query first Osa_19, 2015-05-19 15:00 456859,659842 - 11
  • 34. 34 Import data into Cassandra: 13 100GB 2 billions rows ?
  • 35. 35 Import data into Cassandra 13
  • 36. 36 Import data into Cassandra: Copy 13
  • 37. •  Pre-­‐processing:   –  project  only  the  needed  informa9on   –  cleanup  date  format  (ISO-8601 vs RFC3339) 37 Import data into Cassandra: Copy 14 100 hours*200 hours* Pre-process COPY 100GB 2 billions rows sed 's/,/./g' | awk -F',' '{ printf ""%s", "%s","%s"",$2,$1,$5;print ""}'
  • 38. •  Naive  parallelized  COPY     –  launch  one  copy  per  node   –  launch  mul9ple  COPY  threads  /  node   38 Import data into Cassandra: parallelize Copy 14 pre-processing, splitting parallelized copy 100GB 2 billions rows
  • 39. 39 Import data into Cassandra: Copy 15
  • 40.   •  What  really  happens  during  the  import?   •  IMPORT  =  WRITE     •  What  really  happens  during  a  WRITE     40 How to speed up Cassandra import? 16
  • 41. 41 Cassandra Write Path Documentation 17
  • 42. •  How  to  speed  up  Cassandra  import  =>  SSTableLoader     –  generate  SSTables  using  the  SSTableWriter  API   –  stream  the  SSTables  to  your  Cassandra  cluster  using  sstableloader   42 SSTableLoader 17 SSTableWriter sstableloader 100GB 2 billions rows 300 hours* 2 hours*
  • 43. •  S3  is  distributed!   –  op9mized  for  high-­‐throughput  when  used  in  parallel   –  very  high  throughput  to/from  EC2     43 Import data into Cassandra from Amazon S3 17 ©2011 http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/
  • 44. •  S3  is  a  distributed  file  system   –  op9mized  for  high-­‐throughput  when  used  in  parallel   •  Use  Spark  as  ETL   44 Import data into Cassandra from Amazon S3 18 S3 100GB 2 billions rows 2 billions inserts
  • 45. •  use  Spark  to  read/pre-­‐process/write  to  Cassandra  in  parallel   •  map     45 Using Spark to parallelize the insert 18 S3 100GB 2 billions rows 2 billions inserts
  • 46. •  instead  of  impor9ng  all  the  lines  from  the  csv,  use  Spark  to  reduce  the  lines   •  map+reduce     46 Cheating / Modeling before inserting 18 S3 100GB 2 billions rows 2 billions inserts
  • 47. •  instead  of  impor9ng  all  the  lines  from  the  csv,  use  Spark  to  reduce  the  lines   •  map+reduce     47 Cheating / Modeling before inserting 19 S3 100GB 2 billions rows 1 million inserts
  • 49. 49 Choosing the right AWS Instance type 22
  • 50. 50 Shopping on Amazon •  What can 300E get you: xHours of y Instance Types + Z EBS 22
  • 51. Good starting point: 6 analytics nodes, M3.large or M3.xlarge, SSD Finally : not so difficult to choose 51 23 Capacity planning
  • 52. 52 •  Avantages –  install an open source stack –  full control –  latest version •  Inconvenients –  devops skills required –  lot of configuration to do –  version compatibility –  time / budget consuming Install from scratch 24
  • 53. •  DataStax Enterprise AMI : 1 analytics node shared between Cassandra and Spark 53 Start small 25
  • 54. •  Requirement: –  Preload data •  Goal: –  Minimize cluster uptime •  Strategies: –  use EBS -> persistent storage (expensive, slow) –  load the dataset on a node and turn off the others (free, time consuming) –  do everything “the night before” (free, risky) 54 Optimizing budget 26
  • 56. •  Preload the data into Cassandra on the AWS cluster •  Input: earthquake date and location •  Kill the closest node •  Start simulation –  monitor in realtime the alerting performance •  Tuning (model, import) Lather, rinse, repeat... 56 Preparing for the D Day 28
  • 57. •  Alternatives: –  stop/kill the Cassandra process –  turn off the gossip –  AWS instance shutdown •  The effect is the same 57 How to stop a specific node of the cluster ? 29
  • 61. 61
  • 62. 62 Data Model No model Use Input File Model Query First Optimized Query First Smart Cheaters 0 1 2 3 4
  • 63. 63 Modelling in SQL vs CQL ●  SQL => CQL : state of mind shifting ○  3NF vs denormalization ○  model first vs query first ●  SQL and CQL are syntactically very similar that can be confusing ○  no Joins, range queries on partition keys...
  • 65. 65 Importing Data 0 Mo 10 Mo 1,000 Mo 100,000 Mo
  • 66. 66 Development Notification System Visualisation System Real Time Notification Killing node Manage Aftershoks 0 1 2 3 4 5 6 7 8 9 10 Yes No
  • 70. 70 What is the main problem ? Modeling Data Importing Data Architecture Development Deployment Quality => Multidisciplinary Approach
  • 71. 71 Multidisciplinary approach Team 1 2 3 4 5 6 7 8 9 10 Modeling Data √   √   √   √   √   √   √   √   Importing Data √   √   √   Architecture √   √   √   √   √   √   Development √   √   √   √   √   √   Deployment √   √   √   √   √   √   √   √   Quality √   √   √   √   Grade 8 10 12 13 13 14 14.5 16 16.5 17
  • 72. 72 Team Skills Balance 0 1 2 3 4 5 Data Modeling Data import ArchitectureDevelopment Deployment 17/20 14/20 08/20
  • 73. Conclusion •  Easy to learn and easy to teach •  Easy to deploy on AWS •  Performance monitor for tuning the models •  Passing from SQL vs CQL can be challenging •  Fun to implement on a concrete use-case with time/budget constraints 73
  • 74. 74 Thank you Andrei Arion - aar@lesfurets.com Geoffrey Bérard - gbe@lesfurets.com Charles Herriau - che@lesfurets.com @BeastieFurets