Lambda at Weather Scale
Robbie Strickland
Cassandra Summit 2015
Who Am I?
Robbie Strickland
Director of Engineering, Analytics
rstrickland@weather.com
@rs_atl
Who Am I?
● Contributor to C* community since 2010
● DataStax MVP 2014/15
● Author, Cassandra High Availability
● Founder, ATL Cassandra User Group
About TWC
● ~30 billion API requests per day
● ~120 million active mobile users
● #3 most active mobile user base
● ~360 PB of traffic daily
● Most weather data comes from us
Use Case
● Billions of events per day
○ Web/mobile beacons
○ Logs
○ Weather conditions + forecasts
○ etc.
● Keep data forever
Use Case
● Efficient batch + streaming analysis
● Self-serve data science
● BI / visualization tool support
Architecture
Attempt[0] Architecture
[Architecture diagram. Labels recovered from the slide:
Sources (streaming and batch): events, 3rd party, other DBs, S3.
Storage and Processing: RESTful enqueue service, Kafka, custom streaming ingestion pipeline, ETL, stream processing, batch.
Data Access: SQL.
Consumers: operational analytics, business analytics, executive dashboards, data discovery, data science, 3rd-party system integration.]
Attempt[0] Data Model
CREATE TABLE events (
  timebucket bigint,
  timestamp bigint,
  eventtype varchar,
  eventid varchar,
  platform varchar,
  userid varchar,
  version int,
  appid varchar,
  useragent varchar,
  eventdata varchar,
  tags set<varchar>,
  devicedata map<varchar, varchar>,
  PRIMARY KEY ((timebucket, eventtype), timestamp, eventid)
) WITH CACHING = 'none'
  AND COMPACTION = { 'class' : 'DateTieredCompactionStrategy' };
Event payload == schema-less JSON
Partitioned by time bucket + type
Time-series data good fit for DTCS
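The composite partition key (timebucket, eventtype) implies that each event is assigned to a bucket derived from its timestamp. A minimal sketch of that derivation follows. It assumes epoch-millisecond timestamps and hourly buckets; the bucket width and the function names are illustrative assumptions, not details from the talk.

```python
# Assumed bucket width: one hour, expressed in milliseconds.
BUCKET_MS = 60 * 60 * 1000


def time_bucket(timestamp_ms: int, bucket_ms: int = BUCKET_MS) -> int:
    """Return the start (epoch ms) of the bucket that contains timestamp_ms."""
    return timestamp_ms - (timestamp_ms % bucket_ms)


def partition_key(timestamp_ms: int, eventtype: str) -> tuple:
    """Build the (timebucket, eventtype) pair used as the partition key."""
    return (time_bucket(timestamp_ms), eventtype)
```

A narrower bucket spreads writes over more partitions but means more partitions to read for a given time range, so the width is a tuning choice per workload.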
Attempt[0] tl;dr
● C* everywhere
● Streaming data via custom ingest process
● Kafka backed by RESTful service
● Batch data via Informatica
● Spark SQL through ODBC
● Schema-less event payload
● Date-tiered compaction
Attempt[0] Lessons
● Batch loading large data sets into C* is silly
● … and expensive
● … and using Informatica to do it is SLOW
● Kafka + REST services == unnecessary
● No viable open source C* Hive driver
● DTCS is broken (see CASSANDRA-9666)
Attempt[0] Lessons
● Schema-less == bad:
○ Must parse JSON to extract key data
○ Expensive to analyze by event type
○ Cannot tune by event type
Attempt[1] Architecture
[Architecture diagram ("Data Lake"). Labels recovered from the slide:
Sources (streaming and batch): events, 3rd party, other DBs, S3.
Ingestion: Amazon SQS, custom streaming ingestion pipeline, ETL.
Storage: long-term raw storage; short-term storage and big data processing; stream processing.
Data Access: SQL.
Consumers: operational analytics, business analytics, executive dashboards, data discovery, data science, 3rd-party system integration.]
Attempt[1] Data Model
● Each event type gets its own table
● Tables individually tuned based on workload
● Schema applied at ingestion:
○ We’re reading everything anyway
○ Makes subsequent analysis much easier
○ Allows us to filter junk early
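Applying the schema at ingestion can be sketched roughly as below: parse each raw JSON event once, check it against a per-type schema, drop junk, and route the row to that type's table. The schemas, field names, and the `events_<type>` table naming are hypothetical placeholders, not the schemas used at TWC.

```python
import json

# Hypothetical per-event-type schemas: required field -> expected Python type.
SCHEMAS = {
    "pageview": {"eventid": str, "userid": str, "timestamp": int, "url": str},
    "app_launch": {"eventid": str, "userid": str, "timestamp": int, "version": int},
}


def apply_schema(raw: str):
    """Parse one raw JSON event; return (table_name, row), or None for junk."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None  # unparseable payload: filter it out early
    if not isinstance(event, dict):
        return None
    eventtype = event.get("eventtype")
    if not isinstance(eventtype, str) or eventtype not in SCHEMAS:
        return None  # unknown or missing event type
    row = {}
    for field, expected in SCHEMAS[eventtype].items():
        value = event.get(field)
        if not isinstance(value, expected):
            return None  # required field is missing or has the wrong type
        row[field] = value
    # Assumed naming convention: one table per event type.
    return "events_" + eventtype, row
```

Because every event is read at ingestion anyway, the validation adds little cost, and downstream jobs no longer need to parse JSON.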
Attempt[1] tl;dr
● Use C* for streaming data
○ Rolling time window (TTL depends on type)
○ Real-time access to events
○ Data locality makes Spark jobs faster
Attempt[1] tl;dr
● Everything else in S3
○ Batch data loads (mostly logs)
○ Daily C* backups
○ Stored as Parquet
○ Cheap, scalable long-term storage
○ Easy access from Spark
○ Easy to share internally & externally
○ Open source Hive support
Attempt[1] tl;dr
● Kafka replaced by SQS:
○ Scalable & reliable
○ Already fronted by a RESTful interface
○ Nearly free to operate (nothing to manage)
○ Robust security model
○ One queue per event type/platform
○ Built-in monitoring
Attempt[1] tl;dr
● STCS in lieu of DTCS (and LCS)
○ Because it’s bulletproof
○ Partitions spanning sstables is acceptable
○ Testing Time-Window compaction (thanks Jeff Jirsa)
Fine Print
● Use C* >= 2.1.8
○ CASSANDRA-9637 - fixes Spark input split computation
○ CASSANDRA-9549 - fixes memory leak
○ CASSANDRA-9436 - exposes rpc/broadcast addresses for Spark/cloud environments
● Version incompatibilities abound (check sbt file for Spark-Cassandra connector)
Fine Print
● Two main Spark clusters:
○ Co-located with C* for heavy analysis
■ Predictable load
■ Efficient C* access
○ Self-serve in same DC but not co-located
■ Unpredictable load
■ Favors mining S3 data
■ Isolated from production jobs
Data Modeling
Partitioning
● Opposite strategy from “normal” C* modeling
○ Model for good parallelism
○ … not for single-partition queries
● Avoid shuffling for most cases
○ Shuffles occur when NOT grouping by partition key
○ Partition for your most common grouping
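The shuffle point can be illustrated with a toy model in plain Python (no Spark or Cassandra involved). Records are placed on nodes by a hash of the partition key, and the sketch counts how many records would have to change nodes before a group-by can finish. The byte-sum hash is a deliberately simple stand-in for real token hashing, and all names here are illustrative assumptions.

```python
from collections import Counter, defaultdict


def toy_node(key: str, num_nodes: int) -> int:
    """Toy stand-in for token hashing: sum of the key's bytes modulo node count."""
    return sum(key.encode("utf-8")) % num_nodes


def place(records, partition_field, num_nodes):
    """Assign each record to a node based on its partition-key field."""
    nodes = defaultdict(list)
    for rec in records:
        nodes[toy_node(rec[partition_field], num_nodes)].append(rec)
    return nodes


def records_to_move(nodes, group_field):
    """Count records that must leave their node so each group sits on one node.

    For every group, the node already holding most of its records keeps them;
    the records on all other nodes are counted as moved (shuffled).
    """
    per_group = defaultdict(Counter)
    for node_id, recs in nodes.items():
        for rec in recs:
            per_group[rec[group_field]][node_id] += 1
    return sum(sum(c.values()) - max(c.values()) for c in per_group.values())


events = [
    {"eventtype": "click", "platform": "ios"},
    {"eventtype": "click", "platform": "android"},
    {"eventtype": "view", "platform": "ios"},
    {"eventtype": "view", "platform": "android"},
]
nodes = place(events, "eventtype", num_nodes=2)
moved_by_partition_key = records_to_move(nodes, "eventtype")
moved_by_other_field = records_to_move(nodes, "platform")
```

In this toy data set, grouping by the partition key (eventtype) moves no records, while grouping by platform moves two of the four, which is why the partition key should match the most common grouping.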
Secondary Indexes
● Useful for C*-level filtering
● Reduces Spark workload and RAM footprint
● Low cardinality is still the rule
Secondary Indexes (Client Access)
Secondary Indexes (with Spark)
Full-text Indexes
● Enabled via Stratio-Lucene custom index (https://github.com/Stratio/cassandra-lucene-index)
● Great for C*-side filters
● Same access pattern as secondary indexes
Full-text Indexes
CREATE CUSTOM INDEX email_index on emails(lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
  'refresh_seconds':'1',
  'schema': '{
    fields: {
      id : {type : "integer"},
      user : {type : "string"},
      subject : {type : "text", analyzer : "english"},
      body : {type : "text", analyzer : "english"},
      time : {type : "date", pattern : "yyyy-MM-dd HH:mm:ss"}
    }
  }'
};
Full-text Indexes
SELECT * FROM emails WHERE lucene='{
  filter : {type:"range", field:"time", lower:"2015-05-26 20:29:59"},
  query : {type:"phrase", field:"subject", values:["test"]}
}';

SELECT * FROM emails WHERE lucene='{
  filter : {type:"range", field:"time", lower:"2015-05-26 18:29:59"},
  query : {type:"fuzzy", field:"subject", value:"thingy", max_edits:1}
}';
Caution: Wide Rows
● It only takes one to ruin your day
● Monitor cfstats for max partition bytes
● Use toppartitions to find hot keys
Avoid Nulls
● Nulls are deletes
● Deletes create tombstones
● Don’t write nulls!
● Beware of nulls in prepared statements
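One way to avoid writing nulls is to build each INSERT from only the columns that actually have values, so absent fields are never bound as null. The helper below is a hypothetical sketch, not a driver API. The `%s` placeholder style is an assumption, so adapt it to whatever your driver expects.

```python
def build_insert(table: str, row: dict):
    """Return (cql, params) for an INSERT that names only non-None columns."""
    # Drop None values so no column is written as null (and no tombstone results).
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns")
    columns = ", ".join(present)
    placeholders = ", ".join(["%s"] * len(present))
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return cql, list(present.values())
```

The trade-off is that each distinct set of columns yields a different statement, so a workload with many sparse shapes will need more prepared statements than a single fixed INSERT would.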
Data Exploration
Data Warehouse Paradigm - Old
Ingest → Model → Transform → Design → Visualize
Data Warehouse Paradigm - New
Ingest → Explore → Analyze → Deploy → Visualize
Visualization
● Critical to understanding your data
● Reduced time to visualization
● … from >1 month to minutes (!!)
● Waterfall to agile
Zeppelin
● Open source Spark notebook
● Interpreters for Scala, Python, Spark SQL, CQL, Hive, Shell, & more
● Data visualizations
● Scheduled jobs
Final Thoughts
Should I use DSE?
● Open source culture?
● On-staff C* expert(s)?
● Willingness to contribute/fix stuff?
● Moderate degree of risk is acceptable?
● Need/desire for latest features?
● Need/desire to control tool versions?
● Don’t have the budget for licensing?
We’re Hiring!
Robbie Strickland
rstrickland@weather.com

 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 

Último

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Lambda at Weather Scale - Cassandra Summit 2015

  • 1. Lambda at Weather Scale Robbie Strickland
  • 2. Who Am I? Robbie Strickland Director of Engineering, Analytics rstrickland@weather.com @rs_atl
  • 3. Who Am I? ● Contributor to C* community since 2010 ● DataStax MVP 2014/15 ● Author, Cassandra High Availability ● Founder, ATL Cassandra User Group
  • 4. About TWC ● ~30 billion API requests per day ● ~120 million active mobile users ● #3 most active mobile user base ● ~360 PB of traffic daily ● Most weather data comes from us
  • 5. Use Case ● Billions of events per day ○ Web/mobile beacons ○ Logs ○ Weather conditions + forecasts ○ etc. ● Keep data forever
  • 6. Use Case ● Efficient batch + streaming analysis ● Self-serve data science ● BI / visualization tool support
  • 8. Attempt[0] Architecture [diagram; labels: Streaming Sources, Batch Sources, Events, 3rd Party, Other DBs, S3, RESTful Enqueue service, Kafka, Custom Ingestion Pipeline, ETL, Streaming, Stream Processing, Storage and Processing, Data Access, SQL, Consumers, Operational Analytics, Business Analytics, Executive Dashboards, Data Discovery, Data Science, 3rd Party System Integration]
  • 9. Attempt[0] Data Model
      CREATE TABLE events (
        timebucket bigint,
        timestamp bigint,
        eventtype varchar,
        eventid varchar,
        platform varchar,
        userid varchar,
        version int,
        appid varchar,
        useragent varchar,
        eventdata varchar,
        tags set<varchar>,
        devicedata map<varchar, varchar>,
        PRIMARY KEY ((timebucket, eventtype), timestamp, eventid)
      ) WITH CACHING = 'none'
        AND COMPACTION = { 'class' : 'DateTieredCompactionStrategy' };
  • 10. Attempt[0] Data Model (same CREATE TABLE as slide 9) Event payload == schema-less JSON
  • 11. Attempt[0] Data Model (same CREATE TABLE as slide 9) Partitioned by time bucket + type
  • 12. Attempt[0] Data Model (same CREATE TABLE as slide 9) Time-series data good fit for DTCS
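A minimal sketch of how the composite partition key `(timebucket, eventtype)` from the table above might be derived for an incoming event. The slides do not state the bucket width or the timestamp unit, so the one-hour bucket, the epoch-millisecond timestamps, and the helper names `time_bucket` and `partition_key` are assumptions made for illustration only.

```python
BUCKET_MS = 60 * 60 * 1000  # assumed bucket width: one hour, in milliseconds


def time_bucket(timestamp_ms: int, bucket_ms: int = BUCKET_MS) -> int:
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    return timestamp_ms - (timestamp_ms % bucket_ms)


def partition_key(timestamp_ms: int, eventtype: str) -> tuple:
    """Mirror the table's partition key: (timebucket, eventtype)."""
    return (time_bucket(timestamp_ms), eventtype)


if __name__ == "__main__":
    # Two events of the same type within the same hour share a partition.
    print(partition_key(3_600_005, "pageview"))
    print(partition_key(7_199_999, "pageview"))
```

A narrower bucket gives smaller partitions and more of them; a wider bucket does the opposite. Which trade-off suits a given event type depends on its volume.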
  • 13. Attempt[0] tl;dr ● C* everywhere ● Streaming data via custom ingest process ● Kafka backed by RESTful service ● Batch data via Informatica ● Spark SQL through ODBC ● Schema-less event payload ● Date-tiered compaction
  • 15. Attempt[0] Lessons ● Batch loading large data sets into C* is silly ● … and expensive ● … and using Informatica to do it is SLOW ● Kafka + REST services == unnecessary ● No viable open source C* Hive driver ● DTCS is broken (see CASSANDRA-9666)
  • 16. Attempt[0] Lessons ● Schema-less == bad: ○ Must parse JSON to extract key data ○ Expensive to analyze by event type ○ Cannot tune by event type
  • 17. Attempt[1] Architecture [diagram; labels: Streaming Sources, Batch Sources, Events, 3rd Party, Other DBs, S3, Amazon SQS, Custom Ingestion Pipeline, ETL, Streaming, Stream Processing, Short Term Storage and Big Data Processing, Long Term Raw Storage, Data Lake, Data Access, SQL, Consumers, Operational Analytics, Business Analytics, Executive Dashboards, Data Discovery, Data Science, 3rd Party System Integration]
  • 18. Attempt[1] Data Model ● Each event type gets its own table ● Tables individually tuned based on workload ● Schema applied at ingestion: ○ We’re reading everything anyway ○ Makes subsequent analysis much easier ○ Allows us to filter junk early
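One way "schema applied at ingestion" could look is sketched below: each raw JSON event is parsed once, checked against a per-type schema, and either routed to that type's table or dropped as junk. The event types, field names, and the `SCHEMAS` and `apply_schema` names are hypothetical; the slides do not show the actual ingestion code.

```python
import json

# Hypothetical per-event-type schemas: column name -> required Python type.
SCHEMAS = {
    "pageview": {"eventid": str, "userid": str, "timestamp": int, "url": str},
    "app_launch": {"eventid": str, "userid": str, "timestamp": int, "version": int},
}


def apply_schema(raw: str):
    """Parse one raw JSON event; return (table_name, row) or None for junk."""
    try:
        event = json.loads(raw)
    except ValueError:
        return None  # not valid JSON: drop early
    if not isinstance(event, dict):
        return None
    eventtype = event.get("eventtype")
    if not isinstance(eventtype, str) or eventtype not in SCHEMAS:
        return None  # unknown or missing event type
    row = {}
    for column, expected in SCHEMAS[eventtype].items():
        value = event.get(column)
        if not isinstance(value, expected):
            return None  # missing or mistyped field
        row[column] = value
    # In this sketch each event type maps to a table of the same name.
    return eventtype, row


if __name__ == "__main__":
    good = '{"eventtype": "pageview", "eventid": "e1", "userid": "u1", "timestamp": 1, "url": "/"}'
    print(apply_schema(good))
    print(apply_schema("not json"))
```

Because the typed row is produced once at ingestion, later analysis does not need to re-parse JSON to find key fields.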
  • 19. Attempt[1] tl;dr ● Use C* for streaming data ○ Rolling time window (TTL depends on type) ○ Real-time access to events ○ Data locality makes Spark jobs faster
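The rolling time window with a per-type TTL can be expressed with CQL's `USING TTL` clause on each insert. The sketch below only builds the statement text; the table names, TTL values, and the `insert_with_ttl` helper are assumptions for illustration, and a table-level `default_time_to_live` would be an alternative way to get the same effect.

```python
DAY = 24 * 60 * 60  # seconds

# Hypothetical retention per event-type table, in seconds.
TTL_SECONDS = {"pageview": 30 * DAY, "app_launch": 90 * DAY}
DEFAULT_TTL = 7 * DAY  # assumed fallback for types not listed above


def insert_with_ttl(table, columns):
    """Build a parameterized INSERT whose rows expire after the table's TTL."""
    ttl = TTL_SECONDS.get(table, DEFAULT_TTL)
    names = ", ".join(columns)
    markers = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({names}) VALUES ({markers}) USING TTL {ttl}"


if __name__ == "__main__":
    print(insert_with_ttl("pageview", ["eventid", "userid"]))
```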
  • 20. Attempt[1] tl;dr ● Everything else in S3 ○ Batch data loads (mostly logs) ○ Daily C* backups ○ Stored as Parquet ○ Cheap, scalable long-term storage ○ Easy access from Spark ○ Easy to share internally & externally ○ Open source Hive support
  • 21. Attempt[1] tl;dr ● Kafka replaced by SQS: ○ Scalable & reliable ○ Already fronted by a RESTful interface ○ Nearly free to operate (nothing to manage) ○ Robust security model ○ One queue per event type/platform ○ Built-in monitoring
  • 22. Attempt[1] tl;dr ● STCS in lieu of DTCS (and LCS) ○ Because it’s bulletproof ○ Partitions spanning sstables is acceptable ○ Testing Time-Window compaction (thanks Jeff Jirsa)
  • 24. Fine Print ● Use C* >= 2.1.8 ○ CASSANDRA-9637 - fixes Spark input split computation ○ CASSANDRA-9549 - fixes memory leak ○ CASSANDRA-9436 - exposes rpc/broadcast addresses for Spark/cloud environments ● Version incompatibilities abound (check sbt file for Spark-Cassandra connector)
  • 25. Fine Print ● Two main Spark clusters: ○ Co-located with C* for heavy analysis ■ Predictable load ■ Efficient C* access ○ Self-serve in same DC but not co-located ■ Unpredictable load ■ Favors mining S3 data ■ Isolated from production jobs
  • 27. Partitioning ● Opposite strategy from “normal” C* modeling ○ Model for good parallelism ○ … not for single-partition queries ● Avoid shuffling for most cases ○ Shuffles occur when NOT grouping by partition key ○ Partition for your most common grouping
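The shuffle point can be illustrated with a toy model: rows live on the node chosen by hashing their partition key, and a group-by needs every row of a group on one node. The sketch counts how many rows would have to move. It is not Spark's or Cassandra's actual partitioner; the CRC32 hash, the three-node cluster, and the sample columns are all assumptions.

```python
import zlib


def node_for(value, num_nodes):
    """Toy placement: hash a value onto one of num_nodes nodes."""
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def rows_to_shuffle(rows, partition_col, group_col, num_nodes=3):
    """Count rows whose home node differs from the node that owns their group."""
    moved = 0
    for row in rows:
        home = node_for(row[partition_col], num_nodes)
        target = node_for(row[group_col], num_nodes)
        if home != target:
            moved += 1
    return moved


if __name__ == "__main__":
    rows = [
        {"userid": f"u{i}", "platform": "ios" if i % 2 else "android"}
        for i in range(100)
    ]
    # Grouping by the partition key: every row is already where it needs to be.
    print(rows_to_shuffle(rows, "userid", "userid"))
    # Grouping by a different column: some rows must cross the network.
    print(rows_to_shuffle(rows, "userid", "platform"))
```

Grouping by the partition column always yields zero moved rows in this model, which is the reason to partition for the most common grouping.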
  • 28. Secondary Indexes ● Useful for C*-level filtering ● Reduces Spark workload and RAM footprint ● Low cardinality is still the rule
  • 31. Full-text Indexes ● Enabled via Stratio-Lucene custom index (https://github.com/Stratio/cassandra-lucene-index) ● Great for C*-side filters ● Same access pattern as secondary indexes
  • 32. Full-text Indexes
      CREATE CUSTOM INDEX email_index on emails(lucene)
      USING 'com.stratio.cassandra.lucene.Index'
      WITH OPTIONS = {
        'refresh_seconds':'1',
        'schema': '{
          fields: {
            id : {type : "integer"},
            user : {type : "string"},
            subject : {type : "text", analyzer : "english"},
            body : {type : "text", analyzer : "english"},
            time : {type : "date", pattern : "yyyy-MM-dd hh:mm:ss"}
          }
        }'
      };
  • 33. Full-text Indexes
      SELECT * FROM emails WHERE lucene='{
        filter : {type:"range", field:"time", lower:"2015-05-26 20:29:59"},
        query : {type:"phrase", field:"subject", values:["test"]}
      }';
      SELECT * FROM emails WHERE lucene='{
        filter : {type:"range", field:"time", lower:"2015-05-26 18:29:59"},
        query : {type:"fuzzy", field:"subject", value:"thingy", max_edits:1}
      }';
  • 35. Wide Rows ● It only takes one to ruin your day ● Monitor cfstats for max partition bytes ● Use toppartitions to find hot keys
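Monitoring for wide partitions could be automated by scanning `nodetool cfstats` output for the per-table maximum partition size. The sketch below assumes the output contains `Table:` and `Compacted partition maximum bytes:` lines, which may differ between Cassandra versions; the 100 MB threshold and the `wide_partitions` helper are assumptions, not a recommendation from the slides.

```python
import re

# Line labels assumed from nodetool cfstats output; verify against your version.
TABLE_RE = re.compile(r"^\s*Table(?: \(index\))?:\s*(\S+)")
MAX_RE = re.compile(r"^\s*Compacted partition maximum bytes:\s*(\d+)")


def wide_partitions(cfstats_text, threshold_bytes=100 * 1024 * 1024):
    """Return {table: max_partition_bytes} for tables above the threshold."""
    table = None
    offenders = {}
    for line in cfstats_text.splitlines():
        match = TABLE_RE.match(line)
        if match:
            table = match.group(1)
            continue
        match = MAX_RE.match(line)
        if match and table is not None:
            size = int(match.group(1))
            if size > threshold_bytes:
                offenders[table] = size
    return offenders


if __name__ == "__main__":
    # Illustrative text only, not captured from a real cluster.
    sample = (
        "Keyspace: analytics\n"
        "\t\tTable: pageview\n"
        "\t\tCompacted partition maximum bytes: 2000000000\n"
        "\t\tTable: app_launch\n"
        "\t\tCompacted partition maximum bytes: 1024\n"
    )
    print(wide_partitions(sample))
```

Any table this flags is a candidate for a `nodetool toppartitions` run to find the specific hot keys.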
  • 36. Avoid Nulls ● Nulls are deletes ● Deletes create tombstones ● Don’t write nulls! ● Beware of nulls in prepared statements
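A common way to avoid writing nulls is to leave absent columns out of the INSERT entirely, rather than binding `None` to a placeholder. The sketch below builds such a statement from a row dictionary; the `build_insert` helper and the sample column names are hypothetical.

```python
def build_insert(table, row):
    """Build an INSERT that names only the columns with non-null values.

    Returns (cql, params). Columns whose value is None are omitted, so no
    null is written and no tombstone is created for them.
    """
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns")
    names = ", ".join(present)
    markers = ", ".join("?" for _ in present)
    cql = f"INSERT INTO {table} ({names}) VALUES ({markers})"
    return cql, list(present.values())


if __name__ == "__main__":
    cql, params = build_insert(
        "events", {"eventid": "e1", "platform": None, "userid": "u1"}
    )
    print(cql)
    print(params)
```

The trade-off is one prepared statement per distinct combination of present columns, so this works best when the number of such combinations is small.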
  • 38. Data Warehouse Paradigm - Old Ingest Model Transform Design Visualize
  • 39. Data Warehouse Paradigm - New Ingest Explore Analyze Deploy Visualize
  • 40. Visualization ● Critical to understanding your data ● Reduced time to visualization ● … from >1 month to minutes (!!) ● Waterfall to agile
  • 41. Zeppelin ● Open source Spark notebook ● Interpreters for Scala, Python, Spark SQL, CQL, Hive, Shell, & more ● Data visualizations ● Scheduled jobs
  • 46. Should I use DSE? ● Open source culture? ● On-staff C* expert(s)? ● Willingness to contribute/fix stuff? ● Moderate degree of risk is acceptable? ● Need/desire for latest features? ● Need/desire to control tool versions? ● Don’t have the budget for licensing?