SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
Scalable data modelling by example
Carlos Alonso (@calonso)
Carlos Alonso
2
• Ex-Londoner
• MSc Salamanca University, Spain
• Software Engineer @ Jobandtalent
• Cassandra certified developer
• Datastax Cassandra MVP 2015 & 2016
• @calonso / http://mrcalonso.com
Jobandtalent
3
• Revolutionising how people find jobs and how businesses
hire employees.
• Leveraging data to produce a unique job matching
technology.
• 10M+ users and 150K+ companies worldwide
• @jobandtalentEng / http://jobandtalent.com
• We are hiring!!
Cassandra Concepts
The data model is the only thing you can’t change
once in production.
Data organisation
6
Token
Physical Data Layout
7
Consistent Hashing
8
Hash function
“Carlos” 185664
1773456738847666528349
-894763734895827651234
Replication factor
How many copies (replicas) for your data
9
Consistency Level
How many replicas of your data must acknowledge?
10
A complete read/write example
11
DriverClient
Partitioner
f81d4fae-…
834
• RF = 3
• CL = QUORUM
• SELECT * … WHERE id = f81d4fae-…
12
13
Data Modelling
Data Modelling
15
• Understand your data
• Decide (know) how you’ll query the data
• Define column families to satisfy those queries
• Implement and optimise
Data Modelling
16
Conceptual
Model
Logical
Model
Physical
Model
Query-Driven
Methodology
Analysis &
Validation
Query Driven Methodology: goals
17
• Spread data evenly around the cluster
• Minimise the number of partitions read
• Keep partitions manageable
Query Driven Methodology: process
18
• Entities and relationships: map to tables
• Key attributes: map to primary key columns
• Equality search attributes: must be at the beginning of the primary key
• Inequality search attributes: become clustering columns
• Ordering attributes: become clustering columns
The Primary Key
19
PARTITION
KEY
+
CLUSTERING
COLUMN(S)
CREATE TABLE . . .(
fields . . .
PRIMARY KEY (part_key, clust1, . . .)
);
Analysis & Validation
20
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• How much data duplication? (batches)
An E-Library project.
Requirement: 1
22
Books can be uniquely identified and accessed by ISBN, we also need a title, genre, author and publisher.
Book
ISBN K
Title
Author
Genre
Publisher
QDM
Q1
Q1: Find books by ISBN
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (5 - 1 - 0) + 0 < 1M
• How much data duplication? 0
Analysis & Validation
23
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
Physical data model
24
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
CREATE TABLE books (
ISBN VARCHAR PRIMARY KEY,
title VARCHAR,
author VARCHAR,
genre VARCHAR,
publisher VARCHAR
);
SELECT * FROM books WHERE ISBN = ‘…’;
Requirement 2
25
Users register into the system uniquely identified by an email and a password. We also want their full name. They will be
accessed by email and password or internal unique ID.
Users_by_ID
KID
full_name
QDM
Q1
Q1: Find users by ID
Users_by_login_info
email K
password K
full_name
ID
Q2
Q2: Find users by login info
Q3: Find users by email
(to guarantee uniqueness)
Q3
C
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (2 - 1 - 0) + 0 < 1M
• How much data duplication? 0
Analysis & Validation
26
K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
Physical Data Model
27
CREATE TABLE users_by_id (
ID TIMEUUID PRIMARY KEY,
full_name VARCHAR
);
SELECT * FROM users_by_id WHERE ID = …;
K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (4 - 1 - 0) + 0 < 1M
• How much data duplication? 1
Analysis & Validation
28
Q2: Find users by login info
Users_by_login_info
email K
password C
full_name
ID
Q3: Find users by email
(to guarantee uniqueness)
Physical Data Model
29
CREATE TABLE users_by_login_info (
email VARCHAR,
password VARCHAR,
full_name VARCHAR,
ID TIMEUUID,
PRIMARY KEY (email, password)
);
SELECT * FROM users_by_login_info
WHERE email = ‘…’ [AND password = ‘…’];
Q2: Find users by login info
Users_by_login_info
email K
password C
full_name
ID
Q3: Find users by email
(to guarantee uniqueness)
Physical Data Model
30
BEGIN BATCH
INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS;
INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);
APPLY BATCH;
Requirement 3
31
Users read books.
We want to know which books has a user read and
show them sorted by title and author
Books_read_by_user
Kuser_ID
QDM
Q1: Find all books a logged
user has read
Q1
ISBN
genre
publisher
title
author
C
C
full_name S
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Books X (7 - 1 - 1) + 1 < 1M => 200,000 books per user
• How much data duplication? 0
Analysis & Validation
32
Q1: Find all books a logged
user has read
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
S
Physical Data Model
33
CREATE TABLE books_read_by_user (
user_id TIMEUUID,
title VARCHAR,
author VARCHAR,
full_name VARCHAR STATIC,
ISBN VARCHAR,
genre VARCHAR,
publisher VARCHAR,
PRIMARY KEY (user_id, title, author)
);
SELECT * FROM books_read_by_user
WHERE user_ID = …;
Q1: Find all books a logged
user has read
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
S
Requirement 4
34
In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have
with our site.
element
type
user_ID K
Actions_by_user
QDM
Q1
Q1: Find all actions a user
does in a time range
time C
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (4 - 1 - 0) + 0 < 1M => 333.333
• How much data duplication? 0
Analysis & Validation
35
K
Actions_by_user
user_ID
Q1
Q1: Find all actions a user
does in a time range
time
element
type
C
Requirement 4: Bucketing
36
time
element
type
user_ID K
Actions_by_user
month K
C
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (5 - 2 - 0) + 0 < 1M => 333.333 

per user every <bucket_size>
bucket_size = 1 year => 38 actions / h
bucket_size = 1 month => 462 actions / h
bucket_size = 1 week => 1984 actions / h
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (5 - 2 - 0) + 0 < 1M => 333.333 / month
• How much data duplication? 0
Analysis & Validation
37
K
Actions_by_user
user_ID
month
Q1
Q1: Find all actions a user
does in a time range
K
time
element
type
C
Physical Data Model
38
CREATE TABLE actions_by_user (
user_ID TIMEUUID,
month INT,
time TIMESTAMP,
element VARCHAR,
type VARCHAR,
PRIMARY KEY ((user_ID, month), time)
);
SELECT * FROM actions_by_user
WHERE user_ID = … AND month = … AND time < … AND time > …;
K
Actions_by_user
user_ID
month
Q1
Q1: Find all actions a user
does in a time range
K
time
element
type
C
Further validation
39
∑sizeOf(pk) + ∑sizeOf(sc) + Nr x ∑(sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv < 200 MB
– pk = Partition Key column
– sc = Static column
– Nr = Number of rows
– rc = Regular column
– clc = Clustering column
– Nv = Number of values
Next Steps
40
• Test your models against your hardware setup
– cassandra-stress
– http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)
• Monitor everything
– DataStax OpsCenter
– Graphite
– Datadog
– . . .
Thanks!
Carlos Alonso
Software Engineer
@calonso

Más contenido relacionado

La actualidad más candente

Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...DataStax
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016DataStax
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...DataStax
 
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...DataStax
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016DataStax
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraOlga Lavrentieva
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016DataStax
 
DataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax Academy
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark Summit
 
Cassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityCassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityHiromitsu Komatsu
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in CassandraShogo Hoshii
 
Processing 50,000 events per second with Cassandra and Spark
Processing 50,000 events per second with Cassandra and SparkProcessing 50,000 events per second with Cassandra and Spark
Processing 50,000 events per second with Cassandra and SparkBen Slater
 
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...DataStax
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchPatricia Gorla
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...DataStax
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...DataStax
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformDataStax Academy
 

La actualidad más candente (20)

Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
Micro-batching: High-performance Writes (Adam Zegelin, Instaclustr) | Cassand...
 
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
 
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016
 
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности CassandraАндрей Козлов (Altoros): Оптимизация производительности Cassandra
Андрей Козлов (Altoros): Оптимизация производительности Cassandra
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
Monitoring Cassandra at Scale (Jason Cacciatore, Netflix) | C* Summit 2016
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
DataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and Analytics
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
Cassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityCassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra Community
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in Cassandra
 
Processing 50,000 events per second with Cassandra and Spark
Processing 50,000 events per second with Cassandra and SparkProcessing 50,000 events per second with Cassandra and Spark
Processing 50,000 events per second with Cassandra and Spark
 
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise Search
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...
Lessons Learned From Running 1800 Clusters (Brooke Jensen, Instaclustr) | Cas...
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 

Destacado

Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Holden Ackerman
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
A Benchmark Test on Presto, Spark Sql and Hive on Tez
A Benchmark Test on Presto, Spark Sql and Hive on TezA Benchmark Test on Presto, Spark Sql and Hive on Tez
A Benchmark Test on Presto, Spark Sql and Hive on TezGw Liu
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkDongwon Kim
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache CassandraPatrick McFadin
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterprisePatrick McFadin
 
Help! I want to contribute to an Open Source project but my boss says no.
Help! I want to contribute to an Open Source project but my boss says no.Help! I want to contribute to an Open Source project but my boss says no.
Help! I want to contribute to an Open Source project but my boss says no.Patrick McFadin
 
Cassandra 3.0 advanced preview
Cassandra 3.0 advanced previewCassandra 3.0 advanced preview
Cassandra 3.0 advanced previewPatrick McFadin
 
Advanced data modeling with apache cassandra
Advanced data modeling with apache cassandraAdvanced data modeling with apache cassandra
Advanced data modeling with apache cassandraPatrick McFadin
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkDongwon Kim
 

Destacado (11)

Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
A Benchmark Test on Presto, Spark Sql and Hive on Tez
A Benchmark Test on Presto, Spark Sql and Hive on TezA Benchmark Test on Presto, Spark Sql and Hive on Tez
A Benchmark Test on Presto, Spark Sql and Hive on Tez
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
 
Help! I want to contribute to an Open Source project but my boss says no.
Help! I want to contribute to an Open Source project but my boss says no.Help! I want to contribute to an Open Source project but my boss says no.
Help! I want to contribute to an Open Source project but my boss says no.
 
Cassandra 3.0 advanced preview
Cassandra 3.0 advanced previewCassandra 3.0 advanced preview
Cassandra 3.0 advanced preview
 
Advanced data modeling with apache cassandra
Advanced data modeling with apache cassandraAdvanced data modeling with apache cassandra
Advanced data modeling with apache cassandra
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 

Similar a Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchSylvain Wallez
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasRemaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasMongoDB
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAmazon Web Services
 
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...QBiC_Tue
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...NETWAYS
 
A Fast and Efficient Time Series Storage Based on Apache Solr
A Fast and Efficient Time Series Storage Based on Apache SolrA Fast and Efficient Time Series Storage Based on Apache Solr
A Fast and Efficient Time Series Storage Based on Apache SolrQAware GmbH
 
Chronix: A fast and efficient time series storage based on Apache Solr
Chronix: A fast and efficient time series storage based on Apache SolrChronix: A fast and efficient time series storage based on Apache Solr
Chronix: A fast and efficient time series storage based on Apache SolrFlorian Lautenschlager
 
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB
 
Webinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBWebinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBMongoDB
 
MongoDB Best Practices
MongoDB Best PracticesMongoDB Best Practices
MongoDB Best PracticesLewis Lin 🦊
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindEMC
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfphongnguyen312110237
 

Similar a Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016 (20)

Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
AWS Webcast - Dynamo DB
AWS Webcast - Dynamo DBAWS Webcast - Dynamo DB
AWS Webcast - Dynamo DB
 
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB SchemasRemaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
Remaining Agile with Billions of Documents: Appboy and Creative MongoDB Schemas
 
Spanner (may 19)
Spanner (may 19)Spanner (may 19)
Spanner (may 19)
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
 
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
Data Management for Quantitative Biology - Database Systems (continued) LIMS ...
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
 
A Fast and Efficient Time Series Storage Based on Apache Solr
A Fast and Efficient Time Series Storage Based on Apache SolrA Fast and Efficient Time Series Storage Based on Apache Solr
A Fast and Efficient Time Series Storage Based on Apache Solr
 
Chronix: A fast and efficient time series storage based on Apache Solr
Chronix: A fast and efficient time series storage based on Apache SolrChronix: A fast and efficient time series storage based on Apache Solr
Chronix: A fast and efficient time series storage based on Apache Solr
 
Deep Dive - DynamoDB
Deep Dive - DynamoDBDeep Dive - DynamoDB
Deep Dive - DynamoDB
 
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
 
Webinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDBWebinar: Best Practices for Getting Started with MongoDB
Webinar: Best Practices for Getting Started with MongoDB
 
MongoDB Best Practices
MongoDB Best PracticesMongoDB Best Practices
MongoDB Best Practices
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
 
The new time series kid on the block
The new time series kid on the blockThe new time series kid on the block
The new time series kid on the block
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdf
 

Más de DataStax

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?DataStax
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...DataStax
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsDataStax
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphDataStax
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyDataStax
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...DataStax
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache KafkaDataStax
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseDataStax
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0DataStax
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...DataStax
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesDataStax
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDataStax
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudDataStax
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceDataStax
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...DataStax
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...DataStax
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...DataStax
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)DataStax
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsDataStax
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingDataStax
 

Más de DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Último

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 

Último (20)

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 

Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

  • 1. Scalable data modelling by example Carlos Alonso (@calonso)
  • 2. Carlos Alonso 2 • Ex-Londoner • MSc Salamanca University, Spain • Software Engineer @ Jobandtalent • Cassandra certified developer • Datastax Cassandra MVP 2015 & 2016 • @calonso / http://mrcalonso.com
  • 3. Jobandtalent 3 • Revolutionising how people find jobs and how businesses hire employees. • Leveraging data to produce a unique job matching technology. • 10M+ users and 150K+ companies worldwide • @jobandtalentEng / http://jobandtalent.com • We are hiring!!
  • 5. The data model is the only thing you can’t change once in production.
  • 8. Consistent Hashing 8 Hash function “Carlos” 185664 1773456738847666528349 -894763734895827651234
  • 9. Replication factor How many copies (replicas) for your data 9
  • 10. Consistency Level How many replicas of your data must acknowledge? 10
  • 11. A complete read/write example 11 DriverClient Partitioner f81d4fae-… 834 • RF = 3 • CL = QUORUM • SELECT * … WHERE id = f81d4fae-…
  • 12. 12
  • 13. 13
  • 15. Data Modelling 15 • Understand your data • Decide (know) how you’ll query the data • Define column families to satisfy those queries • Implement and optimise
  • 17. Query Driven Methodology: goals 17 • Spread data evenly around the cluster • Minimise the number of partitions read • Keep partitions manageable
  • 18. Query Driven Methodology: process 18 • Entities and relationships: map to tables • Key attributes: map to primary key columns • Equality search attributes: must be at the beginning of the primary key • Inequality search attributes: become clustering columns • Ordering attributes: become clustering columns
  • 19. The Primary Key 19 PARTITION KEY + CLUSTERING COLUMN(S) CREATE TABLE . . .( fields . . . PRIMARY KEY (part_key, clust1, . . .) );
  • 20. Analysis & Validation 20 • Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions? – Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M • How much data duplication? (batches)
  • 22. Requirement: 1 22 Books can be uniquely identified and accessed by ISBN, we also need a title, genre, author and publisher. Book ISBN K Title Author Genre Publisher QDM Q1 Q1: Find books by ISBN
  • 23. • Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions? – Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – 1 X (5 - 1 - 0) + 0 < 1M • How much data duplication? 0 Analysis & Validation 23 Book ISBN K Title Author Genre Publisher Q1 Q1: Find books by ISBN
  • 24. Physical data model 24 Book ISBN K Title Author Genre Publisher Q1 Q1: Find books by ISBN CREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR ); SELECT * FROM books WHERE ISBN = ‘…’;
  • 25. Requirement 2 25 Users register into the system uniquely identified by an email and a password. We also want their full name. They will be accessed by email and password or internal unique ID. Users_by_ID KID full_name QDM Q1 Q1: Find users by ID Users_by_login_info email K password K full_name ID Q2 Q2: Find users by login info Q3: Find users by email (to guarantee uniqueness) Q3 C
  • 26. • Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions? – Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – 1 X (2 - 1 - 0) + 0 < 1M • How much data duplication? 0 Analysis & Validation 26 K Users_by_ID ID full_name Q1 Q1: Find users by ID
  • 27. Physical Data Model 27 CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR ); SELECT * FROM users_by_id WHERE ID = …; K Users_by_ID ID full_name Q1 Q1: Find users by ID
  • 28. • Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions? – Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – 1 X (4 - 1 - 0) + 0 < 1M • How much data duplication? 1 Analysis & Validation 28 Q2: Find users by login info Users_by_login_info email K password C full_name ID Q3: Find users by email (to guarantee uniqueness)
  • 29. Physical Data Model 29 CREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY KEY (email, password) ); SELECT * FROM users_by_login_info WHERE email = ‘…’ [AND password = ‘…’]; Q2: Find users by login info Users_by_login_info email K password C full_name ID Q3: Find users by email (to guarantee uniqueness)
  • 30. Physical Data Model 30 BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…); APPLY BATCH;
  • 31. Requirement 3 31 Users read books. We want to know which books has a user read and show them sorted by title and author Books_read_by_user Kuser_ID QDM Q1: Find all books a logged user has read Q1 ISBN genre publisher title author C C full_name S
  • 32. • Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions? – Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – Books X (7 - 1 - 1) + 1 < 1M => 200,000 books per user • How much data duplication? 0 Analysis & Validation 32 Q1: Find all books a logged user has read K Books_read_by_user user_ID title Q1 full_name ISBN genre publisher author C C S
  • 33. Physical Data Model 33 CREATE TABLE books_read_by_user ( user_id TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR STATIC, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR, PRIMARY KEY (user_id, title, author) ); SELECT * FROM books_read_by_user WHERE user_ID = …; Q1: Find all books a logged user has read K Books_read_by_user user_ID title Q1 full_name ISBN genre publisher author C C S
  • 34. Requirement 4 34 In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have with our site. element type user_ID K Actions_by_user QDM Q1 Q1: Find all actions a user does in a time range time C
  • 35. • Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions? – Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – Actions X (4 - 1 - 0) + 0 < 1M => 333.333 • How much data duplication? 0 Analysis & Validation 35 K Actions_by_user user_ID Q1 Q1: Find all actions a user does in a time range time element type C
  • 36. Requirement 4: Bucketing 36 time element type user_ID K Actions_by_user month K C – Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – Actions X (5 - 2 - 0) + 0 < 1M => 333.333 
 per user every <bucket_size> bucket_size = 1 year => 38 actions / h bucket_size = 1 month => 462 actions / h bucket_size = 1 week => 1984 actions / h
  • 37. • Data evenly spread? • 1 Partition per read? • Are write conflicts (overwrites) possible? • How large are partitions? – Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M – Actions X (5 - 2 - 0) + 0 < 1M => 333.333 / month • How much data duplication? 0 Analysis & Validation 37 K Actions_by_user user_ID month Q1 Q1: Find all actions a user does in a time range K time element type C
  • 38. Physical Data Model 38 CREATE TABLE actions_by_user ( user_ID TIMEUUID, month INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, month), time) ); SELECT * FROM actions_by_user WHERE user_ID = … AND month = … AND time < … AND time > …; K Actions_by_user user_ID month Q1 Q1: Find all actions a user does in a time range K time element type C
  • 39. Further validation 39 ∑sizeOf(pk) + ∑sizeOf(sc) + Nr x ∑(sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv < 200 MB – pk = Partition Key column – sc = Static column – Nr = Number of rows – rc = Regular column – clc = Clustering column – Nv = Number of values
  • 40. Next Steps 40 • Test your models against your hardware setup – cassandra-stress – http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez) • Monitor everything – DataStax OpsCenter – Graphite – Datadog – . . .