Modeling Data In Cassandra
     Conceptual Differences Versus RDBMS
    Matthew F. Dennis, DataStax // @mdennis




June 27, 2012
Cassandra Is Not Relational
get out of the relational mindset when working
  with Cassandra (or really any NoSQL DB)
Work Backwards From Queries
   Think in terms of queries, not in terms of
normalizing the data; in fact, you often want to
  denormalize (already common in the data
    warehousing world, even in RDBMS)
OK great, but how do I do that?
Well, you need to know how Cassandra models
     data (its model follows Google's Bigtable)

   research.google.com/archive/bigtable-osdi06.pdf



   Go Read It!
In Cassandra:

➔ data is organized into Keyspaces (usually one per app)

➔ each Keyspace can have multiple Column Families

➔ each Column Family can have many Rows

➔ each Row has a Row Key and a variable number of Columns

➔ each Column consists of a Name, Value and Timestamp
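The hierarchy above can be pictured as nested maps. This is only a mental model sketched in Python, not how Cassandra stores data; the keyspace name `myapp`, the column family name `users`, and the timestamp value are made up for illustration.

```python
from collections import namedtuple

# A column is a (name, value, timestamp) triple.
Column = namedtuple("Column", ["name", "value", "timestamp"])

TS = 1340755200  # arbitrary illustrative timestamp

# keyspace -> column family -> row key -> {column name: Column}
cluster = {
    "myapp": {                      # Keyspace (hypothetical name)
        "users": {                  # Column Family (hypothetical name)
            "thepaul": {            # Row Key
                "office": Column("office", "Austin", TS),
                "OS": Column("OS", "OSX", TS),
            },
            "thobbs": {             # rows need not share the same columns
                "office": Column("office", "Austin", TS),
            },
        }
    }
}


def get_column(keyspace, column_family, row_key, name):
    """Walk the hierarchy down to a single column."""
    return cluster[keyspace][column_family][row_key][name]
```

For example, `get_column("myapp", "users", "thepaul", "office").value` gives `"Austin"`, while the `thobbs` row simply has no `OS` column.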
In Cassandra, Keyspaces:

➔ are similar in concept to a “database” in some RDBMS

➔ are stored in separate directories on disk

➔ are usually one-one with applications

➔ are usually the administrative unit for things related to ops

➔ contain multiple column families
In Cassandra, In Keyspaces, Column Families:
   ➔ are similar in concept to a “table” in most RDBMS

   ➔ are stored in separate files on disk (many per CF)

   ➔ are usually approximately one-one with query type

   ➔ are usually the administrative unit for things related to your data

   ➔ can contain many (~billion* per node) rows




* for a good sized node
(you can always add nodes)
In Cassandra, In Keyspaces, In Column Families ...

 Row Key    Columns (name: value)

 thepaul    office: Austin      OS: OSX          twitter: thepaul0

 mdennis    office: UA          OS: Linux        twitter: mdennis

 thobbs     office: Austin      twitter: tylhobbs

   ➔ each line is a Row

   ➔ the first field is the Row Key, the rest are Columns

   ➔ each Column has a Column Name (e.g. office) and a Column Value (e.g. Austin)

   ➔ Rows Are Randomly Ordered
     (if using the RandomPartitioner)

   ➔ Columns Are Ordered by Name
     (by a configurable comparator)
Columns are ordered because
 doing so allows very efficient
implementations of useful and
     common operations

        (e.g. merge join)
In particular, within a row
columns with a given name can
    be located very quickly.
(ordered names => log(n) binary search)
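A minimal sketch of why ordered names make lookups cheap, using Python's `bisect` module on an in-memory list. The parallel-list layout and the `find_column` helper are assumptions for illustration, not Cassandra's storage format.

```python
from bisect import bisect_left

# One row: column names kept sorted, values stored in a parallel list.
# Sorted by plain string comparison, so "OS" sorts before "office".
names = ["OS", "office", "twitter"]
values = ["OSX", "Austin", "thepaul0"]


def find_column(name):
    """Binary search over the sorted names: O(log n) comparisons."""
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return values[i]
    return None  # no such column in this row
```

`find_column("office")` returns `"Austin"`; a name that is not present, such as `"email"`, returns `None` without scanning the whole row.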
More importantly, I can query for a
      slice between a start and end

                 Row Key

RK   ts0   ts1   ...   ...   tsM ...   ...   ...   ...   tsN ...   ...   ...   ...   ...


 start                                                                         end
Why does that matter?
Because columns within a row don’t have to be static!
    (and random disk seeks are teh evil)
The Column Name Can Be Part of Your Data

  INTC     ts0: $25.20         ts1: $25.25             ...


  AMR       ts0: $6.20          ts9: $0.26             ...


  CRDS      ts0: $1.05          ts5: $6.82             ...




                  Columns Are Ordered by Name
                   (in this case by a TimeUUID Comparator)
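A sketch of a slice query over such a row. It swaps in small integers for the TimeUUID column names to keep the example short, and the prices after ts1 are invented; `slice_columns` is a hypothetical helper, not a client API.

```python
from bisect import bisect_left, bisect_right

# Row key -> columns sorted by name; here the name is a timestamp.
ticks = {
    "INTC": [(0, 25.20), (1, 25.25), (4, 25.31), (9, 25.40)],
}


def slice_columns(row_key, start, end):
    """Return the contiguous run of columns with start <= name <= end."""
    columns = ticks[row_key]
    # (start,) sorts before any (start, value) pair.
    lo = bisect_left(columns, (start,))
    # (end, inf) sorts after any (end, value) pair with a finite value.
    hi = bisect_right(columns, (end, float("inf")))
    return columns[lo:hi]
```

Because the columns are already sorted, the result is one contiguous range: find the start, then read forward until the end.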
Turns Out That Pattern Comes Up A Lot
  ➔ stock ticks
  ➔ event logs

  ➔ ad clicks/views

  ➔ sensor records

  ➔ access/error logs

  ➔ plane/truck/person/”entity” locations

  ➔…
OK, but I can do that in SQL
Not efficiently at scale, at least not easily ...
How it Looks In a RDBMS
                    ticker   timestamp   bid   ask   ...
                    AMR      ts0         ...   ...   ...
                    ...      ...         ...   ...   ...
                    CRDS     ts0         ...   ...   ...
                    ...      ...         ...   ...   ...
Data I Care About   ...      ts0         ...   ...   ...
                    AMR      ts1         ...   ...   ...
                    ...      ...         ...   ...   ...
                    ...      ...         ...   ...   ...
                    …        ts1         ...   ...   ...
                    AMR      ts2         ...   ...   ...
                    ...      ts2         ...   ...   ...
How it Looks In a RDBMS
             ticker     timestamp   bid   ask   ...
             AMR        ts0         ...   ...   ...



                      Larger Than Your Page Size
Disk Seeks
             AMR        ts1         ...   ...   ...


                      Larger Than Your Page Size

             AMR        ts2         ...   ...   ...
             ...        ts2         ...   ...   ...
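A toy model of the problem pictured above: count how many pages must be read to fetch one ticker's rows when rows sit in insertion order versus sorted by (ticker, timestamp). The page size, the row counts and the extra ticker `ORCL` are arbitrary assumptions.

```python
PAGE_ROWS = 4  # rows per page; an arbitrary illustrative value


def pages_touched(rows, ticker):
    """Count the distinct pages that hold rows for one ticker."""
    return len({i // PAGE_ROWS
                for i, (t, _) in enumerate(rows)
                if t == ticker})


tickers = ["AMR", "CRDS", "INTC", "ORCL"]

# Insertion order: every ticker reports at each timestamp.
heap_order = [(t, ts) for ts in range(8) for t in tickers]

# The same rows, physically sorted by (ticker, timestamp).
clustered = sorted(heap_order)
```

In this model `pages_touched(heap_order, "AMR")` is 8 (one page per tick) while `pages_touched(clustered, "AMR")` is 2, and the gap widens as more tickers interleave.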
OK, but what about ...
➔ PostgreSQL CLUSTER command (CLUSTER ... USING)?

➔ MySQL [InnoDB] clustered indexes?

➔ Oracle Index Organized Tables?

➔ SQLServer Clustered Index?

    Meh ...
The on-disk management of that
        clustering results in tons of IO …

In the case of PostgreSQL:

➔ clustering is a one time operation
    (implies you must periodically rewrite the entire table)

➔ new data is *not* written in clustered order
    (which is often the data you care most about)
OK, so just partition the tables ...
Not a bad idea, except in MySQL there is a limit of
 1024 partitions and generally fewer if using NDB

 (you should probably still do it if using MySQL though)

  http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
OK fine, I agree storing data that is queried
       together on disk together is a good thing but
          what's that have to do with modeling in
                        Cassandra?

        Seek To Here


 RK    ts0   ts1   ...   ...   tsM ...   ...   ...   ...   tsN ...   ...   ...   ...   ...



                                  Read Precisely My Data *



* more on some caveats later
Well, that's what is meant by “work backwards
from your queries” or “think in terms of queries”

(NB: this concept, in general, applies to RDBMS
 at scale as well; it is not specific to Cassandra)
An Example From Fraud Detection
  To calculate risk it is common to need to know all the
 emails, destinations, origins, devices, locations, phone
numbers, et cetera ever used for the account in question
In a normalized model that usually translates to a
          table for each type of entity being tracked

                id          name         ...           id          device         ...
                1           guy          ...           1000        0xdead         ...
                2           gal          ...           2000        0xb33f         ...
                ...         ...          ...           ...         ...            ...


id       dest         ...          id          email         ...            id          origin    ...
15       USA          ...          100         guy@          ...            150         USA       ...
25       Finland      ...          200         gal@          ...            250         Nigeria   ...
...      ...          ...          ...         ...           ...            ...         ...       ...
The problem is that at scale that also means
        a disk seek for each one …
    (even for perfect IOT et al if across multiple tables)




➔ Previous emails? That's a seek …

➔ Previous devices? That's a seek …

➔ Previous destinations? That's a seek ...
But In Cassandra I Store The Data I Query
           Together On Disk Together
               (remember, column names need not be static)


  Data I Care About

acctY    ...          ...          ...       ...        ...      ...         ...
acctX    dest21       dev2         dev7        email3   email9   orig4       ...
acctZ    ...          ...          ...       ...        ...      ...         ...



                            email:cassandra@mailinator.com = dateEmailWasLastUsed




                            Column Name                                  Column Value
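A sketch of the wide-row layout above: one row per account, with the entity type as a prefix of the column name and the last-used date as the value. Every name and date here except `email:cassandra@mailinator.com` is invented, and `columns_with_prefix` is a hypothetical helper.

```python
from bisect import bisect_left

# Columns of one account row, sorted by column name.
account_row = [
    ("dest:FI", "2012-03-01"),
    ("device:0xdead", "2012-05-17"),
    ("email:cassandra@mailinator.com", "2012-06-20"),
    ("email:guy@example.com", "2012-01-09"),
    ("origin:US", "2012-06-20"),
]


def columns_with_prefix(row, prefix):
    """Slice out every column whose name starts with the prefix."""
    lo = bisect_left(row, (prefix,))
    # "\uffff" sorts after any character used in these names.
    hi = bisect_left(row, (prefix + "\uffff",))
    return row[lo:hi]
```

Reading the whole row answers "everything ever used by this account" in one sequential read, and `columns_with_prefix(account_row, "email:")` narrows that to the emails alone.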
Don't treat Cassandra (or any DB) as a black box
  ➔ Understand how your DBs (and data structures) work

  ➔ Understand the building blocks they provide

  ➔ Understand the work complexity (“big O”) of queries

  ➔ For data sets > memory, goal is to minimize seeks *




* on a related note, SSDs are awesome
Q?

Potential of AI (Generative AI) in Business: Learnings and Insights
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 

DZone Cassandra Data Modeling Webinar

  • 1. Modeling Data In Cassandra Conceptual Differences Versus RDBMS Matthew F. Dennis, DataStax // @mdennis June 27, 2012
  • 2. Cassandra Is Not Relational get out of the relational mindset when working with Cassandra (or really any NoSQL DB)
  • 3. Work Backwards From Queries Think in terms of queries, not in terms of normalizing the data; in fact, you often want to denormalize (already common in the data warehousing world, even in RDBMS)
  • 4. OK great, but how do I do that? Well, you need to know how Cassandra models data (e.g. Google Big Table) research.google.com/archive/bigtable-osdi06.pdf Go Read It!
  • 5. In Cassandra: ➔ data is organized into Keyspaces (usually one per app) ➔ each Keyspace can have multiple Column Families ➔ each Column Family can have many Rows ➔ each Row has a Row Key and a variable number of Columns ➔ each Column consists of a Name, Value and Timestamp
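The hierarchy on this slide can be pictured as nested maps. The following is only a minimal in-memory sketch, not Cassandra's API: the `Column` class, the `insert` helper, the `users` column family name and the integer timestamps are all assumptions made for illustration. It also shows the timestamp being used to keep the newest write for a column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # write time (hypothetical integer clock)


def insert(keyspace, column_family, row_key, name, value, timestamp):
    """Store one column; an existing column is replaced only by a newer write."""
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    existing = row.get(name)
    if existing is None or timestamp >= existing.timestamp:
        row[name] = Column(name, value, timestamp)


# keyspace -> column family -> row key -> {column name: Column}
keyspace = {}
insert(keyspace, "users", "thepaul", "office", "Austin", 1)
insert(keyspace, "users", "thepaul", "OS", "OSX", 1)
insert(keyspace, "users", "thobbs", "office", "Austin", 1)
insert(keyspace, "users", "thepaul", "OS", "Linux", 2)   # newer write wins
insert(keyspace, "users", "thepaul", "office", "UA", 0)  # older write is ignored
```

Note that the two rows do not share a fixed set of columns: `thobbs` simply has no `OS` column.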
  • 6. In Cassandra, Keyspaces: ➔ are similar in concept to a “database” in some RDBMSs ➔ are stored in separate directories on disk ➔ are usually one-to-one with applications ➔ are usually the administrative unit for things related to ops ➔ contain multiple column families
  • 7. In Cassandra, In Keyspaces, Column Families: ➔ are similar in concept to a “table” in most RDBMSs ➔ are stored in separate files on disk (many per CF) ➔ are usually approximately one-to-one with query type ➔ are usually the administrative unit for things related to your data ➔ can contain many (~billion* per node) rows * for a good-sized node (you can always add nodes)
  • 8. In Cassandra, In Keyspaces, In Column Families ...
  • 9. Rows thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs Row Keys
  • 10. thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs Columns
  • 11. Column Names thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs
  • 12. Column Values thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs
  • 13. thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs Rows Are Randomly Ordered (if using the RandomPartitioner)
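RandomPartitioner places rows according to an MD5 hash of the row key, which is why the row order looks random with respect to the keys themselves. The sketch below is a simplification under that assumption: the `token` helper is hypothetical, and the exact way Cassandra turns the digest into a token is not reproduced here.

```python
import hashlib


def token(row_key: str) -> int:
    """Map a row key to a 128-bit integer by hashing it with MD5 (simplified)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


row_keys = ["thepaul", "mdennis", "thobbs"]

# Rows are ordered by token, not by the row key itself.
by_token = sorted(row_keys, key=token)
```

The ordering is deterministic (the same key always hashes to the same token) but bears no relation to the alphabetical order of the keys, so range scans over row keys are not meaningful with this partitioner.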
  • 14. thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs Columns Are Ordered by Name (by a configurable comparator)
  • 15. Columns are ordered because doing so allows very efficient implementations of useful and common operations (e.g. merge join)
  • 16. In particular, within a row, columns with a given name can be located very quickly. (ordered names => log(n) binary search)
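As a rough illustration of why ordered names give a log(n) lookup, here is a minimal sketch of a row that keeps its column names sorted and finds a column by binary search. The `SortedRow` class is hypothetical, and plain string comparison stands in for Cassandra's configurable comparator.

```python
import bisect


class SortedRow:
    """A row whose column names are kept in sorted order."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = []  # values, parallel to _names

    def put(self, name, value):
        i = bisect.bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            self._values[i] = value          # overwrite existing column
        else:
            self._names.insert(i, name)      # insert while keeping order
            self._values.insert(i, value)

    def get(self, name):
        # Binary search: O(log n) comparisons in the number of columns.
        i = bisect.bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            return self._values[i]
        return None

    def names(self):
        return list(self._names)


row = SortedRow()
row.put("twitter", "thepaul0")
row.put("office", "Austin")
row.put("OS", "OSX")
```

Whatever order the columns are written in, `row.names()` comes back sorted, and `row.get("office")` never has to scan the whole row.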
  • 17. More importantly, I can query for a slice between a start and end [diagram: a row with Row Key RK and columns ts0 ts1 ... tsM ... tsN ..., with "start" and "end" marking the slice]
  • 18. Why does that matter? Because the columns within a row don’t have to be static! (and random disk seeks are teh evil)
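A slice over name-ordered columns amounts to two binary searches followed by one contiguous read. The sketch below assumes integer timestamps as column names; the `column_slice` helper and the sample data are hypothetical.

```python
import bisect


def column_slice(names, values, start, end):
    """Return (name, value) pairs with start <= name <= end.

    `names` must already be sorted; `values` is parallel to `names`.
    """
    lo = bisect.bisect_left(names, start)   # first column >= start
    hi = bisect.bisect_right(names, end)    # one past the last column <= end
    return list(zip(names[lo:hi], values[lo:hi]))


# One row: column names are timestamps, kept in sorted order.
names = [0, 10, 20, 30, 40, 50]
values = ["a", "b", "c", "d", "e", "f"]

result = column_slice(names, values, 15, 40)
```

The start and end bounds do not need to be existing column names; the slice simply covers whatever columns fall between them.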
  • 19. The Column Name Can Be Part of Your Data INTC ts0: $25.20 ts1: $25.25 ... AMR ts0: $6.20 ts9: $0.26 ... CRDS ts0: $1.05 ts5: $6.82 ... Columns Are Ordered by Name (in this case by a TimeUUID Comparator)
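To make the "column name is part of your data" idea concrete, here is a minimal sketch of one row per ticker whose column names are timestamps. Plain integers stand in for TimeUUIDs, and `record_tick` and `latest` are hypothetical helpers, not Cassandra calls.

```python
import bisect
from collections import defaultdict

# row key (ticker) -> list of (timestamp, price), kept sorted by timestamp
ticks = defaultdict(list)


def record_tick(ticker, timestamp, price):
    """Add a column named by its timestamp, keeping the row ordered."""
    bisect.insort(ticks[ticker], (timestamp, price))


def latest(ticker, n):
    """Return the n most recent ticks for a ticker, oldest first."""
    return ticks[ticker][-n:]


# Writes may arrive out of order; the row stays ordered by column name.
record_tick("AMR", 9, 0.26)
record_tick("AMR", 0, 6.20)
record_tick("INTC", 0, 25.20)
record_tick("INTC", 1, 25.25)
```

Each ticker's row grows one column per tick, so the rows have different numbers of columns and different column names.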
  • 20. Turns Out That Pattern Comes Up A Lot ➔ stock ticks ➔ event logs ➔ ad clicks/views ➔ sensor records ➔ access/error logs ➔ plane/truck/person/”entity” locations ➔…
  • 21. OK, but I can do that in SQL. Not efficiently at scale, at least not easily ...
  • 22. How it Looks In a RDBMS [table: columns ticker, timestamp, bid, ask, ...; the AMR rows (ts0, ts1, ts2), labeled "Data I Care About", are interleaved with rows for other tickers such as CRDS]
  • 23. How it Looks In a RDBMS [same table: the AMR rows (ts0, ts1, ts2) are separated by gaps labeled "Larger Than Your Page Size", with "Disk Seeks" between them]
  • 24. OK, but what about ... ➔ PostgreSQL Cluster Command? ➔ MySQL Cluster Indexes? ➔ Oracle Index Organized Tables? ➔ SQLServer Clustered Index?
  • 25. OK, but what about ... ➔ PostgreSQL Cluster Using? ➔ MySQL [InnoDB] Cluster Indexes? ➔ Oracle Index Organized Table? ➔ SQLServer Clustered Index? Meh ...
  • 26. The on-disk management of that clustering results in tons of IO … In the case of PostgreSQL: ➔ clustering is a one-time operation (implies you must periodically rewrite the entire table) ➔ new data is *not* written in clustered order (which is often the data you care most about)
  • 27. OK, so just partition the tables ...
  • 28. Not a bad idea, except in MySQL there is a limit of 1024 partitions and generally less if using NDB (you should probably still do it if using MySQL though) http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
  • 29. OK fine, I agree storing data that is queried together on disk together is a good thing, but what's that have to do with modeling in Cassandra? [diagram: row RK with columns ts0 ts1 ... tsM ... tsN ...; "Seek To Here" marks where the read begins and "Read Precisely My Data *" marks the columns read] * more on some caveats later
  • 30. Well, that's what is meant by “work backwards from your queries” or “think in terms of queries” (NB: this concept, in general, applies to RDBMS at scale as well; it is not specific to Cassandra)
  • 31. An Example From Fraud Detection To calculate risk it is common to need to know all the emails, destinations, origins, devices, locations, phone numbers, et cetera ever used for the account in question
  • 32. In a normalized model that usually translates to a table for each type of entity being tracked [tables: (id, name, ...): 1 guy, 2 gal; (id, device, ...): 1000 0xdead, 2000 0xb33f; (id, dest, ...): 15 USA, 25 Finland; (id, email, ...): 100 guy@, 200 gal@; (id, origin, ...): 150 USA, 250 Nigeria]
  • 33. The problem is that at scale that also means a disk seek for each one … (even for perfect IOT et al if across multiple tables) ➔ Previous emails? That's a seek … ➔ Previous devices? That's a seek … ➔ Previous destinations? That's a seek ...
  • 34. But In Cassandra I Store The Data I Query Together On Disk Together (remember, column names need not be static) [diagram: rows acctY, acctX, acctZ; row acctX, labeled "Data I Care About", holds columns dest21 dev2 dev7 email3 email9 orig4 ...] Column Name = Column Value, e.g. email:cassandra@mailinator.com = dateEmailWasLastUsed
  • 35. Don't treat Cassandra (or any DB) as a black box ➔ Understand how your DBs (and data structures) work ➔ Understand the building blocks they provide ➔ Understand the work complexity (“big O”) of queries ➔ For data sets > memory, goal is to minimize seeks * (* on a related note, SSDs are awesome)
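A minimal sketch of the single-row-per-account layout follows, assuming column names of the form `kind:value` (as in the `email:...` example) with the last-used date as the column value. The `AccountRow` class, the kind prefixes and the dates are hypothetical; the point is only that everything known about an account sits in one ordered row.

```python
import bisect


class AccountRow:
    """One row per account; column name = "kind:value", column value = last-used date."""

    def __init__(self):
        self._names = []       # sorted column names
        self._last_used = {}   # column name -> date last used

    def touch(self, kind, value, when):
        name = f"{kind}:{value}"
        if name not in self._last_used:
            bisect.insort(self._names, name)
        self._last_used[name] = when

    def by_kind(self, kind):
        """Slice out all columns of one kind; they are contiguous in name order."""
        prefix = kind + ":"
        lo = bisect.bisect_left(self._names, prefix)
        found = []
        for name in self._names[lo:]:
            if not name.startswith(prefix):
                break
            found.append((name[len(prefix):], self._last_used[name]))
        return found

    def everything(self):
        """Read the whole row: all emails, devices, destinations, ... in one pass."""
        return [(name, self._last_used[name]) for name in self._names]


acct = AccountRow()
acct.touch("email", "cassandra@mailinator.com", "2012-06-01")
acct.touch("dev", "0xdead", "2012-05-20")
acct.touch("email", "guy@example.com", "2012-06-27")
acct.touch("dest", "USA", "2012-06-02")
```

In this layout `acct.everything()` corresponds to reading one row, rather than visiting a separate table per entity type.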
  • 36. Q? Modeling Data In Cassandra Conceptual Differences Versus RDBMS Matthew F. Dennis, DataStax // @mdennis