Rethinking Topology in Cassandra


ApacheCon Europe
November 7, 2012

Eric Evans
eevans@acunu.com
@jericevans

DHT 101



DHT 101
partitioning

[Ring diagram: the keyspace drawn as a ring, running from A around to Z]

The keyspace, a namespace encompassing all possible keys.

DHT 101
partitioning

[Ring diagram: nodes A, B, C, ..., Y, Z placed evenly around the ring]

The namespace is divided into N partitions (where N is the number of nodes). Partitions are mapped to nodes and placed evenly throughout the namespace.

DHT 101
partitioning

[Ring diagram: a record with Key = Aaa shown at the point on the ring where it sorts]

A record, stored by key, is positioned on the next node (working clockwise) from where it sorts in the namespace.

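As a rough illustration of that placement rule, here is a minimal sketch in Python. The ring, node names, and the hash-based token function are all hypothetical (Cassandra's own partitioners differ; with an order-preserving partitioner the key itself supplies the position).

import hashlib
from bisect import bisect_right

# Hypothetical ring: node name -> the token (position) it owns.
RING = {"A": 10, "B": 30, "C": 50, "Y": 70, "Z": 90}

def token(key: str) -> int:
    # Illustrative only: map a key onto the 0..99 namespace used by RING above.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

def find_node(key: str) -> str:
    """Return the next node clockwise from where the key falls on the ring."""
    positions = sorted(RING.values())
    owners = {pos: name for name, pos in RING.items()}
    i = bisect_right(positions, token(key)) % len(positions)  # wrap past Z back to A
    return owners[positions[i]]

print(find_node("Aaa"))  # the record lands on whichever node follows its position
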
DHT 101
replica placement

[Ring diagram: Key = Aaa with additional copies placed on subsequent nodes around the ring]

Additional copies (replicas) are stored on other nodes. Commonly the next N-1 nodes, but anything deterministic will work.

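A sketch of one such deterministic rule, the "next N-1 nodes" just mentioned, reusing the hypothetical RING, token, and bisect_right from the sketch above:

def replicas(key: str, n: int = 3) -> list[str]:
    """Primary node plus the next n-1 nodes, walking clockwise around the ring."""
    positions = sorted(RING.values())
    owners = {pos: name for name, pos in RING.items()}
    start = bisect_right(positions, token(key)) % len(positions)
    return [owners[positions[(start + i) % len(positions)]] for i in range(n)]

print(replicas("Aaa"))  # the primary plus the two nodes that follow it
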
DHT 101
consistency

Consistency
Availability
Partition tolerance

With multiple copies comes a set of trade-offs, commonly articulated using the CAP theorem: at any given point, we can only guarantee 2 of Consistency, Availability, and Partition tolerance.

DHT 101
scenario: consistency level = one

[Ring diagram: a write (W) acknowledged by member A; the state of the other two members (?) is unknown]

Writing at consistency level ONE provides very high availability: only one of the 3 member nodes needs to be up for the write to succeed.

DHT 101
scenario: consistency level = all

[Ring diagram: a read (R) that must reach member A and the other two members (?)]

If strong consistency is required, reads at consistency ALL can be used with writes performed at ONE. The trade-off is availability: all 3 member nodes must be up, else the read fails.

DHT 101
scenario: quorum write

[Ring diagram: a write (W) acknowledged by members A and B; the third member (?) may be down. R+W > N]

Using QUORUM consistency, we only require floor((N/2)+1) nodes.

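To make that arithmetic concrete (a quick sketch; N here is the replication factor):

def quorum(n: int) -> int:
    # floor(n/2) + 1 replicas must acknowledge
    return n // 2 + 1

n = 3
r = w = quorum(n)   # 2 of 3 replicas for both reads and writes
assert r + w > n    # 2 + 2 = 4 > 3, so every read quorum overlaps every write quorum
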
DHT 101
scenario: quorum read

[Ring diagram: a read (R) answered by members B and C; the third member (?) may be down. R+W > N]

Using QUORUM consistency, we only require floor((N/2)+1) nodes.

Awesome, yes?




Well...




Problem:
Poor load distribution

Distributing Load

[Ring diagram: nodes Z, A, B, C, ..., M, ..., Y around the ring]

B and C hold replicas of A.

Distributing Load

[Same ring diagram]

A and B hold replicas of Z.

Distributing Load

[Same ring diagram]

Z and A hold replicas of Y.

Distributing Load

[Same ring diagram]

Disaster strikes!

Distributing Load

[Ring diagram: node A has failed]

Sets [Y,Z,A], [Z,A,B], and [A,B,C] all suffer the loss of A; this results in extra load on the neighboring nodes.

Distributing Load

[Ring diagram: replacement node A1 joins at A's position]

Solution: Replace/repair the down node.

Distributing Load

[Ring diagram: the neighboring nodes stream data to the replacement node]

Neighboring nodes are needed to stream the missing data to A; this results in even more load on the neighboring nodes.

Problem:
Poor data distribution

Distributing Data

[Ring diagram: four nodes A, B, C, D spaced evenly]

Ideal distribution of keyspace.

Distributing Data

[Ring diagram: a fifth node E added, bisecting one of the four partitions]

Bootstrapping a node bisects one partition; the distribution is no longer ideal.

Distributing Data

[Ring diagram: existing nodes repositioned to rebalance the ring]

Moving existing nodes means moving corresponding data; not ideal.

Distributing Data

[Ring diagram: the cluster doubled to eight nodes, A through H, bisecting all ranges]

Frequently cited alternative: double the size of your cluster, bisecting all ranges.

Virtual Nodes



In a nutshell...

[Diagram: many virtual nodes on the ring, each mapped to one of three physical hosts]

Basically: “nodes” on the ring are virtual, and many of them are mapped to each “real” node (host).

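A minimal sketch of that mapping, with hypothetical host names and a small T for readability (the examples later in this deck use num_tokens: 256):

import random

T = 8                                  # tokens per host; illustrative value only
HOSTS = ["host1", "host2", "host3"]

# Many virtual nodes (tokens) per physical host, instead of one range per host.
ring = {random.getrandbits(64): host for host in HOSTS for _ in range(T)}

def owner(key_token: int) -> str:
    """The virtual node at or after the key's token owns it (wrapping at the end)."""
    candidates = [tok for tok in ring if tok >= key_token]
    return ring[min(candidates) if candidates else min(ring)]
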
Benefits

• Operationally simpler (no token management)
• Better distribution of load
• Concurrent streaming involving all hosts
• Smaller partitions mean greater reliability
• Supports heterogeneous hardware

Strategies

• Automatic sharding
• Fixed partition assignment
• Random token assignment

Strategy
Automatic Sharding

• Partitions are split when data exceeds a threshold
• Newly created partitions are relocated to a host with lower data load
• Similar to sharding performed by Bigtable, or Mongo auto-sharding

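A toy sketch of this strategy (hypothetical Partition and Host types; not Bigtable's or Mongo's actual code): an oversized partition is split in two, and one half is relocated to the host carrying the least data.

from dataclasses import dataclass, field

THRESHOLD = 4   # illustrative: split once a partition holds more than 4 keys

@dataclass
class Partition:
    keys: list

@dataclass
class Host:
    name: str
    partitions: list = field(default_factory=list)
    def load(self) -> int:
        return sum(len(p.keys) for p in self.partitions)

def maybe_split(part: Partition, hosts: list) -> None:
    """Split an oversized partition and hand one half to the least-loaded host."""
    if len(part.keys) <= THRESHOLD:
        return
    part.keys.sort()
    mid = len(part.keys) // 2
    new = Partition(part.keys[mid:])
    del part.keys[mid:]
    min(hosts, key=Host.load).partitions.append(new)
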



Strategy
Fixed Partition Assignment

• Namespace divided into Q evenly-sized partitions
• Q/N partitions assigned per host (where N is the number of hosts)
• Joining hosts “steal” partitions evenly from existing hosts
• Used by Dynamo and Voldemort (described in Dynamo paper as “strategy 3”)

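A sketch of the bookkeeping involved (a hypothetical helper, not Dynamo's or Voldemort's code): Q stays fixed, and a joining host takes partitions from the others until every host holds roughly Q/N.

Q = 12   # illustrative; in practice Q is much larger than any expected host count

def join(assignment: dict[str, list[int]], new_host: str) -> None:
    """Give a joining host its share by stealing partitions evenly from existing hosts."""
    assignment[new_host] = []
    target = Q // len(assignment)                       # roughly Q/N partitions per host
    donors = sorted(assignment, key=lambda h: -len(assignment[h]))
    for host in donors:
        while len(assignment[host]) > target and len(assignment[new_host]) < target:
            assignment[new_host].append(assignment[host].pop())

# Example: three hosts with four partitions each; a fourth joins and steals three.
assignment = {"h1": [0, 1, 2, 3], "h2": [4, 5, 6, 7], "h3": [8, 9, 10, 11]}
join(assignment, "h4")   # now every host holds 3 of the 12 partitions
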


Strategy
Random Token Assignment

• Each host assigned T random tokens
• T random tokens generated for joining hosts; new tokens divide existing ranges
• Similar to libketama; identical to Classic Cassandra when T=1

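A sketch of this strategy, continuing the vnode example earlier (illustrative only): a joining host simply generates T more random tokens, each of which bisects whatever existing range it lands in.

import random

def add_host(ring: dict[int, str], new_host: str, t: int) -> None:
    """Assign t random tokens to a joining host; each token splits an existing range."""
    for _ in range(t):
        ring[random.getrandbits(64)] = new_host

# With t=1 per host this degenerates to classic, one-token-per-node Cassandra.
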



Considerations

1. Number of partitions
2. Partition size
3. How 1 changes with more nodes and data
4. How 2 changes with more nodes and data

Evaluating

Strategy        No. Partitions   Partition size
Random          O(N)             O(B/N)
Fixed           O(1)             O(B)
Auto-sharding   O(B)             O(1)

B ~ total data size, N ~ number of hosts

Evaluating

• Automatic sharding
  • Partition size constant (great)
  • Number of partitions scales linearly with data size (bad)
• Fixed partition assignment
• Random token assignment

Evaluating

• Automatic sharding
• Fixed partition assignment
  • Number of partitions is constant (good)
  • Partition size scales linearly with data size (bad)
  • Higher operational complexity (bad)
• Random token assignment

Evaluating

• Automatic sharding
• Fixed partition assignment
• Random token assignment
  • Number of partitions scales linearly with number of hosts (ok)
  • Partition size increases with more data; decreases with more hosts (good)

Evaluating

• Automatic sharding
• Fixed partition assignment
• Random token assignment

Cassandra



Configuration
conf/cassandra.yaml

# Comma separated list of tokens,
# (new installs only).
initial_token:<token>,<token>,<token>

or

# Number of tokens to generate.
num_tokens: 256

Two params control how tokens are assigned. The initial_token param now optionally accepts a csv list, or (preferably) you can assign a numeric value to num_tokens.

Configuration
nodetool info

Token           : (invoke with -T/--tokens to see all 256 tokens)
ID              : 64090651-6034-41d5-bfc6-ddd24957f164
Gossip active   : true
Thrift active   : true
Load            : 92.69 KB
Generation No   : 1351030018
Uptime (seconds): 45
Heap Memory (MB): 95.16 / 1956.00
Data Center     : datacenter1
Rack            : rack1
Exceptions      : 0
Key Cache       : size 240 (bytes), capacity 101711872 (bytes ...
Row Cache       : size 0 (bytes), capacity 0 (bytes), 0 hits, ...

To keep the output readable, nodetool info no longer displays tokens (if there are more than one) unless the -T/--tokens argument is passed.

Configuration
nodetool ring

Datacenter: datacenter1
==========
Replicas: 2

Address     Rack    Status State    Load       Owns     Token
                                                        9022770486425350384
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -9182469192098976078
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -9054823614314102214
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -8970752544645156769
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -8927190060345427739
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -8880475677109843259
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -8817876497520861779
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -8810512134942064901
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -8661764562509480261
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -8641550925069186492
127.0.0.1   rack1   Up     Normal   97.24 KB   66.03%   -8636224350654790732
...

nodetool ring is still there, but the output is significantly more verbose, and it is less useful as the go-to command.

Configuration
nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address   Load    Tokens Owns   Host ID                               Rack
UN 10.0.0.1 97.2 KB 256     66.0% 64090651-6034-41d5-bfc6-ddd24957f164   rack1
UN 10.0.0.2 92.7 KB 256     66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c   rack1
UN 10.0.0.3 92.6 KB 256     67.7% e4eef159-cb77-4627-84c4-14efbc868082   rack1

The new go-to command is nodetool status. Of note: since it is no longer practical to name a host by its token (because it can have many), each host now has a unique ID. Note also the per-node token count.

Migration

[Ring diagram: an existing three-node cluster, A, B, C]

Migration
edit conf/cassandra.yaml and restart

# Number of tokens to generate.
num_tokens: 256

Step 1: Set num_tokens in cassandra.yaml, and restart the node.

Migration
convert to T contiguous tokens in existing ranges

[Ring diagram: each node's single range split into many small contiguous tokens, still owned by the same node]

This will cause the existing range to be split into T contiguous tokens. This results in no change to placement.

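A sketch of the arithmetic behind that conversion (illustrative only, not Cassandra's code): the node's single range (start, end] is cut into T adjacent slices, so the node still owns exactly the same span of the ring.

def split_range(start: int, end: int, t: int) -> list[int]:
    """Return t token boundaries dividing (start, end] into contiguous slices."""
    width = (end - start) // t
    tokens = [start + width * i for i in range(1, t)]
    return tokens + [end]            # the last slice absorbs any rounding remainder

# Example: a node owning (0, 100] converted to 4 contiguous tokens.
print(split_range(0, 100, 4))        # -> [25, 50, 75, 100]; placement is unchanged
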
Migration
shuffle

[Ring diagram: the same ring during shuffle; the small ranges are exchanged among the hosts]

Step 2: Initialize a shuffle operation. Nodes randomly exchange ranges.

Shuffle

• Range transfers are queued on each host
• Hosts initiate transfer of ranges to self
• Pay attention to the logs!

Shuffle
bin/shuffle

Usage: shuffle [options] <sub-command>

Sub-commands:
 create               Initialize a new shuffle operation
 ls                   List pending relocations
 clear                Clear pending relocations
 en[able]             Enable shuffling
 dis[able]            Disable shuffling

Options:
 -dc, --only-dc               Apply only to named DC (create only)
 -tp, --thrift-port           Thrift port number (Default: 9160)
 -p,  --port                  JMX port number (Default: 7199)
 -tf, --thrift-framed         Enable framed transport for Thrift (Default: false)
 -en, --and-enable            Immediately enable shuffling (create only)
 -H,  --help                  Print help information
 -h,  --host                  JMX hostname or IP address (Default: localhost)
 -th, --thrift-host           Thrift hostname or IP address (Default: JMX host)

Performance



removenode

[Bar chart: removenode, Acunu Reflex / Cassandra 1.2 vs. Cassandra 1.1; y-axis 0 to 400]

17 node cluster of EC2 m1.large instances, 460M rows.

bootstrap

[Bar chart: bootstrap, Acunu Reflex / Cassandra 1.2 vs. Cassandra 1.1; y-axis 0 to 500]

17 node cluster of EC2 m1.large instances, 460M rows.

The End

• Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels. “Dynamo: Amazon’s Highly Available Key-value Store.” Web.
• Low, Richard. “Improving Cassandra's uptime with virtual nodes.” Web.
• Overton, Sam. “Virtual Nodes Strategies.” Web.
• Overton, Sam. “Virtual Nodes: Performance Results.” Web.
• Jones, Richard. “libketama - a consistent hashing algo for memcache clients.” Web.
