Kai – An Open Source Implementation of Amazon's Dynamo

takemaru

08.6.17
Outline

    Amazon's Dynamo
         Motivation
         Features
         Algorithms

    Kai
         Build and Run
         Internals
         Roadmap
Dynamo: Motivation

    Largest e-commerce site
         75K queries/sec (my estimate)
              500 req/sec * 150 queries/req
         O(10M) users and many items

    Why not RDBMS?
         Not easy to scale out or load-balance
         Many components only need primary key access

    Databases tailored just for Amazon are required
         Dynamo for primary key access
         SimpleDB for complex queries
         S3 for large files

    Dynamo is used for
         Shopping carts, customer preferences, session management, sales rank, and
          product catalogs
Dynamo: Features

    Key-value store
         Distributed hash table
    High scalability
         No master, peer-to-peer
         Large scale cluster, maybe O(1K) nodes
    Fault tolerant
         Even if an entire data center fails
         Meets latency requirements even in that case

    [Figure: a client issues get("cart/joe"); the value joe is replicated on three nodes]
Dynamo: Features, cont'd

    Service Level Agreements
         < 300 ms for 99.9% of queries
         On average, 15 ms for reads and 30 ms for writes
    High availability
         No locks, always writable
    Eventually consistent
         Replicas are loosely synchronized
         Inconsistencies are resolved by clients

    Tradeoff between availability and consistency
    -  RDBMS chooses consistency
    -  Dynamo prefers availability

    [Figure: a client issues get("cart/joe") against three loosely synchronized replicas of joe]
Dynamo: Overview

    Dynamo cluster (instance)
         Consists of equivalent nodes
         Has N replicas of each key

    Dynamo APIs
         get(key)
         put(key, value, context)
              Context is a kind of version, like cas of memcache
         delete is not described in the paper (though presumably defined)

    Requirements for clients
         Don't need to know ALL nodes, unlike memcache clients
              Requests can be sent to any node

    [Figure: a Dynamo cluster with N = 3; keys joe and bob are each replicated on three nodes]
Dynamo: Partitioning

    Consistent Hashing
         Nodes and keys are positioned at their hash values on a ring of size 2^128
              MD5 (128 bits)
         Each key is stored on the N nodes that follow it on the ring

    [Figure: hash ring (N = 3); hash(cart/joe) falls between hash(node2) and hash(node3),
     so joe is stored on the following three nodes]
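To make the lookup concrete, here is a minimal Erlang sketch of the ring lookup (illustrative only, not Kai's kai_hash; the module and function names are assumptions, and it uses the 128-bit MD5 space from the slide):

    -module(ring_sketch).
    -export([find_nodes/3]).

    %% Position of a term on the 2^128 ring, via MD5.
    hash(Term) ->
        <<Int:128/integer>> = erlang:md5(term_to_binary(Term)),
        Int.

    %% The N nodes that follow Key clockwise on the ring (N =< # of nodes).
    find_nodes(Key, Nodes, N) ->
        KeyHash = hash(Key),
        Ring = lists:sort([{hash(Node), Node} || Node <- Nodes]),
        {Before, After} = lists:partition(fun({H, _}) -> H =< KeyHash end, Ring),
        [Node || {_, Node} <- lists:sublist(After ++ Before, N)].

For example, find_nodes("cart/joe", [node1, node2, node3, node4, node5, node6], 3) returns the three nodes clockwise from hash("cart/joe"). Virtual nodes (two slides ahead) amount to inserting each physical node at several ring positions.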
Dynamo: Partitioning, cont'd

    Advantages
         Only a small number of keys are remapped when membership changes
              Those between the down node and its Nth predecessor

    [Figure: hash ring with node2 down; only the range between node2 and its
     Nth predecessor is remapped]
Dynamo: Partitioning, cont'd

    Physical placement of replicas
         Each key is replicated across multiple data centers
         The arrangement scheme has not been revealed
              Netmask is helpful, I guess
              Impact on replica distribution is unknown

    Advantages
         No data outage on data center failures

    [Figure: hash ring spanning data centers A and B; joe is replicated in multiple
     data centers]
Dynamo: Partitioning, cont'd

    Virtual nodes
         Multiple positions on the ring are taken by a single physical node
              O(100) virtual nodes per physical node

    Advantages
         Keys are statistically more uniformly distributed
         Remapped keys are evenly dispersed across nodes
         # of virtual nodes can be determined by the capacity of each physical node

    [Figure: hash ring with two virtual nodes per physical node, e.g. hash(node3-1)
     alongside hash(node3)]
Dynamo: Partitioning, cont'd

    Buckets
         The hash ring is equally divided into buckets
              There should be more buckets than virtual nodes
         Keys in the same bucket are mapped to the same nodes

    Advantages
         Keys are easily synchronized bucket by bucket
              For the Merkle trees discussed later

    [Figure: hash ring divided into 8 buckets; bob and joe fall in the same bucket]
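Since the buckets slice the ring evenly, mapping a key to its bucket is a single integer division, as in this sketch (same illustrative assumptions as above; the # of buckets is assumed to be a power of two so it divides 2^128 evenly):

    %% Map a key's 128-bit hash to one of NumBuckets equal slices of the ring.
    bucket_for(Key, NumBuckets) ->
        <<KeyHash:128/integer>> = erlang:md5(term_to_binary(Key)),
        KeyHash div ((1 bsl 128) div NumBuckets).

Keys that land in the same slice share the same N successor nodes, which is what makes bucket-by-bucket synchronization possible.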
Dynamo: Membership

    Gossip-based protocol
         Spreads membership like a rumor
              Membership contains the node list and change history
         Exchanges membership with a node chosen at random every second
         Updates membership if a more recent one is received

    Advantages
         Robust; no one can prevent a rumor from spreading
         Exponentially rapid spread

    [Figure: a node detects that node2 is down, and the membership change spreads
     from node to node as gossip]
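One gossip round might look like the following sketch (not Kai's kai_network; the membership representation, with a logical clock per entry so the more recent view wins, is an assumption):

    %% Membership is a list of {Node, Clock, Status}; the higher Clock wins on merge.
    merge(Mine, Theirs) ->
        Merged = lists:foldl(
            fun({Node, Clock, Status}, Acc) ->
                case orddict:find(Node, Acc) of
                    {ok, {C, _}} when C >= Clock -> Acc;
                    _ -> orddict:store(Node, {Clock, Status}, Acc)
                end
            end,
            orddict:new(), Mine ++ Theirs),
        [{Node, Clock, Status} || {Node, {Clock, Status}} <- Merged].

    %% One round: exchange membership with a random peer, keep the newer entries.
    gossip_round(Membership, Peers, ExchangeFun) ->
        Peer = lists:nth(rand:uniform(length(Peers)), Peers),
        Theirs = ExchangeFun(Peer, Membership),  % send ours, receive theirs
        merge(Membership, Theirs).

ExchangeFun stands in for the actual network exchange. Running one round per second, as the slide describes, spreads a change to the whole cluster in O(log n) rounds with high probability, which is the "exponentially rapid spread" above.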
Dynamo: Membership, cont'd

    Chord
         For large scale clusters
              > O(10K) virtual nodes
         Each node needs to know only O(log v) nodes
              v is the # of virtual nodes
              The nearer on the hash ring, the more to know
         Messages are routed hop by hop
              Hop count is < O(log v)
         For details, see the original paper

    [Figure: routing in Chord; who has "cart/ken"? The brown node knows only the
     yellow nodes]
Dynamo: get/put Operations

    Client
     1.  Sends a request to any Dynamo node
              The request is forwarded to the coordinator
              Coordinator: one of the nodes associated with the key

    Coordinator
     1.  Chooses N nodes by using consistent hashing
     2.  Forwards the request to the N nodes
     3.  Waits for responses from R or W nodes, or times out
     4.  Checks replica versions if get
     5.  Sends a response to the client

    [Figure: get/put operations for N,R,W = 3,2,2; requests and responses flow
     between the client, the coordinator, and the replicas of joe]
Dynamo: get/put Operations, cont'd

    Quorum
         Parameters
              N: # of replicas
              R: min # of successful reads
              W: min # of successful writes
         Conditions
              R+W > N
                   A key can be read from at least one node which has
                    successfully written it
              R < N
                   Provides better latency
              N,R,W = 3,2,2 in common

    [Figure: get/put operations for N,R,W = 3,2,2]
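The coordinator's wait-for-R (or W)-of-N step can be sketched as follows (illustrative; kai_coordinator did not exist yet, and call_node/2 is a stand-in for the real RPC to a replica):

    %% Fan Request out to Nodes; succeed once Quorum replies have arrived.
    quorum_request(Nodes, Request, Quorum, Timeout) ->
        Parent = self(),
        Ref = make_ref(),
        [spawn(fun() -> Parent ! {Ref, call_node(Node, Request)} end) || Node <- Nodes],
        collect(Ref, Quorum, Timeout, []).

    collect(_Ref, 0, _Timeout, Replies) ->
        {ok, Replies};
    collect(Ref, Pending, Timeout, Replies) ->
        receive
            {Ref, Reply} -> collect(Ref, Pending - 1, Timeout, [Reply | Replies])
        after Timeout ->
            {error, timeout}
        end.

    %% Placeholder for the RPC to one replica (an assumption in this sketch).
    call_node(_Node, _Request) -> ok.

A get would run this with Quorum = R and then reconcile the returned versions; a put with Quorum = W.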
Dynamo: Versioning

    Vector Clocks
         Ordering of versions in a distributed manner
              No master, no clock synchronization
              Parallel branches can be detected
         A list of clocks (counters), one per node
              Each clock is managed by its coordinator
              (A:2, C:1) means updated when A's clock was 2 and C's was 1

    [Figure: version histories over time; cause and effect of version (A:1, B:1, C:1)]
         nodeA: (A:0) → (A:1) → (A:2, C:1) → (A:3, B:1, C:1)
         nodeB: (A:1, B:1, C:1) → (A:3, B:2, C:1)
         nodeC: (A:1, C:1) → (A:2, C:2)
         Causes of (A:1, B:1, C:1): (A:0), (A:1), (A:1, C:1); effects: (A:3, B:1, C:1)
          and (A:3, B:2, C:1); independent: (A:2, C:1) and (A:2, C:2)
Dynamo: Versioning, cont'd

    Vector Clocks
         When a coordinator updates a key, it
              Increments its own clock
              Replaces its clock in the key's vector

    [Figure: the same version histories as on the previous slide]
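In code, the update rule is tiny. A sketch with an orddict of {Node, Clock} pairs (kai_version was not implemented at the time of this talk, so the names here are assumptions):

    %% The coordinator increments its own node-wide clock and stamps it
    %% into the key's vector, per the rule above.
    update(Node, OwnClock, VClock) ->
        NewClock = OwnClock + 1,
        {NewClock, orddict:store(Node, NewClock, VClock)}.

For example, update(a, 2, [{a,1},{b,1},{c,1}]) yields {3, [{a,3},{b,1},{c,1}]}, which is the (A:1, B:1, C:1) → (A:3, B:1, C:1) step in the figure.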
Dynamo: Versioning, cont'd

    Vector Clocks
         How to determine the ordering of versions?
              If all clocks of one version are equal to or greater than those of the
               other, it is the more recently updated one
                   e.g. (A:1, B:1, C:1) < (A:3, B:1, C:1)
              Otherwise, they are parallel branches and cannot be ordered
                   e.g. (A:1, B:1, C:1) ? (A:2, C:1)

    [Figure: the same version histories as on the previous slides]
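The ordering rule translates directly into a comparison function (a sketch of the planned kai_version:order/2 behavior, with assumed names; a missing entry counts as clock 0):

    %% true if every clock in V1 is =< the corresponding clock in V2.
    descends(V1, V2) ->
        lists:all(fun({Node, Clock}) -> Clock =< clock_of(Node, V2) end, V1).

    clock_of(Node, VClock) ->
        case orddict:find(Node, VClock) of
            {ok, Clock} -> Clock;
            error       -> 0
        end.

    %% older: V1 happened before V2; newer: the reverse.
    order(V1, V2) ->
        case {descends(V1, V2), descends(V2, V1)} of
            {true, false} -> older;
            {false, true} -> newer;
            _             -> concurrent   % equal or parallel branches
        end.

order([{a,1},{b,1},{c,1}], [{a,3},{b,1},{c,1}]) returns older, and comparing [{a,1},{b,1},{c,1}] with [{a,2},{c,1}] returns concurrent, matching the two examples on the slide.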
Dynamo: Synchronization

    Sync replicas with Merkle trees
         Hierarchical checksums (Merkle trees) are kept up to date
              A tree is constructed for each bucket
         Synchronization is executed periodically or when membership changes

    [Figure: comparison of hierarchical checksums in Merkle trees; node1 and node2
     each hold a tree over the keys win, mac, linux, bsd, with differing root
     checksums (3377 vs 5C37) because the leaf for bsd differs (34B7 vs E1F3)]
Dynamo: Synchronization, cont'd

    Sync replicas with Merkle trees
         Compares Merkle trees with other nodes
              From root to leaves, descending only into subtrees whose checksums differ

    [Figure: node1 and node2 compare trees top-down: 1. roots 3377 vs 5C37 differ;
     2. children 12D5 match but F1C9 vs E334 differ; 3. leaves reveal that only
     34B7 vs E1F3 (key bsd) differs]
Dynamo: Synchronization, cont'd

    Sync replicas with Merkle trees
         Synchronize out-of-date replicas
              In the background, as a low priority task
              If versions cannot be ordered, no sync is done

    [Figure: after the top-down comparison pinpoints key bsd, step 4 syncs that
     key between node1 and node2]
Dynamo: Synchronization, cont'd

    Advantages of Merkle trees
         Comparisons are cheap when most replicas are already synchronized
              If the root checksums are equal, no more comparison is required

    [Figure: the same tree comparison; a single root check suffices when the
     trees match]
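A toy end-to-end version of the scheme (a sketch only; Kai's kai_sync did not yet include Merkle trees, and the balanced-binary layout here is an assumption). Leaves hash each key's value, parents hash the concatenation of their children, and the diff descends only into subtrees whose checksums differ:

    -module(merkle_sketch).
    -export([build/1, diff/2]).

    %% Build a Merkle tree over a sorted [{Key, Value}] list (assumed non-empty;
    %% both replicas are assumed to build over the same key set).
    build([{Key, Value}]) ->
        {leaf, erlang:md5(term_to_binary(Value)), Key};
    build(Pairs) ->
        {L, R} = lists:split(length(Pairs) div 2, Pairs),
        {Left, Right} = {build(L), build(R)},
        {node, erlang:md5(<<(hash_of(Left))/binary, (hash_of(Right))/binary>>),
         Left, Right}.

    hash_of({leaf, H, _}) -> H;
    hash_of({node, H, _, _}) -> H.

    %% Keys whose leaves differ between two same-shaped trees.
    diff(T1, T2) ->
        case hash_of(T1) =:= hash_of(T2) of
            true  -> [];                       % identical subtrees: stop descending
            false -> diff_children(T1, T2)
        end.

    diff_children({leaf, _, Key}, {leaf, _, _}) -> [Key];
    diff_children({node, _, L1, R1}, {node, _, L2, R2}) ->
        diff(L1, L2) ++ diff(R1, R2).

With the four keys from the figure, diff would return [bsd], so only that key needs to be fetched and, if its versions can be ordered, written back.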
Dynamo: Implementation

    Implementation
         Written in Java
         Closed source
    APIs
         Over HTTP
    Storage
         BDB or MySQL
    Security
         No requirements
Kai: Overview

    Kai
         Open source implementation of Amazon's Dynamo
              Named after my origin
              OpenDynamo had already been taken by a project unrelated to Amazon's
               Dynamo
         Written in Erlang
         memcache API
         Found at http://sourceforge.net/projects/kai/
Kai: Building Kai

    Requirements
         Erlang OTP (>= R12B)
         make
    Build
          %    svn co http://kai.svn.sourceforge.net/svnroot/kai/trunk kai
          %    cd kai/
          %    make
          %    make test

         Edit RUN_TEST in the Makefile according to your environment before
          running make test, if not on Mac OS X
              ./configure will be coming
Kai: Configuration

    kai.config
         All parameters are optional since they have default values

    Parameter                 Description                                    Default value
    logfile                   Name of log file                               Standard output
    hostname                  Hostname, which should be specified if the     Auto detection
                              computer has multiple network interfaces
    port                      Port # of internal API                         11011
    memcache_port             Port # of memcache API                         11211
    n, r, w                   N, R, W for quorum                             3, 2, 2
    number_of_buckets         # of buckets, which must correspond with       1024
                              other nodes
    number_of_virtual_nodes   # of virtual nodes                             128
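As a concrete example, a kai.config spelling out these defaults might look like the following (a sketch assuming the standard Erlang application-config format that erl -config expects; the value types, e.g. the logfile string, are assumptions):

    % kai.config
    [{kai, [{logfile, "kai.log"},
            {port, 11011},
            {memcache_port, 11211},
            {n, 3}, {r, 2}, {w, 2},
            {number_of_buckets, 1024},
            {number_of_virtual_nodes, 128}]}].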
Kai: Running Kai

    Run as a stand-alone server
      % erl -pa src -config kai -kai n 1 -kai r 1 -kai w 1

      1> application:load(kai).
      2> application:start(kai).

         Arguments
            Argument                     Description
            -pa src                      Adds directory src to the load path
            -config kai                  Loads configuration file kai.config
            -kai n 1 -kai r 1 -kai w 1   Overwrites configuration as N, R, W = 1, 1, 1

         Access 127.0.0.1:11211 with your memcache client (a raw-protocol example
          follows)
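For a quick smoke test without a client library, you can speak the memcache text protocol directly over telnet (a sketch; the session follows the standard memcache protocol, with exptime 0 as required by kai_memcache, and assumes Kai replies like a stock memcached):

    % telnet 127.0.0.1 11211
    set key 0 0 5
    value
    STORED
    get key
    VALUE key 0 5
    value
    END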
Kai: Running Kai, cont'd

    Run as a cluster system
         Start three nodes on different port #'s on a single computer

          Terminal 1
          % erl -pa src -config kai -kai port 11011 -kai memcache_port 11211

          1> application:load(kai).
          2> application:start(kai).

          Terminal 2
          % erl -pa src -config kai -kai port 11012 -kai memcache_port 11212

          1> application:load(kai).
          2> application:start(kai).

          Terminal 3
          % erl -pa src -config kai -kai port 11013 -kai memcache_port 11213

          1> application:load(kai).
          2> application:start(kai).
Kai: Running Kai, cont'd

    Run as a cluster system
         Connect all nodes by informing them of their neighbors

          3> kai_api:check_node({{127,0,0,1}, 11011}, {{127,0,0,1}, 11012}).
          4> kai_api:check_node({{127,0,0,1}, 11012}, {{127,0,0,1}, 11013}).
          5> kai_api:check_node({{127,0,0,1}, 11013}, {{127,0,0,1}, 11011}).

         Access 127.0.0.1:11211-11213 with your memcache client

         Adding a new node to an existing cluster
          % Link the new node to the existing cluster
          % (kai_api:check_node/2 establishes a one-directional link)
          1> kai_api:check_node(NewNode, NodeInCluster).

          % Wait until the buckets are synchronized...

          % Link a node in the cluster to the new node
          2> kai_api:check_node(NodeInCluster, NewNode).
Kai: Internals

    Function          Module          Comments
    Partitioning      kai_hash        • Not considering physical placement
    Membership        kai_network     • Not including Chord
                                      • Will be renamed to kai_membership
    Coordinator       kai_memcache    • Will be re-implemented in kai_coordinator
    Versioning        (none)          • Not implemented yet
    Synchronization   kai_sync        • Not including Merkle trees
    Storage           kai_store       • Using ets
    Internal API      kai_api
    API               kai_memcache    • get, set, and delete
    Logging           kai_log
    Configuration     kai_config
    Supervisor        kai_sup
Kai: kai_hash

    Base module
         gen_server
    Current status
         Consistent hashing
              32-bit hash space
              Virtual nodes
              Buckets
    Future work
         Physical placement
         Parallel access will be allowed for read operations
              To avoid long waits while re-hashing
              By not using gen_server:call

    Synopsis
          kai_hash:start_link(),

          % Re-hashes when membership changes
          {replaced_buckets, ListOfReplacedBuckets} =
              kai_hash:update_nodes(ListOfNodesToAdd, ListOfNodesToRemove),

          % Finds nodes associated with Key to get/put
          {nodes, ListOfNodes} = kai_hash:find_nodes(Key).
Kai: kai_network,
will be renamed to kai_membership

    Base module
         gen_fsm
    Current status
         Gossip-based protocol
              Built-in distributed facilities, such as EPMD, are not used,
               since they don't seem to be scalable
    Future work
         Chord or Kademlia
              Kademlia is used by BitTorrent

    Synopsis
          kai_network:start_link(),

          % Checks whether Node is alive, and updates consistent hashing if needed
          kai_network:check_node(Node),

          % In the background, kai_network checks a node at random every second
Kai: kai_coordinator,
now implemented in kai_memcache

    Base module
         gen_server (planned)
    Current status
         Implemented in kai_memcache
         Quorum
    Future work
         Separated from kai_memcache
         Requests from clients will be routed to coordinators
              Currently, the node receiving a request behaves as the coordinator

    Synopsis (planned)
          kai_coordinator:start_link(),

          % Sends a get request to N nodes, and returns received data to kai_memcache
          Data = kai_coordinator:get(Key),

          % Sends a put request to N nodes, where Data is a variable of the data record
          kai_coordinator:put(Data).
Kai: kai_version

    Base module
         gen_server (planned)
    Current status
         Not implemented yet
    Future work
         Vector clocks

    Synopsis (planned)
          kai_version:start_link(),

          % Updates the clock of LocalNode in VectorClocks
          kai_version:update(VectorClocks, LocalNode),

          % Determines the ordering of vector clocks
          {order, Order} = kai_version:order(VectorClocks1, VectorClocks2).
Kai: kai_sync

    Base module
         gen_fsm
    Current status
         Syncs data only if it does not exist locally
              Versions are not compared
    Future work
         Bulk transport
              Downloads a whole bucket when membership changes
         Parallel download
              Synchronizes a bucket with multiple nodes
         Merkle trees

    Synopsis
          kai_sync:start_link(),

          % Compares the key list in Bucket with other nodes,
          % and retrieves the keys which don't exist locally
          kai_sync:update_bucket(Bucket),

          % In the background, kai_sync synchronizes a bucket at random every second
Kai: kai_store

    Base module
         gen_server
    Current status
         ets, the Erlang built-in memory storage, is used
    Future work
         Persistent storage without capacity limits
              dets, mnesia, or MySQL
              Unfortunately, tables in dets and mnesia are limited to 4GB
         Delayed deletion

    Synopsis
          kai_store:start_link(),

          % Retrieves Data associated with Key
          Data = kai_store:get(Key),

          % Stores Data, which is a variable of the data record
          kai_store:put(Data).
Kai: kai_api

    Base module
         gen_tcp
    Current status
         Internal API
              RPC for kai_hash, kai_store, and kai_network
    Future work
         Process pool
              Restricts the max # of processes that receive API calls
         Connection pool
              Re-uses TCP connections

    Synopsis
          kai_api:start_link(),

          % Retrieves the node list from Node
          {node_list, ListOfNodes} = kai_api:node_list(Node),

          % Retrieves Data associated with Key from Node
          Data = kai_api:get(Node, Key).
Kai: kai_memcache

    Base module
         gen_tcp
    Current status
         get, set, and delete of the memcache API are supported
              exptime of set must be zero, since Kai is a persistent storage
              get returns multiple values if multiple versions are found
    Future work
         cas and stats
         Process pool
              Restricts the max # of processes that receive API calls

    Synopsis in Ruby
          require 'memcache'

          cache = MemCache.new '127.0.0.1:11211'

          # Stores 'value' with 'key'
          cache['key'] = 'value'

          # Retrieves data associated with 'key'
          p cache['key']
Kai: Testing

    Execution
         Done by make test
    Implementation
         common_test
              Built-in testing framework
         Test servers
              Currently written from scratch
              Can test_server be used?
Kai: Miscellaneous

    Node ID
     % Nodes are identified by the socket addresses of their internal API
     {Addr, Port} = {{192,168,1,1}, 11011}.

    Data structures
     % This record includes the value as well as metadata
     -record(data,     {key, bucket, last_modified, checksum, flags, value}).

     % This one includes only metadata, such as key, bucket #, and MD5 checksum
     -record(metadata, {key, bucket, last_modified, checksum}).

    Return values
     % Returns 'undefined' if data associated with Key is not found
     undefined = kai_store:get(Key).

     % Returns Reason with an 'error' tag when something goes wrong
     {error, Reason} = function(Args).
Kai: Roadmap

    1.  Basic implementation
         Current version

    2.  Almost Dynamo
        Module            Task
        kai_hash          Parallel access will be allowed for read operations
        kai_coordinator   Requests from clients will be routed to coordinators
        kai_version       Vector clocks
        kai_sync          Bulk and parallel transport
        kai_store         Persistent storage
        kai_api           Process pool
        kai_memcache      Process pool and cas
Kai: Roadmap, cont'd

    3.  Dynamo
        Module            Task
        kai_hash          Physical placement
        kai_membership    Chord or Kademlia
        kai_sync          Merkle trees
        kai_store         Delayed deletion
        kai_api           Connection pool
        kai_memcache      stats

    On development environments
         configure, test_server
Conclusion

    Join us if you are interested!

                http://sourceforge.net/projects/kai/

Más contenido relacionado

Destacado

GT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL DatabaseGT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL DatabaseRob Tweed
 
分散Key/Valueストア Kai 事例紹介
分散Key/Valueストア Kai事例紹介分散Key/Valueストア Kai事例紹介
分散Key/Valueストア Kai 事例紹介Tomoya Hashimoto
 
Process Design and Polymorphism: Lessons Learnt from Development of Kai
Process Design and Polymorphism: Lessons Learnt from Development of KaiProcess Design and Polymorphism: Lessons Learnt from Development of Kai
Process Design and Polymorphism: Lessons Learnt from Development of KaiTakeru INOUE
 
Amazon's Dynamo in POE and Erlang @ YAPC::Asia 2008 LT
Amazon's Dynamo in POE and Erlang @ YAPC::Asia 2008 LTAmazon's Dynamo in POE and Erlang @ YAPC::Asia 2008 LT
Amazon's Dynamo in POE and Erlang @ YAPC::Asia 2008 LTTakeru INOUE
 
Kai = (Dynamo + memcache API) / Erlang
Kai = (Dynamo + memcache API) / ErlangKai = (Dynamo + memcache API) / Erlang
Kai = (Dynamo + memcache API) / ErlangTakeru INOUE
 
分散ストレージに使えるかもしれないアルゴリズム
分散ストレージに使えるかもしれないアルゴリズム分散ストレージに使えるかもしれないアルゴリズム
分散ストレージに使えるかもしれないアルゴリズムTakeru INOUE
 
HBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationSchubert Zhang
 
NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"
NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"
NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"DataStax Academy
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthdaveconnors
 
Database system concepts
Database system conceptsDatabase system concepts
Database system conceptsKumar
 
MEP Engineers on the way - How to improve MEP modeling with Dynamo
MEP Engineers on the way - How to improve MEP modeling with DynamoMEP Engineers on the way - How to improve MEP modeling with Dynamo
MEP Engineers on the way - How to improve MEP modeling with DynamoCesare Caoduro
 
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...DataStax Academy
 
Meetup 19/12/2016 - Blockchain-as-a-service voor Antwerpen?
Meetup 19/12/2016 - Blockchain-as-a-service voor Antwerpen?Meetup 19/12/2016 - Blockchain-as-a-service voor Antwerpen?
Meetup 19/12/2016 - Blockchain-as-a-service voor Antwerpen?Digipolis Antwerpen
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Jay Patel
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraAdrian Cockcroft
 
Basic Study for Erlang #1
Basic Study for Erlang #1Basic Study for Erlang #1
Basic Study for Erlang #1Masahito Ikuta
 
The Social Semantic Web: An Introduction
The Social Semantic Web: An IntroductionThe Social Semantic Web: An Introduction
The Social Semantic Web: An IntroductionJohn Breslin
 

Destacado (20)

Sparksee overview
Sparksee overviewSparksee overview
Sparksee overview
 
GT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL DatabaseGT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL Database
 
分散Key/Valueストア Kai 事例紹介
分散Key/Valueストア Kai事例紹介分散Key/Valueストア Kai事例紹介
分散Key/Valueストア Kai 事例紹介
 
Process Design and Polymorphism: Lessons Learnt from Development of Kai
Process Design and Polymorphism: Lessons Learnt from Development of KaiProcess Design and Polymorphism: Lessons Learnt from Development of Kai
Process Design and Polymorphism: Lessons Learnt from Development of Kai
 
Amazon's Dynamo in POE and Erlang @ YAPC::Asia 2008 LT
Amazon's Dynamo in POE and Erlang @ YAPC::Asia 2008 LTAmazon's Dynamo in POE and Erlang @ YAPC::Asia 2008 LT
Amazon's Dynamo in POE and Erlang @ YAPC::Asia 2008 LT
 
Kai = (Dynamo + memcache API) / Erlang
Kai = (Dynamo + memcache API) / ErlangKai = (Dynamo + memcache API) / Erlang
Kai = (Dynamo + memcache API) / Erlang
 
分散ストレージに使えるかもしれないアルゴリズム
分散ストレージに使えるかもしれないアルゴリズム分散ストレージに使えるかもしれないアルゴリズム
分散ストレージに使えるかもしれないアルゴリズム
 
HBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance Evaluation
 
Dynamo db
Dynamo dbDynamo db
Dynamo db
 
NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"
NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"
NYC* 2013 — "Using Cassandra for DVR Scheduling at Comcast"
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per month
 
Database system concepts
Database system conceptsDatabase system concepts
Database system concepts
 
MEP Engineers on the way - How to improve MEP modeling with Dynamo
MEP Engineers on the way - How to improve MEP modeling with DynamoMEP Engineers on the way - How to improve MEP modeling with Dynamo
MEP Engineers on the way - How to improve MEP modeling with Dynamo
 
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
C* Summit 2013: Time for a New Relationship - Intuit's Journey from RDBMS to ...
 
Meetup 19/12/2016 - Blockchain-as-a-service voor Antwerpen?
Meetup 19/12/2016 - Blockchain-as-a-service voor Antwerpen?Meetup 19/12/2016 - Blockchain-as-a-service voor Antwerpen?
Meetup 19/12/2016 - Blockchain-as-a-service voor Antwerpen?
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Basic Study for Erlang #1
Basic Study for Erlang #1Basic Study for Erlang #1
Basic Study for Erlang #1
 
The Social Semantic Web: An Introduction
The Social Semantic Web: An IntroductionThe Social Semantic Web: An Introduction
The Social Semantic Web: An Introduction
 
AWS Cost Visualizer
AWS Cost VisualizerAWS Cost Visualizer
AWS Cost Visualizer
 

Similar a Kai – An Open Source Implementation of Amazon’s Dynamo

Scaling Rails with memcached
Scaling Rails with memcachedScaling Rails with memcached
Scaling Rails with memcachedelliando dias
 
Perl Intro 3 Datalog Parsing
Perl Intro 3 Datalog ParsingPerl Intro 3 Datalog Parsing
Perl Intro 3 Datalog ParsingShaun Griffith
 
Self Adaptive Simulated Binary Crossover
Self Adaptive Simulated Binary CrossoverSelf Adaptive Simulated Binary Crossover
Self Adaptive Simulated Binary Crossoverpaskorn
 
Benchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseBenchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseChristopher Choi
 
第17回Cassandra勉強会: MyCassandra
第17回Cassandra勉強会: MyCassandra第17回Cassandra勉強会: MyCassandra
第17回Cassandra勉強会: MyCassandraShun Nakamura
 
When Crypto Attacks! (Yahoo 2009)
When Crypto Attacks! (Yahoo 2009)When Crypto Attacks! (Yahoo 2009)
When Crypto Attacks! (Yahoo 2009)Nate Lawson
 
Google_A_Behind_the_Scenes_Tour_-_Jeff_Dean
Google_A_Behind_the_Scenes_Tour_-_Jeff_DeanGoogle_A_Behind_the_Scenes_Tour_-_Jeff_Dean
Google_A_Behind_the_Scenes_Tour_-_Jeff_DeanHiroshi Ono
 
Zh Tw Introduction To H Base
Zh Tw Introduction To H BaseZh Tw Introduction To H Base
Zh Tw Introduction To H Basekevin liao
 

Similar a Kai – An Open Source Implementation of Amazon’s Dynamo (14)

All The Little Pieces
All The Little PiecesAll The Little Pieces
All The Little Pieces
 
Scaling Rails with memcached
Scaling Rails with memcachedScaling Rails with memcached
Scaling Rails with memcached
 
Perl Intro 3 Datalog Parsing
Perl Intro 3 Datalog ParsingPerl Intro 3 Datalog Parsing
Perl Intro 3 Datalog Parsing
 
Self Adaptive Simulated Binary Crossover
Self Adaptive Simulated Binary CrossoverSelf Adaptive Simulated Binary Crossover
Self Adaptive Simulated Binary Crossover
 
Gpu Join Presentation
Gpu Join PresentationGpu Join Presentation
Gpu Join Presentation
 
Benchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBaseBenchmarking MongoDB and CouchBase
Benchmarking MongoDB and CouchBase
 
Vidoop CouchDB Talk
Vidoop CouchDB TalkVidoop CouchDB Talk
Vidoop CouchDB Talk
 
第17回Cassandra勉強会: MyCassandra
第17回Cassandra勉強会: MyCassandra第17回Cassandra勉強会: MyCassandra
第17回Cassandra勉強会: MyCassandra
 
When Crypto Attacks! (Yahoo 2009)
When Crypto Attacks! (Yahoo 2009)When Crypto Attacks! (Yahoo 2009)
When Crypto Attacks! (Yahoo 2009)
 
Google_A_Behind_the_Scenes_Tour_-_Jeff_Dean
Google_A_Behind_the_Scenes_Tour_-_Jeff_DeanGoogle_A_Behind_the_Scenes_Tour_-_Jeff_Dean
Google_A_Behind_the_Scenes_Tour_-_Jeff_Dean
 
Ndp Slides
Ndp SlidesNdp Slides
Ndp Slides
 
NTLM
NTLMNTLM
NTLM
 
Zh Tw Introduction To H Base
Zh Tw Introduction To H BaseZh Tw Introduction To H Base
Zh Tw Introduction To H Base
 
Web 2.0 101
Web 2.0 101Web 2.0 101
Web 2.0 101
 

Último

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 

Último (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 

Kai – An Open Source Implementation of Amazon’s Dynamo

  • 1. Kai – An Open Source Implementation of Amazon’s Dynamo takemaru 08.6.17
  • 2. Outline   Amazon’s Dynamo   Motivation   Features   Algorithms   Kai   Build and Run   Internals   Roadmap 08.6.17
  • 3. Dynamo: Motivation   Largest e-commerce site   75K query/sec (my estimation)   500 req/sec * 150 query/req   O(10M) users and Many items   Why not RDBMS?   Not easy to scaling-out or load balancing   Many components only need primary key access   Databases are required just for Amazon   Dynamo for primary key accesses   SimpleDB for complex queries   S3 for large files   Dynamo is used for   Shopping carts, customer preferences, session management, sales rank, and product catalogs 08.6.17
  • 4. Dynamo: Features   Key, value store   Distributed hash table   High scalability   No master, peer-to-peer   Large scale cluster, maybe O(1K) get(“cart/joe”)   Fault tolerant   Even if an entire data center fails joe   Meets latency requirements in the case joe joe 08.6.17
  • 5. Dynamo: Features, cont’d   Service Level Agreements   < 300ms for 99.9% queries   On the average, 15ms for reads and 30ms for writes   High availability   No lock and always writable get(“cart/joe”)   Eventually Consistent   Replicas are loosely synchronized joe   Inconsistencies are resolved by clients Tradeoff between availability and consistency -  RDBMS chooses consistency joe -  Dynamo prefers availability joe 08.6.17
  • 6. Dynamo: Overview   Dynamo cluster (instance) joe   Consists of equivalent nodes bob   Has N replicas for each key bob joe bob   Dynamo APIs   get(key) joe   put(key, value, context)   Context is a kind of version, like cas of memcache Dynamo of N = 3   delete is not described in the paper (I guess defined)   Requirements for clients   Don’t need to know ALL nodes, unlike memcache clients   Requests can be sent to any node 08.6.17
  • 7. Dynamo: Partitioning   Consistent Hashing 2^128 0 1 2   Nodes and keys are positioned at their hash hash(node5) values hash(node4)   MD5 (128bits)   Keys are stored in the following N nodes hash(node6) hash(node2) hash(node3) joe hash(cart/joe) hash(node1) Hash ring (N=3) 08.6.17
  • 8. Dynamo: Partitioning, cont’d   Advantages 2^128 0 1 2   Small # of keys are remapped, when hash(node5) membership is changed hash(node4)   Between down node and its Nth predecessor hash(node6) hash(node2) hash(node3) joe hash(cart/joe) hash(node1) Remapped range Node2 is down 08.6.17
  • 9. in data center A Dynamo: Partitioning in data center B   Physical placement of 2^128 replicas 0 1 2   Each key is replicated hash(node5) across multiple data centers hash(node4)   Arrangement scheme has not been revealed   Netmask is helpful, I guess   Impact on replica distribution is unknown hash(node6) hash(node2) hash(node3) joe   Advantages hash(cart/joe)   No data outage on data hash(node1) center failures joe is replicated in multiple data centers 08.6.17
  • 10. Dynamo: Partitioning, cont’d   Virtual nodes 2^128   Multiple positions are taken hash(node3-1) 0 1 2 by a single physical node hash(node5)   O(100) virtual/physical hash(node4) hash(node1-1) hash(node5-1)   Advantages hash(node2-1)   Keys are more uniformly hash(node6-1) distributed statistically hash(node6) hash(node2)   Remapped keys are evenly hash(node3) dispersed across nodes   # of virtual nodes can be determined based on hash(node5-1) hash(node1) capacity of physical node Two virtual nodes per each physical node
  • 11. Dynamo: Partitioning, cont'd
    - Buckets (a one-line sketch follows below)
      - The hash ring is divided equally into buckets
      - There should be more buckets than virtual nodes in total
      - Keys in the same bucket are mapped to the same nodes
    - Advantages
      - Keys can easily be synchronized bucket by bucket
        - Used by the Merkle trees discussed later
    (Figure: ring divided into 8 buckets; bob and joe fall into the same bucket)
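    - Because the buckets divide the ring evenly, a key's bucket follows directly from its hash. A sketch under that assumption, again reusing hash/1 from above (1024 is Kai's default bucket count and divides the 128-bit space evenly):

      %% Maps a key to its bucket by splitting the hash space into
      %% NumBuckets equal ranges.
      bucket_of(Key, NumBuckets) ->
          hash(Key) div ((1 bsl 128) div NumBuckets).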
  • 12. Dynamo: Membership
    - Gossip-based protocol (a merge sketch follows below)
      - Spreads membership like a rumor
      - Membership contains the node list and its change history
      - Each node exchanges membership with a randomly chosen node every second
      - A node updates its membership if it receives a more recent one
    - Advantages
      - Robust; no one can prevent a rumor from spreading
      - Exponentially rapid spread
    (Figure: news that node2 is down spreading through the cluster as gossip)
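    - A gossip round boils down to merging two membership views and keeping the fresher entry for each node. The representation below, an orddict from node to {Version, Status}, is invented for illustration and is not Kai's format:

      %% Merge two membership views; the higher version wins per node.
      merge_membership(Mine, Theirs) ->
          orddict:merge(fun(_Node, {V1, S1}, {V2, S2}) ->
                            case V1 >= V2 of
                                true  -> {V1, S1};
                                false -> {V2, S2}
                            end
                        end, Mine, Theirs).

      %% Every second a node would pick a random peer, fetch its view,
      %% and replace its own with merge_membership(MyView, PeerView).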
  • 13. Dynamo: Membership, cont'd
    - Chord
      - For large-scale clusters, e.g. more than O(10K) virtual nodes
      - Each node needs to know only O(log v) other nodes, where v is the # of virtual nodes
        - The nearer on the hash ring, the more a node knows
      - Messages are routed hop by hop
        - The hop count is O(log v)
      - For details, see the original Chord paper
    (Figure: routing in Chord; the brown node knows only the yellow nodes, and the query "Who has cart/ken?" is forwarded hop by hop)
  • 14. Dynamo: get/put Operations
    - Client
      1. Sends a request to any Dynamo node
        - The request is forwarded to a coordinator
        - Coordinator: one of the nodes associated with the key
    - Coordinator
      1. Chooses N nodes by consistent hashing
      2. Forwards the request to the N nodes
      3. Waits for responses from R or W nodes, or times out
      4. Checks replica versions if the request is a get
      5. Sends a response to the client
    (Figure: request/response flow of get/put operations for N,R,W = 3,2,2)
  • 15. Dynamo: get/put Operations, cont'd
    - Quorum (a sketch of the quorum wait follows below)
      - Parameters
        - N: # of replicas
        - R: min # of successful reads
        - W: min # of successful writes
      - Conditions
        - R + W > N
          - A key can then be read from at least one node that has successfully written it
        - R < N
          - Provides better latency
      - N,R,W = 3,2,2 is a common setting
    (Figure: request/response flow of get/put operations for N,R,W = 3,2,2)
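    - As a rough illustration of the quorum wait, this sketch fans a put out to N nodes and blocks until W acknowledgements arrive or a timeout fires; send_put/3 and the {Ref, ok} message shape are hypothetical, not Kai's internal API:

      quorum_put(Nodes, Key, Value, W, Timeout) ->
          Ref  = make_ref(),
          Self = self(),
          [spawn(fun() -> Self ! {Ref, send_put(Node, Key, Value)} end)
           || Node <- Nodes],
          wait_acks(Ref, W, Timeout).

      wait_acks(_Ref, 0, _Timeout) ->
          ok;                                % W acks received: success
      wait_acks(Ref, W, Timeout) ->
          receive
              {Ref, ok} -> wait_acks(Ref, W - 1, Timeout)
          after Timeout ->
              {error, timeout}               % too few replicas answered
          end.

      %% Hypothetical transport stub; a real node would be called over
      %% the internal API (the timeout above is per message, for brevity).
      send_put(_Node, _Key, _Value) -> ok.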
  • 16. Dynamo: Versioning
    - Vector clocks
      - Order versions in a distributed manner
        - No master, no clock synchronization
        - Parallel branches can be detected
      - A list of clocks (counters), one per node
        - Each clock is managed by its coordinator
        - (A:2, C:1) means the value was updated when A's clock was 2 and C's was 1
    (Figure: causes, effects, and independent branches of version (A:1, B:1, C:1) across nodes A, B, and C over time)
  • 17. Dynamo: Versioning, cont'd
    - Vector clocks
      - When a coordinator updates a key, it:
        - Increments its own clock
        - Replaces its clock in the key's vector
    (Figure: the same version history as the previous slide)
  • 18. Dynamo: Versioning, cont'd
    - Vector clocks (see the sketch below)
      - How is the ordering of versions determined?
      - If every clock in one vector is equal to or greater than the corresponding clock in the other
        - That version is the more recent one
        - e.g. (A:1, B:1, C:1) < (A:3, B:1, C:1)
      - Otherwise
        - The versions are parallel branches and cannot be ordered
        - e.g. (A:1, B:1, C:1) ? (A:2, C:1)
    (Figure: the same version history as the previous slides)
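    - The ordering rule translates almost directly into Erlang. This sketch borrows the update/2 and order/2 names planned for kai_version later in this deck, but the orddict representation, e.g. [{a,1},{b,1},{c,1}], is an assumption:

      -module(vclock_sketch).
      -export([update/2, order/2]).

      %% A coordinator increments its own clock in the key's vector.
      update(VClock, Node) ->
          orddict:update_counter(Node, 1, VClock).

      %% descends(V1, V2): every clock in V1 >= the matching clock in V2.
      descends(V1, V2) ->
          lists:all(fun({Node, C2}) ->
                        case orddict:find(Node, V1) of
                            {ok, C1} -> C1 >= C2;
                            error    -> false
                        end
                    end, V2).

      order(V1, V2) ->
          case {descends(V1, V2), descends(V2, V1)} of
              {true,  true}  -> equal;
              {true,  false} -> greater;   % V1 is more recent
              {false, true}  -> less;      % V2 is more recent
              {false, false} -> concurrent % parallel branches
          end.

    - With the slide's examples, order([{a,1},{b,1},{c,1}], [{a,3},{b,1},{c,1}]) returns less, and order([{a,1},{b,1},{c,1}], [{a,2},{c,1}]) returns concurrent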
  • 19. Dynamo: Synchronization
    - Replicas are synchronized with Merkle trees
      - Hierarchical checksums (a Merkle tree) are kept up to date
      - A tree is constructed for each bucket
      - Synchronization runs periodically, or when membership changes
    (Figure: comparison of hierarchical checksums in the Merkle trees of node1 and node2)
  • 20. Dynamo: Synchronization, cont'd
    - Replicas are synchronized with Merkle trees
      - A node compares its Merkle tree with those of other nodes
      - From the root toward the leaves, until the checksums match
    (Figure: the comparison descends the two trees level by level)
  • 21. Dynamo: Synchronization, cont'd
    - Replicas are synchronized with Merkle trees
      - Out-of-date replicas are synchronized
        - In the background, as a low-priority task
      - If the versions cannot be ordered, no sync is done
    (Figure: after the comparison, the differing leaf is synchronized)
  • 22. Dynamo: Synchronization, cont'd
    - Advantages of Merkle trees (a sketch of the descent follows below)
      - Comparisons are cheap when most replicas are already synchronized
        - If the root checksums are equal, no further comparison is required
    (Figure: the same Merkle-tree comparison as the previous slides)
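    - The descent can be written as a short recursion over two checksum trees. The {node, Checksum, Children} / {leaf, Checksum, Range} shape is invented for illustration, and both trees are assumed to cover the bucket with the same structure:

      %% Collects the leaf ranges whose checksums differ between two trees.
      diff({node, Sum, _}, {node, Sum, _}) ->
          [];                               % equal checksums: stop descending
      diff({node, _, Kids1}, {node, _, Kids2}) ->
          lists:append(lists:zipwith(fun diff/2, Kids1, Kids2));
      diff({leaf, Sum, _}, {leaf, Sum, _}) ->
          [];
      diff({leaf, _, Range}, {leaf, _, _}) ->
          [Range].                          % this key range needs syncing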
  • 23. Dynamo: Implementation
    - Implementation
      - Written in Java
      - Closed source
    - APIs
      - Over HTTP
    - Storage
      - BDB or MySQL
    - Security
      - No requirements
  • 24. Kai: Overview
    - Kai
      - An open source implementation of Amazon's Dynamo
      - Named after my origin
        - OpenDynamo had already been taken by a project unrelated to Amazon's Dynamo
      - Written in Erlang
      - Speaks the memcache API
      - Found at http://sourceforge.net/projects/kai/
  • 25. Kai: Building Kai
    - Requirements
      - Erlang OTP (>= R12B)
      - make
    - Build
      % svn co http://kai.svn.sourceforge.net/svnroot/kai/trunk kai
      % cd kai/
      % make
      % make test
    - Edit RUN_TEST in the Makefile according to your environment before make test, if you are not on Mac OS X
    - A ./configure script is coming
  • 26. Kai: Configuration
    - kai.config (an example file follows below)
      - All parameters are optional, since each has a default value

      Parameter               | Description                                                        | Default value
      ------------------------|--------------------------------------------------------------------|----------------
      logfile                 | Name of the log file                                               | Standard output
      hostname                | Hostname; specify it if the computer has multiple network interfaces | Auto detection
      port                    | Port # of the internal API                                         | 11011
      memcache_port           | Port # of the memcache API                                         | 11211
      n, r, w                 | N, R, W for quorum                                                 | 3, 2, 2
      number_of_buckets       | # of buckets; must match the other nodes                           | 1024
      number_of_virtual_nodes | # of virtual nodes                                                 | 128
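    - For reference, a kai.config built from the table above might look like this; the file uses Erlang's standard application-configuration syntax, and the values shown are only examples:

      %% Example kai.config; every entry is optional (defaults above).
      [{kai, [{logfile, "kai.log"},
              {hostname, "10.0.0.1"},     % set this on multi-homed machines
              {port, 11011},
              {memcache_port, 11211},
              {n, 3}, {r, 2}, {w, 2},
              {number_of_buckets, 1024},  % must match the other nodes
              {number_of_virtual_nodes, 128}]}].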
  • 27. Kai: Running Kai
    - Run as a standalone server
      % erl -pa src -config kai -kai n 1 -kai r 1 -kai w 1
      1> application:load(kai).
      2> application:start(kai).
    - Arguments

      Argument                   | Description
      ---------------------------|-----------------------------------------------------
      -pa src                    | Adds the directory src to the load path
      -config kai                | Loads the configuration file kai.config
      -kai n 1 -kai r 1 -kai w 1 | Overrides the configuration with N, R, W = 1, 1, 1
    - Access 127.0.0.1:11211 with your memcache client
  • 28. Kai: Running Kai, cont'd
    - Run as a cluster system
      - Start three nodes on different port #'s on a single computer

      Terminal 1
      % erl -pa src -config kai -kai port 11011 -kai memcache_port 11211
      1> application:load(kai).
      2> application:start(kai).

      Terminal 2
      % erl -pa src -config kai -kai port 11012 -kai memcache_port 11212
      1> application:load(kai).
      2> application:start(kai).

      Terminal 3
      % erl -pa src -config kai -kai port 11013 -kai memcache_port 11213
      1> application:load(kai).
      2> application:start(kai).
  • 29. Kai: Running Kai, cont'd
    - Run as a cluster system
      - Connect the nodes by telling each one about a neighbor
      3> kai_api:check_node({{127,0,0,1}, 11011}, {{127,0,0,1}, 11012}).
      4> kai_api:check_node({{127,0,0,1}, 11012}, {{127,0,0,1}, 11013}).
      5> kai_api:check_node({{127,0,0,1}, 11013}, {{127,0,0,1}, 11011}).
      - Access 127.0.0.1:11211-11213 with your memcache client
    - Adding a new node to an existing cluster
      % Link the new node to the existing cluster
      % (kai_api:check_node/2 establishes a one-directional link)
      1> kai_api:check_node(NewNode, NodeInCluster).
      % Wait until the buckets have been synchronized…
      % Link a node in the cluster to the new node
      2> kai_api:check_node(NodeInCluster, NewNode).
  • 30. Kai: Internals

      Function        | Module       | Comments
      ----------------|--------------|------------------------------------------------------
      Partitioning    | kai_hash     | Physical placement not considered
      Membership      | kai_network  | Chord not included; will be renamed to kai_membership
      Coordinator     | kai_memcache | Will be re-implemented in kai_coordinator
      Versioning      | —            | Not implemented yet
      Synchronization | kai_sync     | Merkle trees not included
      Storage         | kai_store    | Uses ets
      Internal API    | kai_api      |
      API             | kai_memcache | get, set, and delete
      Logging         | kai_log      |
      Configuration   | kai_config   |
      Supervisor      | kai_sup      |
  • 31. Kai: kai_hash
    - Base module
      - gen_server
    - Current status
      - Consistent hashing
      - 32-bit hash space
      - Virtual nodes
      - Buckets
    - Future work
      - Physical placement
      - Parallel access will be allowed for read operations
        - To avoid long waits during re-hashing
        - By not using gen_server:call
    - Synopsis
      kai_hash:start_link(),
      % Re-hashes when membership changes
      {replaced_buckets, ListOfReplacedBuckets} =
          kai_hash:update_nodes(ListOfNodesToAdd, ListOfNodesToRemove),
      % Finds the nodes associated with Key for get/put
      {nodes, ListOfNodes} = kai_hash:find_nodes(Key).
  • 32. Kai: kai_network (will be renamed to kai_membership)
    - Base module
      - gen_fsm
    - Current status
      - Gossip-based protocol
      - Built-in distribution facilities, such as EPMD, are not used, since they don't seem to be scalable
    - Future work
      - Chord or Kademlia
        - Kademlia is used by BitTorrent
    - Synopsis
      kai_network:start_link(),
      % Checks whether Node is alive, and updates the consistent hashing if needed
      kai_network:check_node(Node),
      % In the background, kai_network checks a node at random every second
  • 33. Kai: kai_coordinator (currently implemented in kai_memcache)
    - Base module
      - gen_server (planned)
    - Current status
      - Implemented in kai_memcache
        - Currently, the node that receives a request behaves as the coordinator
      - Quorum
    - Future work
      - Will be separated from kai_memcache
      - Requests from clients will be routed to coordinators
    - Synopsis (planned)
      kai_coordinator:start_link(),
      % Sends a get request to N nodes, and returns the received data to kai_memcache
      Data = kai_coordinator:get(Key),
      % Sends a put request to N nodes, where Data is a variable of the data record
      kai_coordinator:put(Data).
  • 34. Kai: kai_version
    - Base module
      - gen_server (planned)
    - Current status
      - Not implemented yet
    - Future work
      - Vector clocks
    - Synopsis (planned)
      kai_version:start_link(),
      % Updates the clock of LocalNode in VectorClocks
      kai_version:update(VectorClocks, LocalNode),
      % Determines the ordering of two vector clocks
      {order, Order} = kai_version:order(VectorClocks1, VectorClocks2).
  • 35. Kai: kai_sync
    - Base module
      - gen_fsm
    - Current status
      - Data are synced only if they do not exist locally
        - Versions are not compared
    - Future work
      - Bulk transport
        - Download a whole bucket when membership changes
      - Parallel download
        - Synchronize a bucket with multiple nodes
      - Merkle trees
    - Synopsis
      kai_sync:start_link(),
      % Compares the key list of Bucket with other nodes,
      % and retrieves the keys that don't exist locally
      kai_sync:update_bucket(Bucket),
      % In the background, kai_sync synchronizes a bucket at random every second
  • 36. Kai: kai_store
    - Base module
      - gen_server
    - Current status
      - ets, Erlang's built-in memory storage, is used (a sketch follows below)
    - Future work
      - Persistent storage without a capacity limit
        - dets, mnesia, or MySQL
        - Unfortunately, dets and mnesia tables are limited to < 4GB
      - Delayed deletion
    - Synopsis
      kai_store:start_link(),
      % Retrieves the Data associated with Key
      Data = kai_store:get(Key),
      % Stores Data, which is a variable of the data record
      kai_store:put(Data).
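    - To show how little ets needs, here is a hedged sketch of an ets-backed get/put in the spirit of kai_store, using the data record from the Miscellaneous slide later in this deck; the table name, function names, and the absence of gen_server plumbing are simplifications:

      -record(data, {key, bucket, last_modified, checksum, flags, value}).

      init_store() ->
          %% Key the table on the record's key field
          ets:new(store_sketch, [set, named_table, {keypos, #data.key}]).

      store_put(Data) when is_record(Data, data) ->
          true = ets:insert(store_sketch, Data),
          ok.

      store_get(Key) ->
          case ets:lookup(store_sketch, Key) of
              [Data] -> Data;
              []     -> undefined   % matches kai_store's 'undefined' convention
          end.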
  • 37. Kai: kai_api
    - Base module
      - gen_tcp
    - Current status
      - Internal API
        - RPC for kai_hash, kai_store, and kai_network
    - Future work
      - Process pool
        - Restricts the max # of processes receiving API calls
      - Connection pool
        - Re-uses TCP connections
    - Synopsis
      kai_api:start_link(),
      % Retrieves the node list from Node
      {node_list, ListOfNodes} = kai_api:node_list(Node),
      % Retrieves the Data associated with Key from Node
      Data = kai_api:get(Node, Key).
  • 38. Kai: kai_memcache
    - Base module
      - gen_tcp
    - Current status
      - get, set, and delete of the memcache API are supported
        - exptime of set must be zero, since Kai is persistent storage
        - get returns multiple values if multiple versions are found
    - Future work
      - cas and stats
      - Process pool
        - Restricts the max # of processes receiving API calls
    - Synopsis, in Ruby
      require 'memcache'
      cache = MemCache.new '127.0.0.1:11211'
      # Stores 'value' with 'key'
      cache['key'] = 'value'
      # Retrieves the data associated with 'key'
      p cache['key']
  • 39. Kai: Testing
    - Execution
      - Run with make test
    - Implementation
      - common_test
        - Erlang's built-in testing framework
      - Test servers
        - Currently written from scratch
        - Can test_server be used?
  • 40. Kai: Miscellaneous
    - Node ID
      % Nodes are identified by the socket address of their internal API
      {Addr, Port} = {{192,168,1,1}, 11011}.
    - Data structures
      % This record includes the value as well as metadata
      -record(data, {key, bucket, last_modified, checksum, flags, value}).
      % This one holds only metadata: key, bucket #, and MD5 checksum
      -record(metadata, {key, bucket, last_modified, checksum}).
    - Return values
      % Returns 'undefined' if no data is associated with Key
      undefined = kai_store:get(Key).
      % Returns Reason tagged with 'error' when something goes wrong
      {error, Reason} = function(Args).
  • 41. Kai: Roadmap
    1. Basic implementation
       - The current version
    2. Almost Dynamo

      Module          | Task
      ----------------|------------------------------------------
      kai_hash        | Parallel access for read operations
      kai_coordinator | Route client requests to coordinators
      kai_version     | Vector clocks
      kai_sync        | Bulk and parallel transport
      kai_store       | Persistent storage
      kai_api         | Process pool
      kai_memcache    | Process pool and cas
  • 42. Kai: Roadmap, cont'd
    3. Dynamo

      Module         | Task
      ---------------|--------------------
      kai_hash       | Physical placement
      kai_membership | Chord or Kademlia
      kai_sync       | Merkle trees
      kai_store      | Delayed deletion
      kai_api        | Connection pool
      kai_memcache   | stats

    - Development environment
      - configure, test_server
  • 43. Conclusion
    - Join us if you are interested!
      http://sourceforge.net/projects/kai/