uses Google File System (GFS) [1] to store data and log information. GFS divides files into fixed-size chunks and balances the data load by distributing those chunks evenly across different nodes. However, GFS uses a primary-copy, pessimistic algorithm to synchronize data between replica nodes, which makes it scale and perform poorly in write-intensive and wide-area scenarios.
Dynamo is a highly available, eventually consistent key-value storage system that uses Consistent Hashing to increase scalability, Vector Clocks for reconciliation, and Sloppy Quorum with Hinted Handoff to handle temporary failures. In Dynamo, all nodes and all data-item keys are hashed, and the resulting values are mapped onto a "ring". A node's hash determines its position in the ring, while a key's hash decides which node stores that item. To balance the data load, Dynamo maps each physical node to many virtual nodes, each occupying one position in the ring; different nodes may be given different numbers of virtual nodes according to their capacity.
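To make the virtual-node scheme concrete, the following Java sketch shows one minimal way to implement consistent hashing with virtual nodes. It is illustrative only, not Dynamo's actual code; the class name, the MD5-based hash, and the helper names are our own choices.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring with virtual nodes (illustrative only).
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Map one physical node to 'virtualNodes' positions on the ring;
    // higher-capacity nodes can be given more virtual nodes.
    public void addNode(String node, int virtualNodes) throws Exception {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // A key is stored on the first node found walking clockwise from its hash.
    public String nodeFor(String key) throws Exception {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        long position = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(position);
    }

    // Derive a 64-bit ring position from the first 8 bytes of an MD5 digest.
    private static long hash(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                                .digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
        return h;
    }
}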
PNUTS, a geographically distributed database system, uses pub/sub messaging to preserve the order of updates to a key and provides per-record timeline consistency, which lies between strict serializability and eventual consistency. Similar to Bigtable, PNUTS uses a centralized router to look up the node responsible for a given key, and it divides data into many fixed-size tablets. It can move tablets from overloaded nodes to lightly loaded ones to balance the data load, and it dynamically changes the node a client connects to in order to reduce forwarded requests.
ecStore [9] is a cloud-based elastic storage system that supports automatic data partitioning and replication, load balancing, efficient range queries, and transactional access. It uses a stratum architecture: BATON-tree-based data partitioning at the bottom layer for high scalability, two-tier load-adaptive replication in the middle layer for load balancing and high availability, and multi-version optimistic concurrency control at the top layer for data consistency. It uses data partitioning to address data skew and handles request skew by adding secondary replicas for hotspot data. ecStore uses primary-copy optimistic replication and provides adaptive read consistency. However, if the chosen read consistency value equals the number of replicas, meaning the system does not reply to the client until the data has been propagated to all replicas, the result is the same as primary-copy pessimistic replication and may perform poorly; otherwise, stale data may be returned when a recent update has not yet been synchronized to the nodes being accessed.
S. Bianchi et al. [10] study the load of a P2P system under biased request workloads. They observe that such systems carry a heavy lookup traffic load, including load on the intermediate nodes responsible for forwarding accesses to the target node. Based on this, the authors propose reorganizing routing tables to reduce the forward request load, and using caches and data replicas to reduce the local request load, thereby balancing the traffic load. Since the experiments require hundreds of nodes, the authors evaluated their approach only in simulation.
M. Abdallah et al. [11] propose a load balancing mechanism that takes into account both data popularity and node heterogeneity. It defines the load measure as a function of the number of queries issued on data items per time frame and then balances the system's load by adjusting the DHT structure so that it best captures query load distributions and node heterogeneity. However, this system does not consider data replication. It also assumes that the forwarded request and the response message impose the same load, which may not be correct.

III. SYSTEM DESIGN

Cassandra is inspired by Bigtable and Dynamo: it integrates Bigtable's Column-Family-based data model with Dynamo's eventual consistency behavior, and thus inherits the advantages of both.

As with Dynamo, Cassandra uses Consistent Hashing to partition and distribute data across nodes by hashing both the nodes and the data items onto a "ring". To improve availability and balance the load, every data item is replicated on N nodes, where N is a replication number that can be configured in advance. Cassandra first assigns each data item a Coordinator Node, which is the first node the item meets when walking clockwise around the ring, and then replicates the item to the next N-1 clockwise successor nodes. To trade between strong consistency and high performance, Cassandra provides different consistency level options for both read and write operations. For writes, Consistency.One means the operation is routed only to the closest replica node; Consistency.Quorum means the system routes the request to a quorum, usually N/2+1 nodes, and waits for their responses; Consistency.All means the request is routed to all N replica nodes and waits for all of their responses. For reads, the operation is routed to all replicas, but the system waits for only a specific number of responses; the others are received and handled asynchronously. For Consistency.One this number is 1, for Consistency.Quorum it is N/2+1, and for Consistency.All it is N.
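The placement rule above can be sketched in Java as follows. This is our reconstruction of the described behavior, not Cassandra's source; the class and method names are hypothetical, and we assume the ring holds at least n distinct nodes.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative replica selection on a token ring (not Cassandra's actual code).
// Assumes the ring contains at least n distinct physical nodes.
public class ReplicaPlacement {
    // Walk clockwise from the key's token: the first node found is the
    // Coordinator Node, followed by the next n-1 distinct successors.
    public static List<String> replicasFor(long keyToken,
                                           TreeMap<Long, String> ring, int n) {
        List<String> replicas = new ArrayList<>();
        Long token = ring.ceilingKey(keyToken);
        if (token == null) token = ring.firstKey();      // wrap around the ring
        while (replicas.size() < n) {
            String node = ring.get(token);
            if (!replicas.contains(node)) replicas.add(node);
            token = ring.higherKey(token);
            if (token == null) token = ring.firstKey();  // wrap around again
        }
        return replicas;
    }
}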
In this section we describe how we improve Cassandra.

A. Minimize forward request load

To use a Cassandra cluster, a client application needs to connect to a node within the cluster; we call this node the Connected Node. When the client reads or writes a data item, if the Connected Node is not one of the replica nodes responsible for that item, it has to forward the request to one or more nodes, wait for their responses, and finally reply to the client. Since most applications exhibit request skew, the total forward request load differs depending on which node is chosen as the Connected Node. However, the client often does not know in advance which node has the least forward request load. And even if the client initially connects to the most loaded node, users' access patterns change as time goes by, and the hotspot data changes with them. For these two reasons, we need to change this configuration dynamically to minimize the forward request load.

In Cassandra, we encounter three kinds of forward request (a sketch deriving the corresponding response counts follows the list):

• K1: The request's consistency level is Consistency.One. In this situation the Connected Node forwards the request only to the closest replica and waits for its response.

• K2: The request is a read and its consistency level is not Consistency.One. The Connected Node forwards two types of request: a read request routed to its closest replica, which returns the whole message to the Connected Node, and read digest requests sent to the other replicas, which return only a message digest. After receiving all responses, the system digests the full message and compares it with the digests from the other replicas to check that they hold the same version.

• K3: The request is a write and its consistency level is not Consistency.One. The Connected Node forwards the request and waits for blockNumber write responses according to the consistency level, where blockNumber is N/2+1 for Consistency.Quorum and N for Consistency.All.
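Under our reading of the behavior above, the number of responses a request blocks for (blockNumber) can be derived from the consistency level as in the following sketch; the enum and method names are ours, not Cassandra's API.

// Illustrative mapping from consistency level to the number of replica
// responses a request blocks for (names are ours, not Cassandra's API).
public class BlockCount {
    public enum Consistency { ONE, QUORUM, ALL }

    // n is the replication factor configured for the cluster.
    public static int blockNumber(Consistency level, int n) {
        switch (level) {
            case ONE:    return 1;
            case QUORUM: return n / 2 + 1;
            case ALL:    return n;
            default:     throw new IllegalArgumentException("unknown level");
        }
    }
}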
The benefit of shifting the Connected Node from the original node to one of the replicas responsible for an item differs across these kinds of forward request. In K1, if the Connected Node is one of the replicas, it does not need to wait for any message from remote nodes, which improves its response time considerably. In K2, it still needs to wait for read digest messages from the other replicas, but the Connected Node can handle the read request itself. In K3, it still needs to wait for blockNumber-1 write responses from the other replicas, so compared with the other two situations the improvement is limited.
Based on this observation, we propose our improvement. The idea is to record all nodes' request load in the Connected Node and to assign a different weight to each kind of request. At a specified interval, the system compares the maximum request load with that of the Connected Node; if it is much larger, the Connected Node is changed to that node. Fig. 1 and Fig. 2 give the pseudocode.

// Runs in the Connected Node
nodes ← findNodes(key)
for node ∈ nodes do
    load ← baseLoad                            // default weight (a K3 write)
    if blockNumber equals 1 then
        load ← baseLoad * weightOne            // K1 request
    end if
    if blockNumber greater than 1 and isReadOperation() then
        load ← baseLoad * weighRead            // K2 request
    end if
    addLoad(node, load)
end for

Figure 1. Record each node's request load

maxNode ← maxLoadNode()
if getLoad(maxNode) - getLoad(connectedNode) greater than
        changeFactor * totalClusterLoad() then
    changeConnectedNode(maxNode)
end if

Figure 2. Change the Connected Node
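For readers who prefer compilable code, the following Java sketch combines Figures 1 and 2. The weights follow the figures and Section IV, while the class and the scaffolding around the figures' helper names are our own assumptions, not the actual implementation.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One possible Java rendering of Figures 1 and 2 (scaffolding is ours).
public class ForwardLoadTracker {
    private final Map<String, Double> load = new HashMap<>();
    private final double baseLoad = 1.0;     // K3: partially synchronous write
    private final double weightOne = 2.0;    // K1: fully local, biggest benefit
    private final double weighRead = 1.2;    // K2: local read, remote digests
    private final double changeFactor = 0.05;

    // Figure 1: record a weighted load for every replica of the key.
    public void record(List<String> replicaNodes, int blockNumber, boolean isRead) {
        double l = baseLoad;
        if (blockNumber == 1)      l = baseLoad * weightOne;  // K1
        else if (isRead)           l = baseLoad * weighRead;  // K2
        for (String node : replicaNodes) {
            load.merge(node, l, Double::sum);
        }
    }

    // Figure 2: switch the Connected Node when the most loaded node leads
    // the current one by more than changeFactor of the total cluster load.
    public String maybeSwitch(String connectedNode) {
        String maxNode = connectedNode;
        double total = 0;
        for (Map.Entry<String, Double> e : load.entrySet()) {
            total += e.getValue();
            if (e.getValue() > load.getOrDefault(maxNode, 0.0)) maxNode = e.getKey();
        }
        if (load.getOrDefault(maxNode, 0.0) - load.getOrDefault(connectedNode, 0.0)
                > changeFactor * total) {
            return maxNode;      // the caller reconnects to this node
        }
        return connectedNode;
    }
}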
B. Consider node storage capacity when balancing data among nodes

To balance the data load across nodes, Cassandra monitors each node's data load information on the ring. If a node is found to be overloaded, the system alleviates its load by moving its position along the ring. The detailed checking and moving algorithms are described in [12].
However, Cassandra assumes each node has the same storage capacity: it only monitors the storage size each node has used and then uses this information to judge whether an overloaded node exists. In reality, the commodity server nodes within one Cassandra cluster may have different storage capacities, which also need to be taken into account when balancing the data.
To maximize each node's storage utilization, we propose an enhanced data balancing algorithm. We compare each node's local storage used ratio against the whole cluster's average storage used ratio. If it is larger than moveRatio (a configurable threshold) times the average, we say the local node is overloaded. When a node is found to be overloaded, we select as candidates those nodes with enough free space that both they and the overloaded node would stay below moveRatio times the average used ratio after the overloaded node's position is moved. If there is more than one candidate, we select the one with the minimum original used ratio as the target node and move the overloaded node beside it to balance the data between them.

Since our balancing algorithm is based on the average used ratio, two nodes with very different total storage capacities can have similar used ratios yet very different used storage sizes. That means one node stores far more data items than the other, which often also means it carries a much higher request load. We address this potential issue with a variable called allowCapacityRatio: for any node whose total storage is larger than allowCapacityRatio times the smallest node's total storage, we use this maximum allowed capacity in place of its real total capacity.

localUsedRatio ← localUsedSize / localTotalSize
averageUsedRatio ← getClusterAverageUsedRatio()
if localUsedRatio greater than moveRatio * averageUsedRatio then
    candidateNodes ← all nodes whose (usedSize + localUsedSize) /
        (totalSize + localTotalSize) is less than moveRatio * averageUsedRatio
    targetNode ← minUsedRatio(candidateNodes)
    move the local node so that newLocalUsedRatio equals newTargetNodeUsedRatio
end if

Figure 3. Storage balance algorithm
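The following Java sketch illustrates Figure 3 together with the allowCapacityRatio cap described above. The Node type, the way the cluster average is aggregated from capped capacities, and the method names are simplifying assumptions of ours; the actual token-move step is left to the caller.

import java.util.ArrayList;
import java.util.List;

// Compilable sketch of Figure 3 plus the allowCapacityRatio cap.
public class StorageBalancer {
    static class Node {
        final String name;
        final double usedSize, totalSize;
        Node(String name, double used, double total) {
            this.name = name; this.usedSize = used; this.totalSize = total;
        }
    }

    private final double moveRatio = 1.5;           // overload threshold factor
    private final double allowCapacityRatio = 2.0;  // capacity cap factor

    // Cap a node's capacity at allowCapacityRatio times the smallest node's
    // capacity, so one huge disk does not attract a disproportionate share.
    private double cappedTotal(Node n, double minTotal) {
        return Math.min(n.totalSize, allowCapacityRatio * minTotal);
    }

    // Returns the target node to move the local node next to, or null if the
    // local node is not overloaded or no candidate has enough free space.
    public Node pickTarget(Node local, List<Node> others) {
        double minTotal = local.totalSize;
        for (Node n : others) minTotal = Math.min(minTotal, n.totalSize);

        double usedSum = local.usedSize;
        double totalSum = cappedTotal(local, minTotal);
        for (Node n : others) {
            usedSum += n.usedSize;
            totalSum += cappedTotal(n, minTotal);
        }
        double averageUsedRatio = usedSum / totalSum;
        double localRatio = local.usedSize / cappedTotal(local, minTotal);
        if (localRatio <= moveRatio * averageUsedRatio) return null; // not overloaded

        // Candidates: nodes that stay under the threshold even after
        // absorbing the local node's data alongside their own.
        List<Node> candidates = new ArrayList<>();
        for (Node n : others) {
            double merged = (n.usedSize + local.usedSize)
                    / (cappedTotal(n, minTotal) + cappedTotal(local, minTotal));
            if (merged < moveRatio * averageUsedRatio) candidates.add(n);
        }
        Node target = null;
        for (Node n : candidates) {
            double r = n.usedSize / cappedTotal(n, minTotal);
            if (target == null
                    || r < target.usedSize / cappedTotal(target, minTotal)) {
                target = n;   // minimum original used ratio wins
            }
        }
        return target;        // caller moves the local node beside this target
    }
}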
IV. EXPERIMENT

We ran a series of experiments to evaluate the results. The base Cassandra version we use is 0.6.4. We set up a cluster of 6 commodity server nodes; all nodes are within the same LAN and the same rack.
We made some changes to the TPC-W benchmark application and used it against our system. The requests in TPC-W follow a Zipf distribution, which is widely used in the web application domain to simulate users' real access patterns.
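For reproducibility, a request stream with Zipf-distributed item popularity can be generated along these lines. This is a generic sketch, not the modified TPC-W code itself, and the skew parameter is an assumption.

import java.util.Random;

// Generic Zipf sampler over item ranks (not the TPC-W code itself).
public class ZipfGenerator {
    private final double[] cdf;   // cumulative probabilities over item ranks
    private final Random rng = new Random();

    public ZipfGenerator(int items, double skew) {
        cdf = new double[items];
        double norm = 0;
        for (int rank = 1; rank <= items; rank++) norm += 1.0 / Math.pow(rank, skew);
        double cum = 0;
        for (int rank = 1; rank <= items; rank++) {
            cum += (1.0 / Math.pow(rank, skew)) / norm;
            cdf[rank - 1] = cum;
        }
    }

    // Returns a 0-based item index; low indices are the hot items.
    public int next() {
        double u = rng.nextDouble();
        for (int i = 0; i < cdf.length; i++) {
            if (u <= cdf[i]) return i;
        }
        return cdf.length - 1;
    }
}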
A. The result of minimizing forward request load

The criterion we use is the total number of forwarded requests for each node. We assign the following values to the variables described in Fig. 1: baseLoad = 1, weightOne = 2, weighRead = 1.2, so a K1 request counts as 2, a K2 request as 1.2, and a K3 request as 1. For Fig. 2, we set changeFactor = 5% and have the system check once a day whether the Connected Node needs to be changed.
To see how the replication factor and different consistency levels affect the result, we set up three rounds of tests. Each round runs for 24 hours, and the request numbers in the following tables are in units of ten thousand.
1) Replication Factor = 3, W = 1, R = 1

In this round, each data item has 3 replicas distributed over 3 different nodes, and all operations' consistency levels are set to Consistency.One. This means a write operation touches only one node, and a read operation waits for only one response while the other responses are received asynchronously.

We connect the client to Node3 at first. Since its Factor Diff is 1.8%, which is smaller than changeFactor, the Connected Node is not changed after running the algorithm; but from Table I we can see that if we connect to Node1 or Node2 at first, the Connected Node is changed to Node4. In this round there are no K2 and K3 requests, since all operations use Consistency.One. Factor Diff is defined as:

Factor Diff_i = (load_max - load_i) / load_total

where load_i is node i's total request load, load_max is the maximum total load among all nodes, and load_total is the sum of the total loads over all nodes.
Table II shows that if the Connected Node is Node1 at first, changing it reduces the forward read requests by 34.8% and the forward write requests by 31%. For Node2, the values are 29.4% and 25.7% respectively. There are no read digest requests that need to be forwarded synchronously in this round.
TABLE I. REQUEST LOAD DIFFERENCE IN ROUND 1

              Node1   Node2   Node3   Node4   Node5   Node6
K1 Request     232     256     315     349     323     266
Total Load     464     512     630     698     646     532
Factor Diff    6.4%    5.0%    1.8%    0%      1.4%    4.5%
TABLE II. FORWARD REQUEST REDUCED RATIO IN ROUND 1

          Forward Read   Forward Write   Read Reduced   Write Reduced
          request        request         Ratio          Ratio
Node1     236            113             34.8%          31%
Node2     218            105             29.4%          25.7%
Node4     154            78              0%             0%
2) Replication Factor = 2, W = 1, R = 1

In this round we change the replication factor to 2; the purpose is to see how the number of replicas affects our algorithm. From Table III we can see that if we connect our TPC-W client application to Node1, Node2, or Node3 at first, the Connected Node is changed to Node5 after running the algorithm (e.g., for Node1, Factor Diff = (494 - 294) / 2324, which is about 8.6%). Table IV tells us that if the Connected Node is shifted from Node1 to Node5, the forward read requests are reduced by 25.2% and the forward write requests by 18%. When it is changed from Node2 to Node5, the results are 22.5% and 19.2% respectively; for Node3 they are 17% and 11.6%.

Comparing with the results of Round 1, we find that when the rest of the configuration remains the same, the smaller the number of replicas, the more likely the Connected Node is to be changed; but for each change, the improvement is smaller than what we obtained in Round 1.

TABLE III. REQUEST LOAD DIFFERENCE IN ROUND 2

              Node1   Node2   Node3   Node4   Node5   Node6
K1 Request     147     156     187     230     247     195
Total Load     294     312     374     460     494     390
Factor Diff    8.6%    7.8%    5.2%    1.5%    0%      4.5%

TABLE IV. FORWARD REQUEST REDUCED RATIO IN ROUND 2

          Forward Read   Forward Write   Read Reduced   Write Reduced
          request        request         Ratio          Ratio
Node1     295            139             25.2%          18%
Node2     284            141             22.5%          19.2%
Node3     265            129             17.0%          11.6%
Node5     220            114             0%             0%

3) Replication Factor = 3, W = 2, R = 2

In Round 3 we change both the read and write consistency levels to Consistency.Quorum to see how this affects our algorithm. Since no request uses Consistency.One, there are no K1 requests. Table V presents the detailed results, from which we can see that if the client application first connects to Node1 or Node2, the Connected Node needs to be changed to Node4. As shown in Table VI, the read reduced ratio is the same as in Round 1, while the write reduced ratio is smaller than in Round 1. That means that for the same number of replicas, the stricter the consistency level, the smaller the improvement.

TABLE V. REQUEST LOAD DIFFERENCE IN ROUND 3

              Node1   Node2   Node3   Node4   Node5   Node6
K2 Request     154     171     213     236     218     177
K3 Request     78      84      102     113     105     89
Total Load     262.8   289.2   357.6   396.2   366.6   301.4
Factor Diff    6.8%    5.4%    2.0%    0%      1.5%    4.8%

TABLE VI. FORWARD REQUEST REDUCED RATIO IN ROUND 3

          Forward Read   Forward Read     Forward Write   Read Reduced   Write Reduced
          request        Digest request   request         Ratio          Ratio
Node1     236            390              304             34.8%          11.5%
Node2     218            390              296             29.4%          9.1%
Node4     154            390              269             0%             0%
B. The result of considering storage capacity

In this experiment we set moveRatio = 1.5 and allowCapacityRatio = 2.

First we run TPC-W for a long time to populate the cluster nodes with enough data. Then we run the load balance script command several times. In round one we use Cassandra's original load balance algorithm; in round two we use ours. Fig. 4 and Fig. 5 present the results, where LB1 is the load balance algorithm provided by Cassandra and LB2 is our algorithm.
From Fig. 4 we can see that although LB2 does not distribute data across the nodes as evenly as LB1, it still has a significant effect compared with the original load distribution.

Figure 4. Storage size used by each node

From Fig. 5 it is obvious that our algorithm utilizes the different nodes' storage capacities much better than the original one. As the utilization is more balanced, the whole cluster can store more data, which means the storage utilization is improved.

Figure 5. Storage utilization of each node

V. CONCLUSIONS AND FUTURE WORK

In this paper we have presented two ways to improve Cassandra so that it is aware of request skew and takes the capacities of different nodes into account when balancing the storage load.

First, we propose an algorithm that minimizes forward requests by dynamically shifting the Connected Node to the node that can handle the maximum number of requests locally.

Second, we present a new idea that improves the storage utilization of each node by balancing data storage on used ratio instead of used size.

We then ran several experiments to evaluate the effectiveness of our approach. The results show that in all the scenarios considered, both forward read requests and forward write requests are reduced substantially. The experiments also show that storage utilization becomes noticeably more balanced and improved.

For now we assume all nodes are within the same datacenter; we will extend our research to multiple datacenters in the future. Also, currently all data has the same number of replicas; as a next step, we plan to add additional adaptive replicas for nodes that contain hotspot data.
REFERENCES

[1] S. Ghemawat, H. Gobioff and S. Leung, "The Google File System", in 19th Symposium on Operating Systems Principles, Lake George, New York, 2003, pp. 29-43.
[2] F. Chang et al., "Bigtable: A distributed storage system for structured data", in Proc. OSDI, 2006, pp. 205-218.
[3] G. DeCandia et al., "Dynamo: Amazon's highly available key-value store", in Proc. SOSP, 2007, pp. 205-220.
[4] B. F. Cooper et al., "PNUTS: Yahoo!'s hosted data serving platform", Proc. VLDB Endow., vol. 1, pp. 1277-1288, August 2008.
[5] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system", SIGOPS Oper. Syst. Rev., vol. 44, pp. 35-40, 2009.
[6] M. Stonebraker, "SQL databases v. NoSQL databases", Commun. ACM, vol. 53, pp. 10-11, April 2010.
[7] N. Leavitt, "Will NoSQL databases live up to their promise?", Computer, vol. 43, pp. 12-14, February 2010.
[8] E. A. Brewer, "Towards robust distributed systems", in Principles of Distributed Computing, Portland, Oregon, July 2000.
[9] H. T. Vo, C. Chen and B. C. Ooi, "Towards elastic transactional cloud storage with range query support", Proc. VLDB Endow., vol. 3, pp. 506-517, 2010.
[10] S. Bianchi, S. Serbu, P. Felber and P. Kropf, "Adaptive load balancing for DHT lookups", in ICCCN, 2006, pp. 411-418.
[11] M. Abdallah and E. Buyukkaya, "Fair load balancing under skewed popularity patterns in heterogeneous DHT-based P2P systems", in International Conference on Parallel and Distributed Computing and Systems, 2007, pp. 484-490.
[12] M. Abdallah and H. C. Le, "Scalable range query processing for large-scale distributed database applications", in Proc. Int'l Conf. Parallel and Distributed Computing Systems (PDCS), 2005.