16. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion Background Erasure Codes Regenerating Codes
. Reed-Solomon Codes (RAID-6) [6]
Theorem (necessary and sufficient condition)
Every possible k × k submatrix obtained by removing (n − k) rows from EM′ has full rank.
Equivalent expressions of full rank:
rank = k
non-singular
Alternative view:
Consider the linear space spanned by $P = [P_i]_{i=1,2,\ldots,n} = [F_1, F_2, \ldots, F_k, C_1, C_2, \ldots, C_{n-k}]$. Its dimension is k, and any k of the n vectors form a basis of this space.
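The full-rank condition can be checked by brute force. The following is a minimal sketch, not the presenter's or NCCloud's code: it assumes arithmetic over a small prime field GF(p) for readability (practical Reed-Solomon implementations typically work over GF(2^8)), and the names `rank_mod_p` and `is_mds` as well as the example matrices are hypothetical.

```python
from itertools import combinations


def rank_mod_p(rows, p):
    """Rank of a matrix over GF(p), computed by Gaussian elimination."""
    m = [[x % p for x in row] for row in rows]
    rank = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)  # modular inverse (Python 3.8+)
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                f = m[r][col]
                m[r] = [(a - f * b) % p for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def is_mds(em, k, p):
    """True if every k-row submatrix of the n x k matrix `em` has rank k."""
    return all(
        rank_mod_p([em[i] for i in idx], p) == k
        for idx in combinations(range(len(em)), k)
    )


# Assumed example: n = 4, k = 2 over GF(7).
vandermonde = [[1, a] for a in (1, 2, 3, 4)]   # any two rows are independent
bad = [[1, 0], [0, 1], [1, 1], [2, 2]]         # last two rows are dependent
print(is_mds(vandermonde, 2, 7))  # True
print(is_mds(bad, 2, 7))          # False
```

The check enumerates all k-subsets of the n rows, which is the same enumeration whose cost the later slides discuss.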
presented by Shuai YUAN FAST’12: NCCloud 7/20
34. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion the Paper’s Contributions F-MSR Code
. File Repair
Simulations
Consider multiple rounds of permanent node failures for different values of n. In each round, we randomly pick a node to permanently fail and trigger a repair.
Simulation result
If the loop of Steps 2 to 5 is repeated more than 10 times, the repair is counted as a bad repair.
When only the MDS property is checked, a bad repair occurs very quickly: after no more than 7 rounds for n = 8 and no more than 2 rounds for n = 12.
38. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion the Paper’s Contributions F-MSR Code
. File Repair
Cost of two-phase checking
MDS property check: enumerate all $C_n^k$ subsets of k nodes out of the n nodes and check that each corresponding encoding matrix has full rank.
Repair MDS property check: for each possible failed node (out of n nodes), we pick any one of the (n − k) chunks from each of the other (n − 1) surviving nodes for reconstruction, so the cost is $n (n-k)^{n-1} C_n^k$.
The current repair requires more checks, but bad repairs will be rare in subsequent repairs.
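To get a feel for how quickly the checking cost grows, the sketch below simply evaluates the two counts, reading the repair-check cost as n · (n − k)^(n−1) · C(n, k). The function names are hypothetical, and the counts are numbers of rank checks, not running times.

```python
from math import comb


def mds_check_count(n, k):
    """Number of k-node subsets whose encoding matrix must have full rank."""
    return comb(n, k)


def repair_mds_check_count(n, k):
    """n failed-node choices x (n-k)^(n-1) chunk selections x C(n, k) checks."""
    return n * (n - k) ** (n - 1) * comb(n, k)


for n in (4, 8, 12):
    k = n - 2  # the double-fault-tolerant setting used in the slides
    print(n, k, mds_check_count(n, k), repair_mds_check_count(n, k))
# 4 2 6 192
# 8 6 28 28672
# 12 10 66 1622016
```

The exponential term (n − k)^(n−1) dominates, which is why the cost becomes a concern when n and k are large.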
39. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion Repair Traffic Analysis Cost Analysis Experiments
. Repair Traffic
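Native data size: M
k(n − k) native chunks, each of size $\frac{M}{k(n-k)}$
Repair traffic: $\frac{M}{k(n-k)} \times (n-1) = \frac{M}{k} \times \left(1 + \frac{k-1}{n-k}\right)$
For k = n − 2, repair traffic: $\frac{M}{2} \times \left(1 + \frac{1}{n-2}\right)$
$\lim_{n \to \infty} \text{Repair traffic} = \frac{M}{2}$
F-MSR saves close to 50% of the repair traffic when n is large, relative to conventional repair, which downloads M.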
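As a quick numerical check of the formula, the sketch below evaluates the repair traffic as a fraction of M for a few values of n with k = n − 2. The function name is hypothetical; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def fmsr_repair_fraction(n, k):
    """Repair traffic / M: (n - 1) chunks, each of size M / (k (n - k))."""
    return Fraction(n - 1, k * (n - k))


for n in (4, 8, 12, 100):
    frac = fmsr_repair_fraction(n, n - 2)
    print(n, frac, float(frac))
# n = 4 gives 3/4 and n = 12 gives 11/20; the ratio approaches 1/2 as n grows.
```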
40. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion Repair Traffic Analysis Cost Analysis Experiments
. Cost Analysis
Table: Monthly price plans (in US dollars) for Amazon S3 (US Standard), Rackspace Cloud Files, and Windows Azure Storage, as of September 2011.

                              Amazon S3   Rackspace   Azure
Storage (per GB)              $0.14       $0.15       $0.15
Data transfer in (per GB)     free        free        free
Data transfer out (per GB)    $0.12       $0.18       $0.15
PUT, POST (per 10K requests)  $0.10       free        $0.01
GET (per 10K requests)        $0.01       free        $0.01
41. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion Repair Traffic Analysis Cost Analysis Experiments
. Cost Analysis
Metadata of F-MSR
    Metadata size = 160 B; file size = several MB
Overhead due to GET requests during repair
    Assuming the S3 plan in Sep 2011, n = 4, k = 2, file size = 4 MB
    Conventional repair: 0.427%
    F-MSR repair: 0.854%
Overhead cost is low.
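One plausible reading of these overhead figures is the cost of the GET requests divided by the cost of the data transferred out during repair. The sketch below works through that arithmetic under stated assumptions: S3 prices from the price table, 1 GB = 1024 MB, conventional repair issuing k = 2 GETs for 2 MB chunks, and F-MSR issuing n − 1 = 3 GETs for 1 MB chunks. The function and constant names are hypothetical.

```python
GET_PRICE_PER_10K = 0.01    # USD per 10K GET requests (S3, Sep 2011)
TRANSFER_OUT_PER_GB = 0.12  # USD per GB transferred out (S3, Sep 2011)
MB_PER_GB = 1024            # assumption: binary gigabytes


def get_overhead_percent(num_chunks, chunk_mb):
    """GET-request cost as a percentage of the transfer-out cost."""
    request_cost = num_chunks * GET_PRICE_PER_10K / 10_000
    transfer_cost = num_chunks * chunk_mb / MB_PER_GB * TRANSFER_OUT_PER_GB
    return 100 * request_cost / transfer_cost


# n = 4, k = 2, file size M = 4 MB
conventional = get_overhead_percent(num_chunks=2, chunk_mb=2.0)  # k chunks of M/k
fmsr = get_overhead_percent(num_chunks=3, chunk_mb=1.0)  # n-1 chunks of M/(k(n-k))
print(f"{conventional:.3f}% {fmsr:.3f}%")  # about 0.427% and 0.853%
```

Under these assumptions the result lands within rounding of the quoted figures; the F-MSR overhead is twice the conventional one because its chunks are half the size, so each GET request fetches half as much data.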
42. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion Repair Traffic Analysis Cost Analysis Experiments
. Experiments
NCCloud deployment
    Single machine connected to a cloud-of-clouds
    n = 4, k = 2
Coding schemes
    Reed-Solomon-based RAID-6 vs. F-MSR
Metric
    Response time
Cloud environments
    Local cloud: OpenStack Swift
    Commercial cloud: multiple containers in Azure
43. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion Repair Traffic Analysis Cost Analysis Experiments
. Response Time: Local Cloud
F-MSR has a higher response time due to encoding/decoding overhead.
F-MSR has a slightly lower response time in repair, because it downloads less data.
44. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion Repair Traffic Analysis Cost Analysis Experiments
. Response Time: Commercial Cloud
No distinct response time difference, as network fluctuations play a bigger role in actual response time.
45. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion
. Conclusion & Unsolved Problems
Conclusion:
Propose an implementable design of F-MSR:
Preserve storage cost.
Use less repair traffic.
Do not require storage nodes to have encoding capabilities.
Build NCCloud, which realizes F-MSR
Source code:
http://ansrlab.cse.cuhk.edu.hk/software/nccloud/
46. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion
. Conclusion & Unsolved Problems
Unsolved problems:
Repair cost is high when n and k are large: as mentioned earlier, F-MSR uses two-phase checking, which incurs a high checking cost in the current repair phase. Just as Reed-Solomon codes use a Vandermonde matrix to guarantee the MDS property, a better algorithm is still needed to replace the check-after-trying approach.
F-MSR downloads chunks from all (n − 1) surviving nodes when repairing a file because of an argument in [1]: the more nodes we download chunks from, the lower the repair traffic. However, the conclusion in [1] assumes a homogeneous model, whereas NCCloud's multi-cloud setting is a heterogeneous environment, so this basis may not hold.
47. 1. Background – 2. Solution – 3. Evaluation – 4. Conclusion
[1] A.G. Dimakis, P.B. Godfrey, Y. Wu, M.J. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. IEEE Transactions on Information Theory, 56(9):4539–4551, 2010.
[2] A.G. Dimakis, K. Ramchandran, Y. Wu, and C. Suh. A survey on network codes for distributed storage. Proceedings of the IEEE, 99(3):476–489, 2011.
[3] A. Duminuco and E. Biersack. A practical study of regenerating codes for peer-to-peer backup systems. In 29th IEEE International Conference on Distributed Computing Systems (ICDCS'09), pages 376–384. IEEE, 2009.
[4] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure coding in Windows Azure Storage. In USENIX Annual Technical Conference (USENIX ATC), 2012.
[5] J.S. Plank et al. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software Practice and Experience, 27(9):995–1012, 1997.
[6] I.S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the Society for Industrial & Applied Mathematics, 8(2):300–304, 1960.
[7] B. Sklar. Reed-Solomon codes. Downloaded from http://www.informit.com/content/images/art_sklar7_reed-solomon/elementLinks/art_sklar7_reed-solomon.pdf (unknown publication date), pages 1–33, 2001.