How swift is your Swift?


Ning Zhang, OpenStack Engineer at Zmanda
Chander Kant, CEO at Zmanda



                                           1
Outline
     Build a cost-efficient Swift cluster with expected performance
        Background & Problem
        Solution
        Experiments
     When something goes wrong in a Swift cluster
        Two Types of Failures: Hard Drive, Entire Node
         What is the performance degradation before the failures are fixed?
         How soon will the data be back (when all failed nodes are back on-line)?
        Experiments


2
Zmanda

     Leader in Open Source Backup and Cloud Backup
     We have seen strong interest in integrating our cloud backup
      products with OpenStack Swift
     Backup to OpenStack Swift
        Alternative to tape based backups
     Swift Installation and Configuration Services

3
Background
     Public Storage Cloud
        Pros: pay-as-you-go, low upfront cost …
        Cons: expensive in the long run, performance is not clear …

     Private Storage Cloud (use case: backup data to private cloud by Zmanda
      products)
        Pros: low TCO in the long run, expected performance, in-house data …
        Cons: high upfront cost, long ramp-up period (prepare and tune HW & SW)

     Open Problem / Challenge:
        How to build private cloud storage with …
        Low upfront cost, expected performance, and a short ramp-up period

4
Background
     Swift is an open-source object store running on commodity HW
        High scalability (linear scale-out as needed)
        High availability (3 copies of data)
        High durability

     Swift has heterogeneous types of nodes




    Proxy – Swift’s brain (coordinates requests, handles failures …)
    Storage – Swift’s warehouse (stores objects)

5
Problem
     How to provision the proxy and storage nodes in a Swift cluster for
      expected performance (SLA) while keeping the upfront cost low?

                                                    Hardware: CPU, memory, network, I/O devices …

                                                    Software: filesystem, Swift configuration …




6
Lessons Learnt from the Past


    Proxy node: CPU and network I/O intensive -> high-end CPU, 10 GE networking
    Storage node: disk I/O intensive -> commodity CPU, 1 GE networking
    Filesystem: XFS


    Are these always true, especially for different workloads?
        Does it always pay off to choose 10 GE (expensive!) for proxy nodes?
        Is it always sufficient to use commodity CPU for storage nodes?
        Are storage nodes always disk I/O intensive?
        How much does performance differ between XFS and other filesystems?
        …
7
Solution
     Solution (similar to a “divide and conquer” strategy)
        First, solve the problem in a small Swift cluster (e.g. 2 proxy nodes, 5-15 storage nodes)

       1: For each HW configuration for the proxy node
       2:   For each HW configuration for the storage node
       3:     For each number of storage nodes (5, 10, 15, …)
       4:       For each SW parameter setting
       5:         Build the small Swift cluster, measure its performance, and record its “performance/cost”
       6: Recommend the small Swift clusters with the highest “performance/cost”
       (An exhaustive search in principle, but pruning methods keep it manageable; a sketch follows.)

        Then, scale out the recommended small Swift clusters to large Swift clusters until the SLA is met
            Performance and cost scale accordingly (as long as networking is not a bottleneck)
            The HW & SW settings identified in the small clusters also hold for the large clusters
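A minimal Python sketch of this search, assuming the EC2 hourly prices listed on the hardware slide; deploy_and_benchmark() is a hypothetical placeholder for standing up the small cluster and measuring its throughput (e.g. with a COSBench run):

```python
# Hedged sketch of the configuration search: enumerate small-cluster configurations,
# benchmark each one, and rank by performance/cost. In practice, pruning (skipping
# settings already dominated by a cheaper configuration) cuts the search down.
from itertools import product

PROXY_HW = ["ec2-cluster", "ec2-high-cpu"]
STORAGE_HW = ["ec2-high-cpu", "ec2-large"]
STORAGE_COUNTS = [5, 10, 15]
SW_SETTINGS = [{"fs": fs, "db_preallocation": db}
               for fs in ("xfs", "ext4") for db in ("on", "off")]
HOURLY_COST = {"ec2-cluster": 1.30, "ec2-high-cpu": 0.66, "ec2-large": 0.32}  # $/h, US East
NUM_PROXIES = 2

def deploy_and_benchmark(proxy_hw, storage_hw, n_storage, sw):
    """Hypothetical hook: deploy the small cluster, run the workload, return op/s."""
    raise NotImplementedError

def recommend(top_n=3):
    results = []
    for proxy_hw, storage_hw, n_storage, sw in product(
            PROXY_HW, STORAGE_HW, STORAGE_COUNTS, SW_SETTINGS):
        throughput = deploy_and_benchmark(proxy_hw, storage_hw, n_storage, sw)
        cost = NUM_PROXIES * HOURLY_COST[proxy_hw] + n_storage * HOURLY_COST[storage_hw]
        results.append((throughput / cost, proxy_hw, storage_hw, n_storage, sw))
    results.sort(key=lambda r: r[0], reverse=True)   # highest performance/cost first
    return results[:top_n]
```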



8
Evaluation - Hardware
     Hardware configuration for proxy and storage nodes
          Amazon EC2 (diverse HW resources, no upfront cost, virtualized HW -> physical HW)
         Two hardware choices for proxy node:
             # 1: Cluster Compute Extra Large Instance (EC2 Cluster)
             # 2: High-CPU Extra Large Instance (EC2 High-CPU)
         Two hardware choices for storage node:
             # 1: High-CPU
             # 2: Large Instance (EC2 Large)
                                  Cluster                  High-CPU               Large
              CPU speed           33.5 EC2 Compute Units   20 EC2 Compute Units   4 EC2 Compute Units
              Memory              23 GB                    7 GB                   7.5 GB
              Network             10 GE                    1 GE                   1 GE
              Pricing (US East)   $1.30/h                  $0.66/h                $0.32/h

9
Evaluation – Cost & Software
      Upfront Cost
         EC2 cost ($/hour)
         EC2 cost ≠ physical HW cost, but it is a good indication of physical HW cost

      Software Configuration
         Filesystem
             XFS (recommended by RackSpace)
             Ext4 (popular FS, but not evaluated for Swift)
         Swift Configuration Files
             db_preallocation (it is suggested to set it to on for HDDs, to reduce fragmentation)
         OS settings
             disable TIME_WAIT, disable syn cookies …
             will discuss in our future blog …

10
Evaluation – Workloads
      2 Sample Workloads
                                   Upload workload (GET 5%, PUT 90%, DELETE 5%)
      Small Objects                Example: online gaming hosting service;
      (object size 1 KB – 100 KB)  game sessions are periodically saved as small files.
      Large Objects                Example: enterprise backup;
      (object size 1 MB – 10 MB)   files are compressed into large chunks for backup.
                                   Occasionally, recovery and delete operations are needed.

      Object sizes are randomly and uniformly chosen within the pre-defined range (a generator sketch follows)
      Objects are continuously uploaded to the test Swift clusters

      COSBench – an open-source cloud storage benchmark tool
      Free to define your own workloads in COSBench!
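A minimal sketch of these two workload mixes, assuming object sizes drawn uniformly from the stated ranges; it only mimics what a COSBench workload definition expresses and is not COSBench's own configuration format:

```python
# Hedged sketch of the upload workload: 90% PUT, 5% GET, 5% DELETE, with object
# sizes chosen uniformly at random within the pre-defined range.
import random

SMALL_OBJECTS = (1 * 1024, 100 * 1024)        # 1 KB - 100 KB
LARGE_OBJECTS = (1 * 1024**2, 10 * 1024**2)   # 1 MB - 10 MB
OPS, WEIGHTS = ["PUT", "GET", "DELETE"], [0.90, 0.05, 0.05]

def next_request(size_range):
    """Return the next (operation, object_size_in_bytes) pair for the workload."""
    op = random.choices(OPS, weights=WEIGHTS, k=1)[0]
    size = random.randint(*size_range)        # uniform within the range
    return op, size

# Example: a few requests from the small-object workload
for _ in range(3):
    print(next_request(SMALL_OBJECTS))
```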

11
Evaluation – Upload small objects
      Top-3 recommended hardware for a small Swift cluster
                                       HW for proxy node            HW for storage node           Throughput/$
             Upload Small        1     2 proxy nodes (High-CPU)     5 storage nodes (High-CPU)        151
             Objects             2     2 proxy nodes (Cluster)      10 storage nodes (High-CPU)       135
                                 3     2 proxy nodes (Cluster)      5 storage nodes (High-CPU)        123

      All storage nodes are based on High-CPU
          CPUs are used intensively to handle the large number of requests; CPU is the key resource.
          Compared to the Large Instance (4 EC2 Compute Units, $0.32/h),
          the High-CPU Instance has 20 EC2 Compute Units at $0.66/h (5X the CPU resources for only 2X the cost)

      Proxy node
            Traffic pattern: high throughput, low network bandwidth (e.g. 1250 op/s -> 61 MB/s)
            The 10 GE from the Cluster Instance is over-provisioned for this traffic pattern
            Compared to High-CPU, Cluster has 1.67X the CPU resources but is 2X as expensive
            Besides, 5 storage nodes can almost saturate 2 proxy nodes
12
Evaluation – Upload large objects
      Top-3 recommended hardware for a small Swift cluster
                                      HW for proxy node              HW for storage node           Throughput/$
             Upload Large       1     2 proxy nodes (Cluster)        10 storage nodes (Large)            5.6
             Objects            2     2 proxy nodes (High-CPU)       5 storage nodes (Large)             4.9
                                3     2 proxy nodes (Cluster)        5 storage nodes (Large)             4.7

      All storage nodes are based on Large
          More time is spent transferring objects to the I/O devices. The write request rate is low, so CPU is not the key factor.
          Compared to the High-CPU Instance (20 EC2 Compute Units, $0.66/h),
          the Large Instance has 4 EC2 Compute Units (sufficient) at $0.32/h (half the cost).

      Proxy node
            Traffic pattern: low throughput, high network bandwidth (a quick arithmetic check follows)
            e.g. 32 op/s -> 160 MB/s of incoming and ~500 MB/s of outgoing traffic (writes in triplicate!)
            1 GE from High-CPU is under-provisioned; the 10 GE from Cluster pays off for this workload.
            10 storage nodes are needed to keep up with the 2 proxy nodes (10 GE)
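A quick arithmetic check of these proxy traffic figures, assuming an average object size of about 5 MB (the midpoint of the 1-10 MB range) and Swift's 3 replicas:

```python
# Hedged back-of-the-envelope check of proxy bandwidth for the large-object upload workload.
ops_per_second = 32
avg_object_mb = 5          # assumption: midpoint of the 1 MB - 10 MB range
replicas = 3               # Swift writes 3 copies of each object

incoming_mb_s = ops_per_second * avg_object_mb    # clients -> proxy:  ~160 MB/s
outgoing_mb_s = incoming_mb_s * replicas          # proxy -> storage:  ~480 MB/s (close to the ~500 MB/s observed)

print(incoming_mb_s, outgoing_mb_s)
```

Since a 1 GE link tops out at roughly 125 MB/s, even the incoming traffic alone exceeds 1 GE here, which is why the 10 GE Cluster Instance pays for itself on this workload.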
13
Evaluation – Conclusion for HW
      Take-away points for provisioning HW for a Swift cluster

                                     Hardware for proxy node      Hardware for storage node
             Upload Small Object     1 GE, high-end CPU           1 GE, high-end CPU
             Upload Large Object     10 GE, high-end CPU          1 GE, commodity CPU

      Download workloads: see the backup slides
      Contrary to the lessons learnt from the past:
            It does NOT always pay off to choose 10 GE (expensive!) for proxy nodes
            It is NOT always sufficient to use commodity CPU for storage nodes
            Upload is disk I/O intensive (3 copies of data are written),
            but download is NOT always disk I/O intensive (only one copy of data is retrieved)
14
Evaluation – Conclusion for SW
      Take-away points for provisioning SW for a Swift cluster

                                     db_preallocate          XFS vs. Ext4
            Upload Small Objects           on                    XFS
            Upload Large Objects         on / off             XFS / Ext4



      Uploading small objects (more sensitive to software settings)
          db_preallocation: intensive updates on the container DB; setting it to on yields 10-20% better performance (a configuration sketch follows)
          Filesystem: we observe that XFS achieves 15-20% higher performance than Ext4
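A minimal sketch of flipping this setting on, assuming a standard container-server.conf under /etc/swift with db_preallocation in its [DEFAULT] section; check your own deployment's configuration layout before using anything like this:

```python
# Hedged sketch: enable db_preallocation for an HDD-backed, small-object workload.
# The file path and the [DEFAULT] section are assumptions about a typical Swift
# install; note that rewriting the file this way drops its comments.
import configparser

CONF_PATH = "/etc/swift/container-server.conf"      # assumed location

conf = configparser.ConfigParser(interpolation=None)
conf.read(CONF_PATH)
conf["DEFAULT"]["db_preallocation"] = "on"          # default is off

with open(CONF_PATH, "w") as f:
    conf.write(f)
# Restart the container (and account) servers for the change to take effect.
```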




15
Evaluation – Scale out small cluster
      Workload #1: upload small objects (the same workload used to explore HW & SW configurations for the small Swift cluster)

      Based on the top-3 recommended small Swift clusters
      [Chart: Upfront cost ($/hour) vs. response time (ms) of 80% of requests, for each of the
       top-3 recommended small clusters scaled by 1X (2 proxy, 5 storage), 1.5X, 2X (4 proxy,
       10 storage), and 3X (6 proxy, 15 storage). The lowest-cost points that meet SLA 1
       (80% RT < 600 ms) and SLA 2 (80% RT < 1000 ms) are highlighted.]
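A minimal sketch of this scale-out step, assuming the #1 small cluster (2 High-CPU proxies + 5 High-CPU storage nodes at $0.66/h each) and SLA 1 (80% of requests under 600 ms); measure_p80_response_time() is a hypothetical benchmark hook:

```python
# Hedged sketch: grow a recommended small cluster by the factors explored above
# until the measured 80th-percentile response time meets the SLA.

def measure_p80_response_time(n_proxy, n_storage):
    """Hypothetical hook: run the workload and return the 80th-percentile RT in ms."""
    raise NotImplementedError

def scale_until_sla(base_proxy=2, base_storage=5, node_cost=0.66, sla_ms=600):
    """Scale the #1 recommended small cluster until the SLA is met."""
    for factor in (1, 2, 3):                        # 1X, 2X, 3X as in the chart above
        n_proxy, n_storage = base_proxy * factor, base_storage * factor
        p80 = measure_p80_response_time(n_proxy, n_storage)
        if p80 < sla_ms:
            return {"scale": factor, "proxies": n_proxy, "storage": n_storage,
                    "upfront_cost_per_hour": (n_proxy + n_storage) * node_cost,
                    "p80_rt_ms": p80}
    return None                                     # SLA not met within the explored factors
```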
Outline
      Build a cost-efficient Swift cluster with expected performance
         Background & Problem
         Solution
         Experiments
      When something goes wrong in a Swift cluster
         Two Types of Failures: Hard Drive, Entire Node
         What is the performance degradation before the failures are fixed?
         How soon will the data be back (when all failed nodes are back on-line)?
         Experiments


17
Why Consider Failures
      Failure stats in Google’s DC (from Google Fellow Jeff Dean’s 2008 interview)
            In a cluster of 1,800 servers, during its first year:
            In total, about 1,000 servers failed and thousands of HDDs failed
            1 power distribution unit failed, bringing down 500 – 1,000 machines for 6 hours
            20 racks failed, each time causing 40 – 80 machines to vanish from the network


      Failures in Swift
            Given a 5-zone setup, Swift can tolerate at most 2 failed zones without losing data
            But performance will degrade to some extent until the failed zones are fixed
            If Swift operators want to ensure a certain performance level,
             they need to benchmark the performance of their Swift clusters upfront


18
How Complex to Consider Failure
      (1) Possible failure at one node
          Disk
          Swift process (rsync is still working)
          Entire node
      (2) Which type of node failed
          Proxy
          Storage
      (3) How many nodes failed at the same time
      Combining the three considerations above, the total space of failure scenarios is huge
          It is practical to prioritize these failure scenarios
          E.g. consider the worst or most common scenarios first (an enumeration sketch follows)
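An illustrative sketch of how quickly this scenario space grows when the three dimensions above are combined (the counts below are placeholders, not measured values):

```python
# Hedged sketch: enumerate (failure type, node type, concurrent failure count)
# combinations and prioritize the broader failures first.
from itertools import product

FAILURE_TYPES = ["disk", "swift process", "entire node"]
NODE_TYPES = ["proxy", "storage"]
MAX_CONCURRENT = 4            # e.g. up to 40% of a 10-node storage tier (illustrative)

scenarios = [{"failure": f, "node": n, "count": c}
             for f, n, c in product(FAILURE_TYPES, NODE_TYPES,
                                    range(1, MAX_CONCURRENT + 1))]

# Worst-first ordering: entire-node failures and larger counts are evaluated first
severity = {"entire node": 2, "disk": 1, "swift process": 0}
scenarios.sort(key=lambda s: (severity[s["failure"]], s["count"]), reverse=True)

print(len(scenarios), "scenarios; first to evaluate:", scenarios[0])
```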

19
Evaluation - Setup
      Focus on performance (not data availability)
      Measure the performance degradation compared to the “no failure” case, before the failed nodes come back on-line
      Workload: Backup workload (uploading large objects is the major operation)
      Swift cluster: 2 proxy nodes (Cluster: Xeon CPU, 10 GE), 10 storage nodes
      Two common failure scenarios: (1) entire storage node failure (2) HDD failure in storage node

      (1) Entire storage node failure
          10%, 20%, 30% and 40% of storage nodes failed in a cluster (e.g. a partial power outage)
          Different HW resources are provisioned for storage node
              EC2 Large for storage node (cost-efficient, high performance/cost)
              EC2 High-CPU for storage node (costly, over-provisioned for CPU resources)

20
Evaluation - Setup
      (2) HDD failure in a storage node (EC2 Large for storage nodes)
           Each storage node attaches 8 HDDs
           Some HDDs are intentionally unmounted during execution (see the sketch below)
           The storage node itself remains accessible
           10%, 20%, 30% and 40% of HDDs failed in the cluster
           Compare two failure distributions:
               Uniform HDD failure (failed HDDs uniformly distributed over all storage nodes)
               Skewed HDD failure (some storage nodes have many more failed HDDs than other nodes)
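A minimal sketch of the fault injection used here, assuming the common /srv/node/&lt;device&gt; mount layout (device names are placeholders); run it with a small count on every node for the uniform case, or a large count on a few nodes for the skewed case:

```python
# Hedged sketch: "fail" HDDs on a storage node by lazily unmounting their mount points
# while the node itself keeps running and serving requests.
import random
import subprocess

MOUNT_POINTS = [f"/srv/node/sdb{i}" for i in range(1, 9)]   # 8 HDDs per node (assumed names)

def fail_disks(n_fail, seed=0):
    """Unmount n_fail randomly chosen data disks on this node and return their paths."""
    random.seed(seed)
    victims = random.sample(MOUNT_POINTS, n_fail)
    for mount_point in victims:
        # Lazy unmount so in-flight I/O does not block the experiment; ignore errors.
        subprocess.run(["umount", "-l", mount_point], check=False)
    return victims

# Example: a uniform 20% cluster-wide HDD failure means failing ~2 of 8 disks on each node
# print(fail_disks(2))
```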




21
Evaluation – Entire Node Failure
           [Chart: Cluster throughput (operations/second) under no failure and 10%, 20%, 30%, 40%
            entire-storage-node failure, for storage nodes based on Large vs. High-CPU instances.]

           Storage nodes based on the Large Instance
               Throughput decreases as more storage nodes fail
           Storage nodes based on the High-CPU Instance
               Throughput decreases only when 40% of the nodes fail
22
Evaluation – Entire Node Failure
      [Charts: CPU usage (%) and network bandwidth (MB/s) on an unaffected storage node under no failure
       and 10%, 20%, 30%, 40% node failure, for storage nodes based on Large vs. High-CPU instances.]
              When storage nodes are based on the High-CPU Instance
                 The over-provisioned resources in the unaffected nodes are used more as the number of failures increases
                 So they keep performance from degrading, at least initially
              When storage nodes are based on the Large Instance
                 CPU is almost saturated even when no failures happen
23
Evaluation – HDD Failure
      [Charts: Cluster throughput (operations/second) under no failure, uniform 10%/20%/30%/40% HDD failure,
       skewed 40% HDD failure, and 40% entire-node failure; and usage (%) of unaffected disks when
       10%-40% of HDDs fail uniformly.]


      When HDDs fail uniformly across all storage nodes
            Throughput does not decrease! Why? The I/O load is evenly distributed over the remaining unaffected HDDs
      When some storage nodes have more failed HDDs than others (skewed)
            Throughput decreases significantly, but is still better than an entire-node failure
            Extreme case: when all HDDs on a storage node fail, it is almost equivalent to an entire-node failure
24
Evaluation – Take-away points
      In order to maintain a certain performance level in the face of failures
          It makes sense to “over-provision” the HW resources to some extent
          When failures happen, the over-provisioned resources reduce the performance degradation

      Entire storage node failure vs. HDD failure
          Entire node failure is worse than HDD failure.
          When only HDDs failed, performance degradation depends on:
              If failed HDDs are uniformly distributed across all storage nodes
                    degradation is smaller, because I/O load can be rebalanced over unaffected HDDs
              Otherwise (failure distribution is skewed)
                    degradation may be larger

      What about proxy node failures, or proxy and storage nodes failing together?
          They also reduce performance; we plan to investigate this in future work

25
When Failed Nodes Are Fixed
      When all failed (affected) nodes have been fixed and re-join the Swift cluster
         (1) How long will recovery take on the affected nodes?
         (2) What is the performance while recovery is in progress?

         We will show empirical results in our blog (http://www.zmanda.com/blogs/)
         For (1), it depends on:
             How much data needs to be recovered
             Network latency between the unaffected and affected nodes
             HW resources (e.g. CPU) in the unaffected nodes (to look up which data need to be restored)




26
When Failed Nodes Are Fixed
      When all failed nodes have been fixed and re-join the Swift cluster
         (1) How long will recovery take on the affected nodes?
         (2) What is the performance while recovery is in progress?

         For (2), it depends on:
             HW resources in the unaffected nodes. The unaffected nodes come under more load because they
               still serve requests while also helping the affected nodes restore their data

         Performance gradually improves as the recovery progress approaches 100%




27
Thanks! Questions/Comments?
        http://www.zmanda.com/blogs/

             swift@zmanda.com




                                       28
Back-up Slides




                 29
Evaluation – Download small objects
      Top-3 recommended hardware for a small Swift cluster
                                    HW for proxy node            HW for storage node
            Download Small     1    2 proxy nodes (High-CPU)     5 storage nodes (Large)
            Objects            2    2 proxy nodes (Cluster)      5 storage nodes (Large)
                               3    2 proxy nodes (High-CPU)    10 storage nodes (Large)

      All storage nodes are based on Large
          Only one copy of data is retrieved, so CPU and disk I/O are not busy
          Large is sufficient for this workload and costs less than High-CPU

      Proxy node
          Traffic pattern: high throughput, low network bandwidth (e.g. 2400 op/s -> 117 MB/s)
          10 GE from Cluster is over-provisioned for this traffic pattern
          1 GE from High-CPU is adequate
          5 storage nodes can almost saturate the 2 proxy nodes.
30
Evaluation – Download large objects
      Top-3 recommended hardware for a small Swift cluster
                                   HW for proxy node          HW for storage node
             Download Large   1    2 proxy nodes (Cluster)    5 storage nodes (Large)
             Objects          2    2 proxy nodes (Cluster)    10 storage nodes (Large)
                              3    2 proxy nodes (High-CPU)   5 storage nodes (Large)

      All storage nodes are based on Large
          The request rate is low, so there is little load on the CPU.
          The Large Instance is sufficient for this workload and costs less than High-CPU.

      Proxy node
          Traffic pattern: low throughput, high network bandwidth (70 op/s -> 350 MB/s)
          1 GE from High-CPU is under-provisioned; the 10 GE from Cluster pays off for this workload.
          5 storage nodes can nearly saturate the 2 proxy nodes.


31
