Sheepdog: Yet Another All-in-One Storage for OpenStack
1. Sheepdog: Yet Another All-In-One Storage for OpenStack
OpenStack Hong Kong Summit
Liu Yuan, 2013.11.8
2. Who I Am
• Active contributor to various open source projects such as Sheepdog, QEMU, Linux Kernel, Xen, OpenStack, etc.
• Top contributor to the Sheepdog project; co-maintainer with Kazutaka Morita from NTT Japan since December 2011
• Technical lead of the Sheepdog-based storage projects for internal use at www.taobao.com
• Contacts
–Email: namei.unix@gmail.com
–Micro Blog: @淘泰来
3. Agenda
Introduction - Sheepdog Overview
Exploration - Sheepdog Internals
OpenStack - Sheepdog Goal
Roadmap - Features From The Future
Industry - How Industry Uses Sheepdog
5. What Is Sheepdog
• Distributed Object Storage System In User Space
–Manage Disks and Nodes
• Aggregate the capacity and the power (IOPS + throughput)
• Hide hardware failures
• Dynamically grow or shrink the scale
–Manage Data
• Provide redundancy mechanisms (replication and erasure coding) for high availability
• Secure the data with auto-healing and auto-rebalancing mechanisms
–Provide Services
• Virtual volume for QEMU VM, iSCSI TGT (fully supported by upstream)
• RESTful container (OpenStack Swift and Amazon S3 compatible, in progress)
• Storage for OpenStack Cinder, Glance, Nova (available for Havana)
6. Sheepdog Architecture
• No meta servers!
• Zookeeper: membership management and message queue
• Node event: global rebalance
• Disk event: local rebalance
• Global consistent hash ring and P2P (see the sketch below)
[Diagram: gateway and store sheep daemons placed on a global consistent hash ring; each node runs a private hash ring over its local disks (1TB/2TB/4TB, hot-pluggable, auto-unplugged on EIO), all driven by the dog admin tool]
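To make the "no meta servers" point concrete, here is a minimal consistent-hash ring in C. It is an illustrative sketch of the placement idea only, not Sheepdog's actual code; the node addresses, virtual-node count, and FNV-1a hash are all assumptions.

/* Minimal consistent-hash ring sketch (not Sheepdog's real code). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define VNODES 64                    /* virtual points per node */

struct point { uint64_t hash; int node; };

static uint64_t fnv1a(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;
    while (len--) { h ^= *p++; h *= 0x100000001b3ULL; }
    return h;
}

static int cmp_point(const void *a, const void *b)
{
    uint64_t x = ((const struct point *)a)->hash;
    uint64_t y = ((const struct point *)b)->hash;
    return x < y ? -1 : x > y;
}

/* Walk clockwise from the object's hash to the first virtual point. */
static int lookup(const struct point *ring, int n, uint64_t oid_hash)
{
    for (int i = 0; i < n; i++)
        if (ring[i].hash >= oid_hash)
            return ring[i].node;
    return ring[0].node;             /* wrap around the ring */
}

int main(void)
{
    const char *nodes[] = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };
    int nr_nodes = 3, n = 0;
    struct point ring[3 * VNODES];

    /* Every node gets VNODES points so capacity spreads evenly. */
    for (int i = 0; i < nr_nodes; i++)
        for (int v = 0; v < VNODES; v++) {
            char key[64];
            snprintf(key, sizeof(key), "%s-%d", nodes[i], v);
            ring[n].hash = fnv1a(key, strlen(key));
            ring[n++].node = i;
        }
    qsort(ring, n, sizeof(ring[0]), cmp_point);

    uint64_t oid = 0xdeadbeef;       /* example object ID */
    printf("object %llx -> node %s\n", (unsigned long long)oid,
           nodes[lookup(ring, n, fnv1a(&oid, sizeof(oid)))]);
    return 0;
}

Because every sheep computes the same ring from the same membership view, any node can route any request, which is exactly why no meta server is needed.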
7. Why Sheepdog
•
Minimal assumptions of underlying kernel and file system
–Any type of file systems that support extended attribute(xattr)
–Only require kernel version >= 2.6.32
•
Full of features
–snapshot, clone, incremental backup, cluster-wide snapshot, discard, etc.
–User-defined replication/erasure code scheme on VDI(Virtual Disk Image) basis
–Auto node/disk management
•
Easy to set up the cluster with thousands of nodes
–Single daemon can manage unlimited number of disks in one node as efficient as RAID0
–as many as 6k+ for a single cluster
•
Small
–Fast and very small memory footprint (less than 50 MB even when busy)
–Easy to hack and maintain, 35K lines of code in C as of now
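The xattr requirement is easy to picture: a store daemon can stamp per-object metadata directly on the file that holds the object. A minimal sketch, assuming a hypothetical file name and attribute key; the real on-disk layout is Sheepdog's own.

/* Sketch: keep redundancy metadata on the object file via xattr. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
    /* "example.obj" and "user.obj.copies" are illustrative names. */
    int fd = open("example.obj", O_CREAT | O_RDWR, 0600);
    uint8_t copies = 3, got = 0;

    if (fd < 0) { perror("open"); return 1; }

    /* Persist the redundancy level next to the data itself. */
    if (fsetxattr(fd, "user.obj.copies", &copies, sizeof(copies), 0) < 0)
        perror("fsetxattr (file system must support xattr)");
    else if (fgetxattr(fd, "user.obj.copies", &got, sizeof(got)) >= 0)
        printf("object carries %u copies\n", got);

    close(fd);
    return 0;
}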
9. Sheepdog In A Nutshell
[Diagram: a VM's /dev/vda (e.g. 4G) goes through a gateway with a shared persistent cache (e.g. a 256G SSD) and over the network to the stores; data is replicated or erasure coded across weighted nodes & disks (1TB/2TB), and each store writes through a journal]
10. Sheepdog Volume
• Copy-On-Write Snapshot
–Disk-only snapshot, disk & memory snapshot
–Live snapshot, offline snapshot
–Rollback (tree structure), clone
–Incremental backup
–Instant operation: only creates a 4M inode object
• Push much of the logic into the client -> simple and fast code! (see the sketch below)
–Only 4 store opcodes: read/write/create/remove; snapshot is done by the QEMU block driver or dog
–Request serialization is handled by the client, not Sheepdog
–The inode object is treated the same as a data object
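A rough sketch of what "only 4 opcodes" and client-side snapshots mean: the store stays a dumb object store, and a snapshot is just the client creating a fresh inode object that shares the old one's data map. The struct layout and field names below are illustrative assumptions, not Sheepdog's real inode format.

/* The store only ever sees these four operations. */
#include <stdint.h>

enum store_opcode {
    OP_CREATE,
    OP_READ,
    OP_WRITE,
    OP_REMOVE,
};

struct inode {                       /* a 4M object in the real system */
    char     name[256];
    uint64_t vdi_id;
    uint64_t parent_vdi_id;          /* snapshot parent, 0 if none */
    uint64_t data_vdi_id[1024];      /* which VDI owns each data object */
};

/* Client-side snapshot: freeze the current inode and start a new one
 * that shares all data objects; copy-on-write happens on later writes. */
struct inode snapshot(const struct inode *cur, uint64_t new_vdi_id)
{
    struct inode next = *cur;        /* share the data object map */
    next.parent_vdi_id = cur->vdi_id;/* remember the snapshot parent */
    next.vdi_id = new_vdi_id;        /* OP_CREATE this inode object */
    return next;
}

Since the inode is just another object, the store needs no special snapshot code at all, which is what keeps the server side small and fast.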
11. Gateway - Request Engine
• Return only when all copies succeed
• Request queue with concurrent handling; requests routed via the node ring
• Socket cache and net cache inside the request manager
• Retry on error (timeout, EIO); see the sketch below
• A sheep that hits EIO is degraded to GW-only
[Diagram: a gateway sheep routes requests over the network to store sheep with 1TB/2TB/4TB disks; a failed sheep is dropped from the node ring and its requests are retried]
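A sketch of the gateway rule "return only when all copies succeed": fan a write out to every replica and retry on timeout or EIO against a refreshed ring. send_to_node() and refresh_ring() are hypothetical helpers, not Sheepdog's API.

#include <errno.h>
#include <stdbool.h>

#define MAX_RETRIES 3

extern int  send_to_node(int node, const void *req);   /* hypothetical */
extern void refresh_ring(int *nodes, int nr);          /* hypothetical */

int gateway_write(int *nodes, int nr_copies, const void *req)
{
    for (int try = 0; try < MAX_RETRIES; try++) {
        bool all_ok = true;

        for (int i = 0; i < nr_copies; i++) {
            int ret = send_to_node(nodes[i], req);
            if (ret == -ETIMEDOUT || ret == -EIO) {
                all_ok = false;      /* node may have left the ring */
                break;
            }
        }
        if (all_ok)
            return 0;                /* every copy acknowledged */

        refresh_ring(nodes, nr_copies); /* re-route around the failure */
    }
    return -EIO;
}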
12. Store - Data Engine
• Disk manager spreads objects across local disks via a per-node disk ring
• Writes go through a journal
• A disk is auto-unplugged on EIO (see the sketch below):
1. Fake a network error to ask the GW to retry
2. Update the disk ring
3. Start local data rebalance
[Diagram: GW requests arrive over the network at a sheep's disk manager, which places objects on local 1TB/2TB disks; a broken disk is dropped from the disk ring]
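The three-step EIO path from this slide, as a sketch. The helper functions are hypothetical; only the sequence (fake a network error, update the disk ring, rebalance locally) comes from the slide.

#include <errno.h>

extern int  write_to_disk(int disk, const void *buf, int len); /* hypothetical */
extern void remove_disk_from_ring(int disk);                   /* hypothetical */
extern void start_local_rebalance(void);                       /* hypothetical */

int store_write(int disk, const void *buf, int len)
{
    int ret = write_to_disk(disk, buf, len);

    if (ret == -EIO) {
        remove_disk_from_ring(disk); /* 2. update the disk ring        */
        start_local_rebalance();     /* 3. re-spread this disk's data  */
        return -ENETUNREACH;         /* 1. fake a network error so the */
                                     /*    GW transparently retries    */
    }
    return ret;
}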
14. Erasure Coding Over Full Replication
• Advantages
–Far less storage overhead
–Rumor breaking:
• Better R/W performance
• Can run VM images!
• Supports random R/W
[Chart: read and write throughput (MB/s, 0-100) on 6 nodes with 1Gb NIC, comparing Replication (3 copies) and Erasure (4:2)]
• Disadvantages
–Generates more traffic for recovery
• (X + Y)/(Y + 1) times the data (suppose X data strips and Y parity strips)
–2 times the data for 4:2, compared to 3 full copies
15. Recovery - Redundancy Repair & Data Rebalance
• The recovery manager schedules work whenever the node ring updates its version
• If a copy is lost, it is read or rebuilt from another copy
• Objects are migrated to other nodes for rebalance
[Diagram: a sheep's recovery manager, triggered by node ring version updates, pulls objects over the network from peer sheep while the request queue and sleep queue keep client I/O flowing]
16. Recovery Cont.
• Eager recovery by default
–Users may stop it and do manual recovery temporarily
• Node/disk event handling
–Disk events and node events share the same algorithm
• Handles mixed node and disk events nicely
–A subsequent event supersedes the previous one
• Handles group join/leave of disks and nodes gracefully
• Recovery handling is transparent to the client (see the sketch below)
–Requests for objects being recovered are put on a sleep queue and woken up later
–Requests are served directly if the object is already in the store
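A sketch of "recovery transparent to the client": if the object is still being rebuilt, park the request on the sleep queue; otherwise serve it from the store. All helper names are hypothetical.

#include <stdbool.h>
#include <stdint.h>

struct request;

extern bool object_in_recovery(uint64_t oid);               /* hypothetical */
extern void sleep_queue_add(struct request *req);           /* hypothetical */
extern int  serve_from_store(struct request *req, uint64_t oid);

int handle_request(struct request *req, uint64_t oid)
{
    if (object_in_recovery(oid)) {
        /* Object is still being fetched or rebuilt: park the request.
         * Recovery wakes the sleep queue once oid arrives, so the
         * client only ever sees extra latency, never an error. */
        sleep_queue_add(req);
        return 0;
    }
    /* Object is right there in the store: serve it directly. */
    return serve_from_store(req, oid);
}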
17. Farm - Cluster Wide Snapshot
• The slicer cuts volume data into 128K chunks; a hash dedupper drops duplicate chunks (see the sketch below)
• A metadata generator records each snapshot; chunks are shipped to dedicated backup storage
• Incremental backup
• Up to 50% dedup ratio
• Compression doesn't help
Think of Sheepdog on Sheepdog? Yeah!
[Diagram: sheep push sliced, dedupped chunks plus metadata over the network to a dedicated backup store]
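A sketch of the slicer plus hash dedupper: cut the stream into 128K chunks, hash each one, and store a chunk only the first time its hash shows up. The flat hash table and FNV-1a hash below are illustrative stand-ins for a real content-addressed store.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (128 * 1024)      /* the 128K slice from the slide */
#define TABLE_SIZE (1 << 20)

static uint64_t seen[TABLE_SIZE];    /* 0 marks an empty slot */

static uint64_t fnv1a(const unsigned char *p, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    while (len--) { h ^= *p++; h *= 0x100000001b3ULL; }
    return h ? h : 1;                /* keep 0 free as the empty marker */
}

/* Returns 1 if the chunk is new and must be stored, 0 if deduplicated. */
static int dedup_chunk(const unsigned char *chunk, size_t len)
{
    uint64_t h = fnv1a(chunk, len);
    size_t i = h & (TABLE_SIZE - 1);

    while (seen[i] && seen[i] != h)  /* linear probing on collision */
        i = (i + 1) & (TABLE_SIZE - 1);
    if (seen[i] == h)
        return 0;                    /* duplicate: record a reference */
    seen[i] = h;
    return 1;                        /* new chunk: ship to backup store */
}

int main(void)
{
    unsigned char chunk[CHUNK_SIZE];
    memset(chunk, 0xab, sizeof(chunk));
    printf("first copy stored: %d\n", dedup_chunk(chunk, sizeof(chunk)));
    printf("second copy stored: %d\n", dedup_chunk(chunk, sizeof(chunk)));
    return 0;
}

This is also why incremental backup is cheap here: unchanged chunks hash to values already in the table and are never shipped again.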
21. Look Into The Future
• RESTful Container
–Planned to be OpenStack Swift API compatible first, coming soon
• Hyper Volume
–256PB volume, coming soon
• Geo-Replication
• Sheepdog On Sheepdog
–Storage for cluster-wide snapshot
• Slow Disk & Broken Disk Detector
–Deal with processes hung in the uninterruptible D state because of broken disks in massive deployments
23. Sheepdog In Taobao & NTT
• VMs running inside a Sheepdog cluster for test & dev at Taobao
• Ongoing project with 10k+ ARM nodes for cold data, accessed over HTTP, at Taobao
• Sheepdog cluster running as iSCSI TGT backend storage (LUN device pool) at NTT
[Diagram: three deployments, a VM-on-Sheepdog grid and an HTTP cold-data ARM cluster at Taobao, and an iSCSI LUN device pool at NTT]