Divide and conquer – Shared disk cluster file systems shipped with the Linux kernel
Udo Seidel
Shared file systems
●   Multiple servers access the same data
●   Different approaches
    ●   Network based, e.g. NFS, CIFS
    ●   Clustered
        –   Shared disk, e.g. CXFS, CFS, GFS(2), OCFS2
        –   Distributed parallel, e.g. Lustre, Ceph
History
●   GFS(2)
    ●   First version in the mid-90s
    ●   Started on IRIX, later ported to Linux
    ●   Commercial background: Sistina and RedHat
    ●   Part of Vanilla Linux kernel since 2.6.19
●   OCFS2
    ●   OCFS1 for database files only
    ●   First version in 2005
    ●   Part of Vanilla Linux kernel since 2.6.16
Features/Challenges/More
●   As similar as possible to local file systems
    ●   Internal setup
    ●   Management
●   Cluster awareness
    ●   Data integrity
    ●   Allocation
Framework
●   Bridges the gap between single-node and cluster operation
●   3 main components
    ●   Cluster-ware
    ●   Locking
    ●   Fencing
Framework GFS2 (I)
●   General-purpose cluster-ware
    ●   More flexible
    ●   More options/functions
    ●   More complexity
    ●   Configuration files in XML
●   Locking uses cluster framework too
●   system-config-cluster OR Conga OR vi & scp
Framework GFS2 (II)
# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>

<cluster config_version="3" name="gfs2">
<fence_daemon post_fail_delay="0" post_join_delay="3"/>

<clusternodes>
<clusternode name="node0" nodeid="1" votes="1">

<fence/>
</clusternode>

<clusternode name="node1" nodeid="2" votes="1">
...

</cluster>
#
Framework OCFS2 (I)
●   Cluster-ware just for OCFS2
    ●   Less flexible
    ●   Fewer options/functions
    ●   Less complexity
    ●   Configuration file in ASCII
●   Locking uses cluster framework too
●   ocfs2console OR vi & scp
Framework OCFS2 (II)
# cat /etc/ocfs2/cluster.conf
node:

      ip_port = 7777
      ip_address = 192.168.0.1

      number = 0
      name = node0

      cluster = ocfs2
...

cluster:
      node_count = 2

#
Locking
●   Distributed Lock Manager (DLM)
●   Based on VMS-DLM
●   Lock modes
    ●   Exclusive Lock (EX)
    ●   Protected Read (PR)
    ●   No Lock (NL)
    ●   Concurrent Write Lock (CW) – GFS2 only
    ●   Concurrent Read Lock (CR) – GFS2 only
    ●   Protected Write (PW) – GFS2 only
Locking - Compatibility
                 Existing Lock
Requested Lock   NL       CR     CW    PR    PW    EX
NL               Yes      Yes    Yes   Yes   Yes   Yes
CR               Yes      Yes    Yes   Yes   Yes   No
CW               Yes      Yes    Yes   No    No    No
PR               Yes      Yes    No    Yes   No    No
PW               Yes      Yes    No    No    No    No
EX               Yes      No     No    No    No    No
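The compatibility matrix above can be encoded directly; a small sketch in shell (the compatible helper is ours, not part of either DLM implementation):

```shell
# Sketch: DLM lock-mode compatibility, encoding the matrix above.
# compatible <requested> <existing> prints "yes" or "no".
compatible() {
  case "$1:$2" in
    NL:*|*:NL) echo yes ;;   # NL is compatible with every mode
    CR:EX|EX:CR) echo no ;;  # CR conflicts only with EX
    CR:*|*:CR) echo yes ;;
    CW:CW|PR:PR) echo yes ;; # CW and PR are each self-compatible
    *) echo no ;;            # all remaining pairs conflict
  esac
}

compatible PR PR   # prints: yes
compatible CW PR   # prints: no
```

Note how every row and column involving NL is "yes": NL is how a node keeps its interest in a lock registered without blocking anyone.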
Fencing
●   Separation of host and storage
    ●   Power Fencing
        –   Power switch, e.g. APC
        –   Server side, e.g. IPMI, iLO
        –   Useful in other scenarios
        –   Post-mortem more difficult
    ●   I/O fencing
        –   SAN switch, e.g. Brocade, Qlogic
        –   Possible to investigate “unhealthy” server
Fencing - GFS2
●   Both fencing methods
●   Part of cluster configuration
●   Cascading possible
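Since fencing is part of the GFS2 cluster configuration, each node declares its fence method in cluster.conf; a hypothetical fragment (the agent choice and its parameters are examples, not from the talk):

```xml
<clusternode name="node0" nodeid="1" votes="1">
  <fence>
    <method name="1">
      <device name="ipmi-node0"/>
    </method>
  </fence>
</clusternode>
<fencedevices>
  <fencedevice agent="fence_ipmilan" name="ipmi-node0"
               ipaddr="192.168.0.10" login="admin" passwd="secret"/>
</fencedevices>
```

Cascading means listing several method blocks per node: the next one is tried if the previous fence attempt fails.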
Fencing - OCFS2
●   Only power fencing
    ●   Only self fencing
GFS2 – Internals (I)
●   Superblock
    ●   Starts at block 128
    ●   Expected data + cluster information
    ●   Pointers to master and root directory
●   Resource groups
    ●   Comparable to the cylinder groups of a traditional
        Unix file system
    ●   Allocatable from different cluster nodes -> locking
        granularity
GFS2 – Internals (II)
●   Master directory
    ●   Contains meta-data, e.g. journal index, quota, ...
    ●   Not visible for ls and Co.
    ●   File system unique and cluster node specific files
●   Journaling file system
    ●   One journal per cluster node
    ●   Each journal accessible by all nodes (recovery)
GFS2 – Internals (III)
●   Inode/Dinode
    ●   Usual information, e.g. owner, mode, time stamp
    ●   Pointers to blocks: either data or pointer
    ●   Only one level of indirection
    ●   “stuffing”
●   Directory management via Extendible Hashing
●   Meta file statfs
    ●   statfs()
    ●   Tuning via sysfs
GFS2 – Internals (IV)
●   Meta files
    ●   jindex directory containing the journals
        –   journalX
    ●   rindex Resource group index
    ●   quota
    ●   per_node directory containing node specific files
GFS2 – what else
●   Extended attributes xattr
●   ACLs
●   Local mode = one node access
OCFS2 – Internals (I)
●   Superblock
    ●   Starts at block 3 (1+2 for OCFS1)
    ●   Expected data + cluster information
    ●   Pointers to master and root directory
    ●   Up to 6 backups
        –   at pre-defined offset
        –   at 2^n Gbyte, n=0,2,4,6,8,10
●   Cluster groups
    ●   Comparable to the cylinder groups of a traditional
        Unix file system
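The backup superblock offsets above (2^n GByte, n = 0, 2, 4, 6, 8, 10) work out to fixed positions, which is what makes them easy to find during a repair; a quick check:

```shell
# Backup superblock offsets: 2^n GiB for n = 0, 2, 4, 6, 8, 10
# i.e. at 1, 4, 16, 64, 256 and 1024 GiB into the device.
for n in 0 2 4 6 8 10; do
  echo "backup superblock at $((1 << n)) GiB"
done
```

A file system only carries the backups that fit: a 20 GiB volume has backups at 1, 4 and 16 GiB.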
OCFS2 – Internals (II)
●   Master or system directory
    ●   Contains meta-data, e.g. journal index, quota, ...
    ●   Not visible for ls and Co.
    ●   File system unique and cluster node specific files
●   Journaling file system
    ●   One journal per cluster node
    ●   Each journal accessible by all nodes (recovery)
OCFS2 – Internals (III)
●   Inode
    ●   Usual information, e.g. owner, mode, time stamp
    ●   Pointers to blocks: either data or pointer
    ●   Only one level of indirection
●   global_inode_alloc
    ●   Global meta data file
    ●   inode_alloc node specific counterpart
●   slot_map
    ●   Global meta data file
    ●   Active cluster nodes
OCFS2 – Internals (IV)
●   orphan_dir
    ●   Local meta data file
    ●   Cluster aware deletion of files in use
●   truncate_log
    ●   Local meta data file
    ●   Deletion cache
OCFS2 – what else
●   Two versions: 1.2 and 1.4
    ●   Mount compatible
    ●   Framework not network compatible
    ●   New features disabled by default
●   For 1.4:
    ●   Extended attributes xattr
    ●   Inode based snapshotting
    ●   Preallocation
File system management
●   Known/expected tools + cluster details
    ●   mkfs
    ●   mount/umount
    ●   fsck
●   File system specific tools
    ●   gfs2_XXXX
    ●   tunefs.ocfs2, debugfs.ocfs2
GFS2 management (I)
●   File system creation needs additional
    information
    ●   Cluster name
    ●   Unique file system identifier (string)
    ●   Optional:
        –   Locking mode to be used
        –   number of journals
    ●   Tuning by changing default size for journals,
        resource groups, ...
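A minimal creation command might look like this (cluster name, file system name, journal count and device path are hypothetical; the command is only echoed here, not executed):

```shell
# Sketch: mkfs.gfs2 with the extra cluster information described above.
# -p: locking protocol (lock_dlm, or lock_nolock for single-node use)
# -t: clustername:fsname -- the unique file system identifier
# -j: number of journals, one per cluster node
DEVICE=/dev/vg00/lv_gfs2   # shared LUN, hypothetical
CMD="mkfs.gfs2 -p lock_dlm -t mycluster:myfs -j 2 $DEVICE"
echo "$CMD"
```

The cluster name in -t must match the name in cluster.conf, otherwise the cluster-ware refuses the mount.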
GFS2 management (II)
●   Mount/umount
    ●   No real syntax surprise
    ●   First node checks all journals
    ●   Enabling ACL, quota, single node mode
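The mount step looks like any local file system, with the cluster-specific behaviour selected via options; a sketch (paths hypothetical, command only echoed):

```shell
# Sketch: mounting GFS2 with ACLs and quota enabled.
# acl and quota=on are GFS2 mount options; lockproto=lock_nolock
# would give the single-node mode mentioned above.
MOUNT_CMD="mount -t gfs2 -o acl,quota=on /dev/vg00/lv_gfs2 /mnt/shared"
echo "$MOUNT_CMD"
```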
GFS2 management (III)
●   File system check
    ●   Journal recovery of node X by node Y
    ●   Done by one node
    ●   File system offline on all other nodes
    ●   Known phases
        –   Journals
        –   Meta data
        –   References: data blocks, inodes
GFS2 tuning (I)
●   gfs2_tool
    ●   Most powerful
        –   Display superblock
        –   Change superblock settings (locking mode, cluster name)
        –   List meta data
        –   freeze/unfreeze file system
        –   Special attributes, e.g. appendonly, noatime
    ●   Requires file system online (mostly)
GFS2 tuning (II)
●   gfs2_edit
    ●   Logical extension of gfs2_tool
    ●   More details, e.g. node-specific meta data, block
        level
●   gfs2_jadd
    ●   Different sizes possible
    ●   No deletion possible
    ●   Can cause data space shortage
GFS2 tuning (III)
●   gfs2_grow
    ●   Needs space in meta directory
    ●   Online only
    ●   No shrinking
OCFS2 management (I)
●   File system creation
    ●   no additional information needed
    ●   Tuning by optional parameters
●   Mount/umount
    ●   No real syntax surprise
    ●   First node checks all journals
    ●   Enabling ACL, quota, single node mode
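In contrast to GFS2, mkfs.ocfs2 needs no cluster name; a sketch of creation and mount (label, slot count and device are hypothetical, commands only echoed):

```shell
# Sketch: OCFS2 creation and mount -- no cluster information required.
# -L: file system label, -N: number of node slots (one journal each)
MKFS_CMD="mkfs.ocfs2 -L webdata -N 4 /dev/vg00/lv_ocfs2"
MOUNT_CMD="mount -t ocfs2 /dev/vg00/lv_ocfs2 /mnt/shared"
echo "$MKFS_CMD"
echo "$MOUNT_CMD"
```

The node slot count plays the role of the GFS2 journal count: it caps how many nodes can mount the file system at once.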
OCFS2 management (II)
●   File system check
    ●   Journal recovery of node X by node Y
    ●   Done by one node
    ●   File system offline on all other nodes
    ●   Fixed offset of superblock backup handy
    ●   Known phases
        –   Journals
        –   Meta data
        –   References: data blocks, inodes
OCFS2 tuning (I)
●   tunefs.ocfs2
    ●   Display/change file system label
    ●   Display/change number of journals
    ●   Change journal setup, e.g. size
    ●   Grow file system (no shrinking)
    ●   Create backup of superblock
    ●   Display/enable/disable specific file system features
        –   Sparse files
        –   “stuffed” inodes
OCFS2 tuning (II)
●   debugfs.ocfs2
    ●   Display file system settings, e.g. superblock
    ●   Display inode information
    ●   Access meta data files
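Typical invocations of the two OCFS2 tools above might look like this (device path and label are hypothetical; the commands are only echoed, not executed):

```shell
# Sketch: tunefs.ocfs2 / debugfs.ocfs2 usage (device is hypothetical).
DEV=/dev/vg00/lv_ocfs2
LABEL_CMD="tunefs.ocfs2 -L newlabel $DEV"            # change the label
FEAT_CMD="tunefs.ocfs2 --fs-features=sparse $DEV"    # enable sparse files
DUMP_CMD="debugfs.ocfs2 -R stats $DEV"               # dump the superblock
echo "$LABEL_CMD"
echo "$FEAT_CMD"
echo "$DUMP_CMD"
```

debugfs.ocfs2's -R runs a single request non-interactively, which makes it handy for scripting.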
Volume manager
●   Necessary to handle more than one
    LUN/partition
●   Cluster-aware
●   Bridge feature gap, e.g. volume based
    snapshotting
●   CLVM
●   EVMS – OCFS2 only
Key data - comparison
                             GFS2                             OCFS2
Maximum # of cluster nodes   16 supported (256 theoretical)   256
Journaling                   Yes                              Yes
Cluster-less/local mode      Yes                              Yes
Maximum file system size     25 TB (8 EB theoretical)         16 TB (4 EB theoretical)
Maximum file size            25 TB (8 EB theoretical)         16 TB (4 EB theoretical)
POSIX ACL                    Yes                              Yes
Grow-able                    Yes, online only                 Yes, online and offline
Shrinkable                   No                               No
Quota                        Yes                              Yes
O_DIRECT                     On file level                    Yes
Extended attributes          Yes                              Yes
Maximum file name length     255                              255
File system snapshots        No                               No
Summary
●   GFS2 longer history than OCFS2
●   OCFS2 setup simpler and easier to maintain
●   GFS2 setup more flexible and powerful
●   OCFS2 getting close to GFS2
●   Choice often dictated by the Linux vendor
References
http://sourceware.org/cluster/gfs/
http://www.redhat.com/gfs/
http://oss.oracle.com/projects/ocfs2/
http://sources.redhat.com/cluster/wiki/
http://sourceware.org/lvm2/
http://evms.sourceforge.net/
Thank you!

Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Linuxkongress2010.gfs2ocfs2.talk

1. Divide and conquer – Shared disk cluster file systems shipped with the Linux kernel
   Udo Seidel
2. Shared file systems
●   Multiple servers access the same data
●   Different approaches
    ●   Network-based, e.g. NFS, CIFS
    ●   Clustered
        –   Shared disk, e.g. CXFS, CFS, GFS(2), OCFS2
        –   Distributed parallel, e.g. Lustre, Ceph
3. History
●   GFS(2)
    ●   First version in the mid-1990s
    ●   Started on IRIX, later ported to Linux
    ●   Commercial background: Sistina, later Red Hat
    ●   Part of the vanilla Linux kernel since 2.6.19
●   OCFS2
    ●   OCFS1 for database files only
    ●   First version in 2005
    ●   Part of the vanilla Linux kernel since 2.6.16
4. Features/Challenges/More
●   As similar to local file systems as possible
    ●   Internal setup
    ●   Management
●   Cluster awareness
    ●   Data integrity
    ●   Allocation
5. Framework
●   Bridges the gap between a single node and a cluster
●   3 main components
    ●   Cluster-ware
    ●   Locking
    ●   Fencing
6. Framework GFS2 (I)
●   General-purpose cluster-ware
    ●   More flexible
    ●   More options/functions
    ●   More complexity
    ●   Configuration files in XML
●   Locking uses the cluster framework too
●   system-config-cluster OR Conga OR vi & scp
7. Framework GFS2 (II)
# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="3" name="gfs2">
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="node0" nodeid="1" votes="1">
<fence/>
</clusternode>
<clusternode name="node1" nodeid="2" votes="1">
...
</cluster>
#
8. Framework OCFS2 (I)
●   Cluster-ware just for OCFS2
    ●   Less flexible
    ●   Fewer options/functions
    ●   Less complexity
    ●   Configuration file in ASCII
●   Locking uses the cluster framework too
●   ocfs2console OR vi & scp
9. Framework OCFS2 (II)
# cat /etc/ocfs2/cluster.conf
node:
      ip_port = 7777
      ip_address = 192.168.0.1
      number = 0
      name = node0
      cluster = ocfs2
...
cluster:
      node_count = 2
#
10. Locking
●   Distributed Lock Manager (DLM)
●   Based on the VMS DLM
●   Lock modes
    ●   Exclusive Lock (EX)
    ●   Protected Read (PR)
    ●   No Lock (NL)
    ●   Concurrent Write Lock (CW) – GFS2 only
    ●   Concurrent Read Lock (CR) – GFS2 only
    ●   Protected Write (PW) – GFS2 only
11. Locking – Compatibility

Existing   Requested lock
lock       NL    CR    CW    PR    PW    EX
NL         Yes   Yes   Yes   Yes   Yes   Yes
CR         Yes   Yes   Yes   Yes   Yes   No
CW         Yes   Yes   Yes   No    No    No
PR         Yes   Yes   No    Yes   No    No
PW         Yes   Yes   No    No    No    No
EX         Yes   No    No    No    No    No
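The compatibility matrix can be encoded in a few lines of shell. The helper below is purely illustrative (it is not part of the GFS2 or OCFS2 tooling); it answers "yes" if a lock in the requested mode can be granted while a lock in the existing mode is held.

```shell
# Hypothetical helper encoding the DLM lock compatibility matrix.
# Usage: dlm_compat <existing-mode> <requested-mode>
dlm_compat() {
  case "$1:$2" in
    NL:*|*:NL)   echo yes ;;   # NL is compatible with every mode
    CR:EX|EX:CR) echo no  ;;   # CR conflicts only with EX ...
    CR:*|*:CR)   echo yes ;;   # ... and is compatible with the rest
    CW:CW)       echo yes ;;   # CW is additionally compatible with itself
    PR:PR)       echo yes ;;   # PR is additionally compatible with itself
    *)           echo no  ;;   # all remaining combinations conflict
  esac
}

dlm_compat PR PR   # -> yes
dlm_compat PR PW   # -> no
```

Reading the matrix this way makes the intent obvious: NL never blocks anyone, EX blocks everyone but NL, and the shared modes tolerate each other only up to a point.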
12. Fencing
●   Separation of host and storage
●   Power fencing
    –   Power switch, e.g. APC
    –   Server side, e.g. IPMI, iLO
    –   Useful in other scenarios
    –   Post-mortem analysis more difficult
●   I/O fencing
    –   SAN switch, e.g. Brocade, QLogic
    –   Possible to investigate an “unhealthy” server
13. Fencing – GFS2
●   Both fencing methods
●   Part of the cluster configuration
●   Cascading possible
14. Fencing – OCFS2
●   Only power fencing
●   Only self-fencing
15. GFS2 – Internals (I)
●   Superblock
    ●   Starts at block 128
    ●   Expected data + cluster information
    ●   Pointers to master and root directory
●   Resource groups
    ●   Comparable to the cylinder groups of traditional Unix file systems
    ●   Allocatable from different cluster nodes -> locking granularity
16. GFS2 – Internals (II)
●   Master directory
    ●   Contains meta-data, e.g. journal index, quota, ...
    ●   Not visible to ls and co.
    ●   File-system-unique and cluster-node-specific files
●   Journaling file system
    ●   One journal per cluster node
    ●   Each journal accessible by all nodes (recovery)
17. GFS2 – Internals (III)
●   Inode/Dinode
    ●   Usual information, e.g. owner, mode, time stamps
    ●   Pointers to blocks: either data or further pointers
    ●   Only one level of indirection
    ●   “Stuffing”
●   Directory management via extendible hashing
●   Meta file statfs
    ●   statfs()
    ●   Tuning via sysfs
18. GFS2 – Internals (IV)
●   Meta files
    ●   jindex – directory containing the journals (journalX)
    ●   rindex – resource group index
    ●   quota
    ●   per_node – directory containing node-specific files
19. GFS2 – what else
●   Extended attributes (xattr)
●   ACLs
●   Local mode = one-node access
20. OCFS2 – Internals (I)
●   Superblock
    ●   Starts at block 3 (blocks 1+2 reserved for OCFS1)
    ●   Expected data + cluster information
    ●   Pointers to master and root directory
    ●   Up to 6 backups
        –   At pre-defined offsets
        –   At 2^n GByte, n = 0, 2, 4, 6, 8, 10
●   Cluster groups
    ●   Comparable to the cylinder groups of traditional Unix file systems
21. OCFS2 – Internals (II)
●   Master or system directory
    ●   Contains meta-data, e.g. journal index, quota, ...
    ●   Not visible to ls and co.
    ●   File-system-unique and cluster-node-specific files
●   Journaling file system
    ●   One journal per cluster node
    ●   Each journal accessible by all nodes (recovery)
22. OCFS2 – Internals (III)
●   Inode
    ●   Usual information, e.g. owner, mode, time stamps
    ●   Pointers to blocks: either data or further pointers
    ●   Only one level of indirection
●   global_inode_alloc
    ●   Global meta-data file
    ●   inode_alloc is the node-specific counterpart
●   slot_map
    ●   Global meta-data file
    ●   Active cluster nodes
23. OCFS2 – Internals (IV)
●   orphan_dir
    ●   Local meta-data file
    ●   Cluster-aware deletion of files in use
●   truncate_log
    ●   Local meta-data file
    ●   Deletion cache
24. OCFS2 – what else
●   Two versions: 1.2 and 1.4
    ●   Mount compatible
    ●   Frameworks not network compatible
    ●   New features disabled by default
●   For 1.4:
    ●   Extended attributes (xattr)
    ●   Inode-based snapshotting
    ●   Preallocation
25. File system management
●   Known/expected tools + cluster details
    ●   mkfs
    ●   mount/umount
    ●   fsck
●   File-system-specific tools
    ●   gfs2_XXXX
    ●   tunefs.ocfs2, debugfs.ocfs2
26. GFS2 management (I)
●   File system creation needs additional information
    ●   Cluster name
    ●   Unique file system identifier (string)
    ●   Optional:
        –   Locking mode to be used
        –   Number of journals
●   Tuning by changing the default size of journals, resource groups, ...
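A typical creation call looks like the sketch below; device, cluster name and file system name are placeholders. `-t` takes `<clustername>:<fsname>`, `-p` selects the locking protocol, and `-j` sets the number of journals (one per cluster node).

```shell
# Sketch only – adjust device, cluster and file system names.
mkfs.gfs2 -p lock_dlm -t gfs2:data -j 2 /dev/vg0/lv_data
```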
27. GFS2 management (II)
●   mount/umount
    ●   No real syntax surprises
    ●   First node checks all journals
    ●   Enabling ACLs, quota, single-node mode
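The mount options mentioned above can be sketched as follows; device and mount point are placeholders.

```shell
# Sketch only – enable ACLs and quota at mount time.
mount -t gfs2 -o acl,quota=on /dev/vg0/lv_data /mnt/data

# Single-node ("local") mode can be forced via the locking protocol:
mount -t gfs2 -o lockproto=lock_nolock /dev/vg0/lv_data /mnt/data
```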
28. GFS2 management (III)
●   File system check
    ●   Journal recovery of node X by node Y
    ●   Done by one node
        –   File system offline everywhere else
    ●   Known phases
        –   Journals
        –   Meta-data
        –   References: data blocks, inodes
29. GFS2 tuning (I)
●   gfs2_tool
    ●   Most powerful
        –   Display the superblock
        –   Change superblock settings (locking mode, cluster name)
        –   List meta-data
        –   Freeze/unfreeze the file system
        –   Special attributes, e.g. appendonly, noatime
    ●   Requires the file system to be online (mostly)
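Two of the operations above, sketched with placeholder device and mount point:

```shell
# Sketch only – display all superblock fields of an (unmounted) device.
gfs2_tool sb /dev/vg0/lv_data all

# Quiesce and resume a mounted file system.
gfs2_tool freeze /mnt/data
gfs2_tool unfreeze /mnt/data
```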
30. GFS2 tuning (II)
●   gfs2_edit
    ●   Logical extension of gfs2_tool
    ●   More details, e.g. node-specific meta-data, block level
●   gfs2_jadd
    ●   Different sizes possible
    ●   No deletion possible
    ●   Can cause data space shortage
31. GFS2 tuning (III)
●   gfs2_grow
    ●   Needs space in the meta directory
    ●   Online only
    ●   No shrinking
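Both tools operate on the mounted file system; the mount point below is a placeholder.

```shell
# Sketch only – add one more journal, e.g. for an additional node.
gfs2_jadd -j 1 /mnt/data

# Grow into new space on the underlying (already enlarged) device.
gfs2_grow /mnt/data
```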
32. OCFS2 management (I)
●   File system creation
    ●   No additional information needed
    ●   Tuning by optional parameters
●   mount/umount
    ●   No real syntax surprises
    ●   First node checks all journals
    ●   Enabling ACLs, quota, single-node mode
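In contrast to GFS2, creation works without cluster details; device, label and mount point below are placeholders.

```shell
# Sketch only – -L sets a label, -N the number of node slots.
mkfs.ocfs2 -L data -N 4 /dev/sdb1
mount -t ocfs2 /dev/sdb1 /mnt/data
```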
33. OCFS2 management (II)
●   File system check
    ●   Journal recovery of node X by node Y
    ●   Done by one node
        –   File system offline everywhere else
    ●   Fixed offsets of the superblock backups come in handy
    ●   Known phases
        –   Journals
        –   Meta-data
        –   References: data blocks, inodes
34. OCFS2 tuning (I)
●   tunefs.ocfs2
    ●   Display/change the file system label
    ●   Display/change the number of journals
    ●   Change the journal setup, e.g. size
    ●   Grow the file system (no shrinking)
    ●   Create a backup of the superblock
    ●   Display/enable/disable specific file system features
        –   Sparse files
        –   “Stuffed” inodes
35. OCFS2 tuning (II)
●   debugfs.ocfs2
    ●   Display file system settings, e.g. the superblock
    ●   Display inode information
    ●   Access meta-data files
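Two small sketches for the tools above; the device is a placeholder.

```shell
# Sketch only – change the volume label.
tunefs.ocfs2 -L newlabel /dev/sdb1

# Run a single debugfs command non-interactively: dump the superblock.
debugfs.ocfs2 -R "stats" /dev/sdb1
```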
36. Volume manager
●   Necessary to handle more than one LUN/partition
●   Must be cluster-aware
●   Bridges the feature gap, e.g. volume-based snapshotting
●   CLVM
●   EVMS – OCFS2 only
37. Key data – comparison

                               GFS2                        OCFS2
Maximum # of cluster nodes     16 supported                256
                               (theoretical: 256)
Journaling                     Yes                         Yes
Cluster-less/local mode        Yes                         Yes
Maximum file system size       25 TB (theoretical: 8 EB)   16 TB (theoretical: 4 EB)
Maximum file size              25 TB (theoretical: 8 EB)   16 TB (theoretical: 4 EB)
POSIX ACLs                     Yes                         Yes
Grow-able                      Yes, online only            Yes, online and offline
Shrinkable                     No                          No
Quota                          Yes                         Yes
O_DIRECT                       On file level               Yes
Extended attributes            Yes                         Yes
Maximum file name length       255                         255
File system snapshots          No                          No
38. Summary
●   GFS2 has a longer history than OCFS2
●   OCFS2 setup is simpler and easier to maintain
●   GFS2 setup is more flexible and powerful
●   OCFS2 is getting close to GFS2
●   The choice may depend on the Linux vendor