Divide and conquer – Shared disk cluster file systems shipped with the Linux kernel
Udo Seidel
Shared file systems
●   Multiple servers access the same data
●   Different approaches
    ●   Network based, e.g. NFS, CIFS
    ●   Clustered
        –   Shared disk, e.g. CXFS, CFS, GFS(2), OCFS2
        –   Distributed parallel, e.g. Lustre, Ceph
History
●   GFS(2)
    ●   First version in the mid-1990s
    ●   Started on IRIX, later ported to Linux
    ●   Commercial background: Sistina and Red Hat
    ●   Part of the vanilla Linux kernel since 2.6.19
●   OCFS2
    ●   OCFS1 for database files only
    ●   First version in 2005
    ●   Part of the vanilla Linux kernel since 2.6.16
Features/Challenges/More
●   As similar as possible to local file
    systems
    ●   Internal setup
    ●   Management
●   Cluster awareness
    ●   Data integrity
    ●   Allocation
Framework
●   Bridges the gap between one-node and cluster
●   3 main components
    ●   Cluster-ware
    ●   Locking
    ●   Fencing
Framework GFS2 (I)
●   General-purpose cluster-ware
    ●   More flexible
    ●   More options/functions
    ●   More complexity
    ●   Configuration files in XML
●   Locking uses cluster framework too
●   system-config-cluster OR Conga OR vi & scp
Framework GFS2 (II)
# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="3" name="gfs2">
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="node0" nodeid="1" votes="1">
<fence/>
</clusternode>
<clusternode name="node1" nodeid="2" votes="1">
...
</cluster>
#
Framework OCFS2 (I)
●   Cluster-ware just for OCFS2
    ●   Less flexible
    ●   Fewer options/functions
    ●   Less complexity
    ●   Configuration file in ASCII
●   Locking uses cluster framework too
●   ocfs2console OR vi & scp
Framework OCFS2 (II)
# cat /etc/ocfs2/cluster.conf
node:
      ip_port = 7777
      ip_address = 192.168.0.1
      number = 0
      name = node0
      cluster = ocfs2
...
cluster:
      node_count = 2
#
Locking
●   Distributed Lock Manager (DLM)
●   Based on VMS-DLM
●   Lock modes
    ●   Exclusive Lock (EX)
    ●   Protected Read (PR)
    ●   No Lock (NL)
    ●   Concurrent Write Lock (CW) – GFS2 only
    ●   Concurrent Read Lock (CR) – GFS2 only
    ●   Protected Write (PW) – GFS2 only
Locking - Compatibility
                 Existing Lock
Requested Lock   NL       CR     CW    PR    PW    EX
NL               Yes      Yes    Yes   Yes   Yes   Yes
CR               Yes      Yes    Yes   Yes   Yes   No
CW               Yes      Yes    Yes   No    No    No
PR               Yes      Yes    No    Yes   No    No
PW               Yes      Yes    No    No    No    No
EX               Yes      No     No    No    No    No
Fencing
●   Separation of host and storage
    ●   Power Fencing
        –   Power switch, e.g. APC
        –   Server side, e.g. IPMI, iLO
        –   Useful in other scenarios
        –   Post-mortem more difficult
    ●   I/O fencing
        –   SAN switch, e.g. Brocade, Qlogic
        –   Possible to investigate “unhealthy” server
Fencing - GFS2
●   Both fencing methods
●   Part of cluster configuration
●   Cascading possible
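A minimal sketch of how cascading fencing could look in cluster.conf (fence_apc and fence_brocade are real fence agents; the device names, addresses, ports and credentials below are made up). The fence daemon tries the methods in the order listed:

<clusternode name="node0" nodeid="1" votes="1">
  <fence>
    <method name="1">
      <device name="apc1" port="1"/>
    </method>
    <method name="2">
      <device name="brocade1" port="5"/>
    </method>
  </fence>
</clusternode>
<fencedevices>
  <fencedevice agent="fence_apc" name="apc1" ipaddr="192.168.0.10" login="apc" passwd="apc"/>
  <fencedevice agent="fence_brocade" name="brocade1" ipaddr="192.168.0.11" login="admin" passwd="secret"/>
</fencedevices>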
Fencing - OCFS2
●   Only power fencing
    ●   Only self fencing
GFS2 – Internals (I)
●   Superblock
    ●   Starts at block 128
    ●   Expected data + cluster information
    ●   Pointers to master and root directory
●   Resource groups
    ●   Comparable to the cylinder groups of traditional
        Unix file systems
    ●   Allocatable from different cluster nodes → defines
        the locking granularity
GFS2 – Internals (II)
●   Master directory
    ●   Contains meta-data, e.g. journal index, quota, ...
    ●   Not visible to ls and friends
    ●   File system unique and cluster node specific files
●   Journaling file system
    ●   One journal per cluster node
    ●   Each journal accessible by all nodes (recovery)
GFS2 – Internals (III)
●   Inode/Dinode
    ●   Usual information, e.g. owner, mode, time stamp
    ●   Pointers to blocks: either data or pointer blocks
    ●   Only one level of indirection
    ●   “stuffing”
●   Directory management via Extendible Hashing
●   Meta file statfs
    ●   Backs the statfs() system call
    ●   Tuning via sysfs
GFS2 – Internals (IV)
●   Meta files
    ●   jindex directory containing the journals
        –   journalX
    ●   rindex Resource group index
    ●   quota
    ●   per_node directory containing node specific files
GFS2 – what else
●   Extended attributes xattr
●   ACLs
●   Local mode = one node access
OCFS2 – Internals (I)
●   Superblock
    ●   Starts at block 3 (1+2 for OCFS1)
    ●   Expected data + cluster information
    ●   Pointers to master and root directory
    ●   Up to 6 backups
        –   at pre-defined offset
        –   at 2^n Gbyte, n=0,2,4,6,8,10
●   Cluster groups
    ●   Comparable to the cylinder groups of traditional
        Unix file systems
OCFS2 – Internals (II)
●   Master or system directory
    ●   Contains meta-data, e.g. journal index, quota, ...
    ●   Not visible to ls and friends
    ●   File system unique and cluster node specific files
●   Journaling file system
    ●   One journal per cluster node
    ●   Each journal accessible by all nodes (recovery)
OCFS2 – Internals (III)
●   Inode
    ●   Usual information, e.g. owner, mode, time stamp
    ●   Pointers to blocks: either data or pointer blocks
    ●   Only one level of indirection
●   global_inode_alloc
    ●   Global meta data file
    ●   inode_alloc node specific counterpart
●   slot_map
    ●   Global meta data file
    ●   Active cluster nodes
OCFS2 – Internals (IV)
●   orphan_dir
    ●   Local meta data file
    ●   Cluster-aware deletion of files in use
●   truncate_log
    ●   Local meta data file
    ●   Deletion cache
OCFS2 – what else
●   Two versions: 1.2 and 1.4
    ●   Mount compatible
    ●   Framework not network compatible
    ●   New features disabled by default
●   For 1.4:
    ●   Extended attributes xattr
    ●   Inode-based snapshotting
    ●   Preallocation
File system management
●   Known/expected tools + cluster details
    ●   mkfs
    ●   mount/umount
    ●   fsck
●   File system specific tools
    ●   gfs2_XXXX
    ●   tunefs.ocfs2, debugfs.ocfs2
GFS2 management (I)
●   File system creation needs additional
    information
    ●   Cluster name
    ●   Unique file system identifier (string)
    ●   Optional:
        –   Locking mode to be used
        –   Number of journals
    ●   Tuning by changing default size for journals,
        resource groups, ...
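As an illustration, a file system for the two-node cluster "gfs2" from the earlier cluster.conf could be created like this (the device path and file system name are made up; -t takes clustername:fsname, -j the number of journals):

# mkfs.gfs2 -p lock_dlm -t gfs2:myfs -j 2 /dev/vg0/lv0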
GFS2 management (II)
●   Mount/umount
    ●   No real syntax surprises
    ●   First node checks all journals
    ●   Enabling ACL, quota, single node mode
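A sketch of typical mount invocations (device and mount point are made up; acl, quota=on and lockproto=lock_nolock are documented GFS2 mount options):

# mount -t gfs2 -o acl,quota=on /dev/vg0/lv0 /mnt/gfs2
# mount -t gfs2 -o lockproto=lock_nolock /dev/vg0/lv0 /mnt/gfs2   (single node mode, no cluster locking)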
GFS2 management (III)
●   File system check
    ●   Journal recovery of node X by node Y
    ●   Done by one node
    ●   File system must be offline on all other nodes
    ●   Known phases
        –   Journals
        –   Meta data
        –   References: data blocks, inodes
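With the file system unmounted on every node, a check could look like this (device made up; -y answers all questions with yes):

# fsck.gfs2 -y /dev/vg0/lv0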
GFS2 tuning (I)
●   gfs2_tool
    ●   Most powerful
        –   Display superblock
        –   Change superblock settings (locking mode, cluster name)
        –   List meta data
        –   freeze/unfreeze file system
        –   Special attributes, e.g. appendonly, noatime
    ●   Requires the file system to be online (mostly)
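A few illustrative gfs2_tool calls (paths made up; the exact sub-command set varies slightly between versions):

# gfs2_tool sb /dev/vg0/lv0 all              (display the superblock)
# gfs2_tool sb /dev/vg0/lv0 table gfs2:newfs (change the lock table – file system unmounted)
# gfs2_tool freeze /mnt/gfs2                 (quiesce the file system)
# gfs2_tool unfreeze /mnt/gfs2
# gfs2_tool setflag appendonly /mnt/gfs2/log (set a special attribute)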
GFS2 tuning (II)
●   gfs2_edit
    ●   Logical extension of gfs2_tool
    ●   More details, e.g. node-specific meta data, block
        level
●   gfs2_jadd
    ●   Different sizes possible
    ●   No deletion possible
    ●   Can cause data space shortage
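For example, adding journals to a mounted file system (mount point made up; -j gives the number of journals to add, -J their size in MB):

# gfs2_jadd -j 1 /mnt/gfs2
# gfs2_jadd -j 2 -J 64 /mnt/gfs2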
GFS2 tuning (III)
●   gfs2_grow
    ●   Needs space in meta directory
    ●   Online only
    ●   No shrinking
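A typical grow sequence on top of CLVM might look like this (volume names made up; gfs2_grow fills whatever space the underlying device gained):

# lvextend -L +10G /dev/vg0/lv0
# gfs2_grow /mnt/gfs2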
OCFS2 management (I)
●   File system creation
    ●   No additional information needed
    ●   Tuning by optional parameters
●   Mount/umount
    ●   No real syntax surprises
    ●   First node checks all journals
    ●   Enabling ACL, quota, single node mode
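A minimal OCFS2 creation and mount could look like this (device, label and mount point are made up; -N sets the number of node slots, i.e. journals):

# mkfs.ocfs2 -L myvol -N 4 /dev/sdb1
# mount -t ocfs2 /dev/sdb1 /mnt/ocfs2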
OCFS2 management (II)
●   File system check
    ●   Journal recovery of node X by node Y
    ●   Done by one node
    ●   File system must be offline on all other nodes
    ●   Fixed offset of superblock backup handy
    ●   Known phases
        –   Journals
        –   Meta data
        –   References: data blocks, inodes
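Illustrative checks (device made up; -y answers yes, -r picks one of the backup superblocks at the fixed offsets mentioned earlier):

# fsck.ocfs2 -y /dev/sdb1
# fsck.ocfs2 -r 2 /dev/sdb1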
OCFS2 tuning (I)
●   tunefs.ocfs2
    ●   Display/change file system label
    ●   Display/change number of journals
    ●   Change journal setup, e.g. size
    ●   Grow file system (no shrinking)
    ●   Create backup of superblock
    ●   Display/enable/disable specific file system features
        –   Sparse files
        –   “stuffed” inodes
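Some illustrative tunefs.ocfs2 calls (device made up; option spellings can differ between ocfs2-tools versions):

# tunefs.ocfs2 -L newlabel /dev/sdb1            (change the label)
# tunefs.ocfs2 -N 8 /dev/sdb1                   (change the number of node slots)
# tunefs.ocfs2 -S /dev/sdb1                     (grow to fill the device)
# tunefs.ocfs2 --backup-super /dev/sdb1         (create superblock backups)
# tunefs.ocfs2 --fs-features=sparse /dev/sdb1   (enable sparse file support)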
OCFS2 tuning (II)
●   debugfs.ocfs2
    ●   Display file system settings, e.g. superblock
    ●   Display inode information
    ●   Access meta data files
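A few read-only queries as a sketch (device made up; -R runs a single request, system files live under //):

# debugfs.ocfs2 -R "stats" /dev/sdb1             (superblock settings)
# debugfs.ocfs2 -R "ls //" /dev/sdb1             (list the system directory)
# debugfs.ocfs2 -R "stat //slot_map" /dev/sdb1   (inspect a meta data file)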
Volume manager
●   Necessary to handle more than one LUN/partition
●   Cluster-aware
●   Bridges the feature gap, e.g. volume-based snapshotting
●   CLVM
●   EVMS – OCFS2 only
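With CLVM, cluster awareness is switched on roughly like this (volume group name made up; clvmd must be running on all nodes):

# lvmconf --enable-cluster    (sets locking_type = 3 in lvm.conf)
# vgchange -cy vg0            (mark the volume group as clustered)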
Key data - comparison

                              GFS2                               OCFS2
Maximum # of cluster nodes    Supported: 16 (theoretical: 256)   256
Journaling                    Yes                                Yes
Cluster-less/local mode       Yes                                Yes
Maximum file system size      25 TB (theoretical: 8 EB)          16 TB (theoretical: 4 EB)
Maximum file size             25 TB (theoretical: 8 EB)          16 TB (theoretical: 4 EB)
POSIX ACLs                    Yes                                Yes
Grow-able                     Yes (online only)                  Yes (online and offline)
Shrinkable                    No                                 No
Quota                         Yes                                Yes
O_DIRECT                      On file level                      Yes
Extended attributes           Yes                                Yes
Maximum file name length      255                                255
File system snapshots         No                                 No
Summary
●   GFS2 has a longer history than OCFS2
●   OCFS2 setup is simpler and easier to maintain
●   GFS2 setup is more flexible and powerful
●   OCFS2 is catching up with GFS2
●   The choice often depends on the Linux vendor
References
http://sourceware.org/cluster/gfs/
http://www.redhat.com/gfs/
http://oss.oracle.com/projects/ocfs2/
http://sources.redhat.com/cluster/wiki/
http://sourceware.org/lvm2/
http://evms.sourceforge.net/
Thank you!
