Oracle Clusterware Node Management and Voting Disks
1. Node Management in Oracle Clusterware
Markus Michalewicz
Senior Principal Product Manager, Oracle RAC and Oracle RAC One Node
2. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.
Agenda
• Oracle Clusterware 11.2.0.1 Processes
• Node Monitoring Basics
• Node Eviction Basics
• Re-bootless Node Fencing (restart)
• Advanced Node Management
• The Corner Cases
• More Information / Q&A
3. Oracle Clusterware 11g Rel. 2 Processes
Most are not important for node management – focus!
[Diagram: Oracle Clusterware process stack; the processes relevant for node management are OHASD, CSSD (resource ora.cssd), and CSSDMONITOR (was: oprocd; resource ora.cssdmonitor)]
4. Node Monitoring Basics
Basic Hardware Layout Oracle Clusterware
Node management is hardware independent
[Diagram: three cluster nodes, each running CSSD; the nodes are connected to the public LAN, to each other via the private LAN (interconnect), and to the Voting Disk via the SAN network]
5. What does CSSD do?
CSSD monitors and evicts nodes
• Monitors nodes using 2 communication channels:
– Private Interconnect Network Heartbeat
– Voting Disk based communication Disk Heartbeat
• Evicts nodes (forcibly removes them from the cluster) depending on heartbeat feedback (failures)
Network Heartbeat
Interconnect basics
• Each node in the cluster is “pinged” every second
• Nodes must respond in css_misscount time (defaults to 30 secs.)
– Reducing the css_misscount time is generally not supported
• Network heartbeat failures will lead to node evictions (see the sketch below)
– CSSD log example: [date / time] [CSSD][1111902528]clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
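As a conceptual illustration only (not Oracle's actual implementation), a minimal Python sketch of the misscount logic: a polling thread tracks the seconds since the last network heartbeat from a peer, warns as the threshold approaches, and flags the peer for eviction once css_misscount (30 seconds by default) is exceeded.

```python
import time

CSS_MISSCOUNT = 30  # default network heartbeat timeout (css_misscount) in seconds

def poll_network_heartbeat(last_heartbeat_time, node_name, now=None):
    """Return True if the peer must be flagged for eviction.

    Conceptual sketch only: the real CSSD polling thread
    (clssnmPollingThread) warns as the timeout approaches and evicts
    once css_misscount seconds pass without a network heartbeat.
    """
    now = time.time() if now is None else now
    missed = now - last_heartbeat_time
    if missed >= CSS_MISSCOUNT:
        print(f"node {node_name}: network heartbeat lost, eviction required")
        return True
    if missed >= 0.75 * CSS_MISSCOUNT:
        print(f"node {node_name} at 75% heartbeat fatal, "
              f"removal in {CSS_MISSCOUNT - missed:.3f} seconds")
    return False
```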
6. Disk Heartbeat
Voting Disk basics – Part 1
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
• Nodes must receive a response in (long / short) diskTimeout time
– I/O errors indicate clear accessibility problems → the timeout is irrelevant in that case
• Disk heartbeat failures will lead to node evictions (see the sketch below)
– CSSD log example: … [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat: node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
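A hedged sketch of the disk heartbeat check, assuming a simple sequence counter per node: each node writes an incrementing count into its voting-disk block every second, peers read it back, and a node is considered down when the counter stops advancing within the disk timeout (the 200-second value below is illustrative).

```python
import time

LONG_DISK_TIMEOUT = 200  # illustrative; long disk timeout in seconds

def check_disk_heartbeat(last_seq_seen, last_change_time, current_seq, now=None):
    """Return (node_down, last_seq_seen, last_change_time).

    Conceptual sketch: each node increments a write count in its own
    voting-disk block every second; peers read it back. If the counter
    stops advancing for longer than the disk timeout, the node is
    considered down (compare the clssnmReadDskHeartbeat trace above).
    """
    now = time.time() if now is None else now
    if current_seq != last_seq_seen:
        return False, current_seq, now            # heartbeat advanced
    node_down = (now - last_change_time) > LONG_DISK_TIMEOUT
    return node_down, last_seq_seen, last_change_time
```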
Voting Disk Structure
Voting Disk basics – Part 2
• Voting Disks contain dynamic and static data:
– Dynamic data: disk heartbeat logging
– Static data: information about the nodes in the cluster
• With 11.2.0.1 Voting Disks got an “identity”:
– E.g. Voting Disk serial number: [GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
• Voting Disks must therefore not be copied using “dd” or “cp” anymore
[Diagram: Voting Disk structure – static node information and dynamic disk heartbeat logging]
7. “Simple Majority Rule”
Voting Disk basics – Part 3
• Oracle supports redundant Voting Disks for disk failure protection
• “Simple Majority Rule” applies:
– Each node must “see” the simple majority of configured Voting Disks
at all times in order not to be evicted (to remain in the cluster)
– Required number of accessible voting disks: trunc(n/2) + 1, with n = number of voting disks configured and n >= 1 (see the worked sketch below)
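A minimal worked sketch of the rule (illustrative only): compute the required majority for a given number of configured voting disks and check whether a node still sees enough of them to stay in the cluster.

```python
def required_voting_disks(configured):
    """Simple majority: trunc(n/2) + 1 of the configured voting disks."""
    assert configured >= 1
    return configured // 2 + 1

def node_survives(visible, configured):
    """A node stays in the cluster only while it sees a majority of disks."""
    return visible >= required_voting_disks(configured)

# Worked examples:
# 3 configured disks -> majority is 2; a node seeing only 1 disk is evicted.
assert required_voting_disks(3) == 2
assert node_survives(2, 3) and not node_survives(1, 3)
# 2 configured disks -> majority is 2; losing either disk evicts the node.
assert required_voting_disks(2) == 2
assert not node_survives(1, 2)
```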
Insertion 1: “Simple Majority Rule”…
… In extended Oracle clusters
• See http://www.oracle.com/goto/rac
– “Using standard NFS to support a third voting file for extended cluster configurations” (PDF)
• Same principles apply
• Voting Disks are just geographically dispersed
8. Insertion 2: Voting Disk in Oracle ASM
The way Voting Disks are stored does not change the way they are used
[GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
Located 3 voting disk(s).
• Oracle ASM auto-creates 1/3/5 Voting Files
– Based on External/Normal/High redundancy and on the Failure Groups in the Disk Group
– By default there is one failure group per disk
– ASM will enforce the required number of disks
– New failure group type: Quorum Failgroup
Node Eviction Basics
9. Why are nodes evicted?
To prevent worse things from happening…
• Evicting (fencing) nodes is a preventive measure (a good thing)!
• Nodes are evicted to prevent consequences of a split brain:
– Shared data must not be written by independently operating nodes
– The easiest way to prevent this is to forcibly remove a node from the cluster
How are nodes evicted in general?
“STONITH like” or node eviction basics – Part 1
• Once it is determined that a node needs to be evicted,
– A “kill request” is sent to the respective node(s)
– Using all (remaining) communication channels
• A node (CSSD) is requested to “kill itself” – hence “STONITH-like”
– Classic “STONITH” foresees that a remote node kills the node to be evicted
10. How are nodes evicted?
EXAMPLE: Heartbeat failure
• The network heartbeat between nodes has failed
– It is determined which nodes can still talk to each other
– A “kill request” is sent to the node(s) to be evicted
Using all (remaining) communication channels → here the Voting Disk(s)
• A node is requested to “kill itself”; executor: typically CSSD
How can nodes be evicted?
Using IPMI / Node eviction basics – Part 2
• Oracle Clusterware 11.2.0.1 and later supports IPMI (optional)
– Intelligent Platform Management Interface (IPMI) drivers required
• IPMI allows remote-shutdown of nodes using additional hardware
– A Baseboard Management Controller (BMC) per cluster node is required
11. Insertion: Node Eviction Using IPMI
EXAMPLE: Heartbeat failure
• The network heartbeat between the nodes has failed
– It is determined which nodes can still talk to each other
– IPMI is used to remotely shutdown the node to be evicted
Which node is evicted?
Node eviction basics – Part 3
• Voting Disks and heartbeat communication are used to determine which node is evicted
• In a 2 node cluster, the node with the lowest node number should survive
• In an n-node cluster, the biggest sub-cluster should survive (votes based); see the sketch below
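A conceptual sketch (not Oracle's actual algorithm) of the survivor decision under the rules stated above: the largest sub-cluster of nodes that can still talk to each other wins; on a tie, the sub-cluster containing the lowest node number wins, which yields the 2-node behaviour described on this slide.

```python
def surviving_subcluster(subclusters):
    """Pick the sub-cluster that should survive a split.

    `subclusters` is a list of sets of node numbers, one set per group of
    nodes that can still talk to each other. Conceptual sketch: the
    biggest sub-cluster survives; on a tie, the sub-cluster containing
    the lowest node number survives.
    """
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))

# 2-node cluster split: the node with the lowest node number survives.
assert surviving_subcluster([{1}, {2}]) == {1}
# 3-node cluster where node 3 is isolated: the bigger sub-cluster survives.
assert surviving_subcluster([{1, 2}, {3}]) == {1, 2}
```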
12. Re-bootless Node Fencing (restart)
Re-bootless Node Fencing (restart)
Fence the cluster, do not reboot the node
• Until Oracle Clusterware 11.2.0.2, fencing meant “re-boot”
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
– Re-boots affect applications that might run on a node, but are not protected
– Customer requirement: prevent a reboot, just stop the cluster – implemented...
[Diagram: two nodes, each running a standalone application (App X / App Y), an Oracle RAC DB instance (Inst. 1 / Inst. 2), and CSSD]
13. Re-bootless Node Fencing (restart)
How it works
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
– Instead of fast re-booting the node, a graceful shutdown of the stack is attempted
• It starts with a failure – e.g. network heartbeat or interconnect failure
14. Re-bootless Node Fencing (restart)
How it works (continued)
• Then I/O issuing processes are killed; it is made sure that no I/O issuing process remains
– For a RAC DB, mainly the log writer and the database writer are of concern
• Once all I/O issuing processes are killed, the remaining processes are stopped
– IF the check for a successful kill of the I/O processes fails → reboot
15. Re-bootless Node Fencing (restart)
How it works (continued)
• Once all remaining processes are stopped, the stack stops itself with a “restart flag”
• OHASD will finally attempt to restart the stack after the graceful shutdown
16. Re-bootless Node Fencing (restart)
EXCEPTIONS
• With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless… (see the decision sketch below):
– IF the check for a successful kill of the IO processes fails → reboot
– IF CSSD gets killed during the operation → reboot
– IF cssdmonitor (oprocd replacement) is not scheduled → reboot
– IF the stack cannot be shutdown in “short_disk_timeout”-seconds → reboot
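Purely as a conceptual sketch of the decision flow described on these slides (not actual Clusterware code; the function names and the 27-second default are illustrative assumptions), the fencing logic can be summarised as: attempt a graceful stack shutdown and restart, and fall back to a node reboot whenever one of the exceptions applies.

```python
def fence_node(kill_io_processes, stop_remaining_processes,
               cssd_alive, cssdmonitor_scheduled,
               shutdown_seconds, short_disk_timeout=27):
    """Conceptual sketch of re-bootless fencing (11.2.0.2 behaviour).

    The arguments are illustrative stand-ins for the checks on these
    slides; the short_disk_timeout default of 27s is an assumption.
    Returns the action taken: 'restart stack' or 'reboot node'.
    """
    if not cssd_alive or not cssdmonitor_scheduled:
        return "reboot node"                    # CSSD killed / monitor not scheduled
    if not kill_io_processes():                 # e.g. log writer and DB writer must be gone
        return "reboot node"                    # I/O kill check failed
    if shutdown_seconds > short_disk_timeout:
        return "reboot node"                    # graceful stop took too long
    stop_remaining_processes()                  # stop the rest of the stack
    return "restart stack"                      # OHASD then restarts the stack

# Example: all checks pass, so the stack is restarted instead of the node.
assert fence_node(lambda: True, lambda: None, True, True, 10) == "restart stack"
```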
Advanced Node Management
17. Determine the Biggest Sub-Cluster
Voting Disk basics – Part 4
• Each node in the cluster is “pinged” every second (network heartbeat)
• Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
• In an n-node cluster, the biggest sub-cluster should survive (votes based)
18. Redundant Voting Disks – Why odd?
Voting Disk basics – Part 5
• Redundant Voting Disks → Oracle managed redundancy
• Assume for a moment that only 2 voting disks were supported…
• Advanced scenarios need to be considered
• Without the “Simple Majority Rule”, what would we do?
• Even with the “Simple Majority Rule” in place:
– Each node can see only one voting disk (fewer than the required majority of 2), which would lead to an eviction of all nodes
20. The Corner Cases
Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…
• A properly configured cluster with 3 voting disks as shown
• What happens if there is a storage network failure as shown (lost remote access)?
21. Case 1: Partial Failures in the Cluster
When somebody uses a pair of scissors in the wrong way…
• There will be no node eviction!
• IF storage mirroring is used (for data files), the respective solution must handle this case
• Covered in Oracle ASM 11.2.0.2:
– _asm_storagemaysplit = TRUE
– Backported to 11.1.0.7
Case 2: CSSD is stuck
CSSD cannot execute request
• A node is requested to “kill itself”
• BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
– CSSD failed for some reason
– CSSD is not scheduled within a certain margin
→ OCSSDMONITOR (was: oprocd) will take over and execute (see the watchdog sketch below)
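A hedged, minimal sketch of the watchdog idea behind cssdmonitor (not its actual implementation; the 2-second margin is an illustrative assumption): a separate process checks that CSSD is alive and has been scheduled recently, and executes the eviction itself if CSSD cannot.

```python
import time

SCHEDULING_MARGIN = 2.0  # illustrative margin in seconds

def cssdmonitor_check(cssd_alive, last_cssd_progress_time, evict_node, now=None):
    """Watchdog sketch: if CSSD died or has not been scheduled / made progress
    within the allowed margin, the monitor executes the eviction itself.
    `evict_node` is an illustrative callback (e.g. stop or reboot the node)."""
    now = time.time() if now is None else now
    if not cssd_alive or (now - last_cssd_progress_time) > SCHEDULING_MARGIN:
        evict_node()
        return True
    return False
```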
Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests
• Cluster members (e.g. Oracle RAC instances) can request Oracle Clusterware to kill a specific member of the cluster
• Oracle Clusterware will then attempt to kill the requested member
[Diagram: DB Instance 1 requests the kill of DB Instance 2; CSSD runs on both nodes]
23. Case 3: Node Eviction Escalation
Members of a cluster can escalate kill requests
• If the requested member kill is unsuccessful, a node eviction escalation can be issued, which leads to the eviction of the node on which the particular member currently resides; see the sketch below
24. [Diagram: after the escalation, only the node running DB Inst. 1 remains in the cluster]
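As a purely conceptual sketch (illustrative callback names, not an actual Oracle Clusterware API), the escalation path can be summarised as: try the member kill first, and escalate to evicting the hosting node only if that fails.

```python
def kill_member_with_escalation(member, kill_member, evict_node):
    """Conceptual sketch of member-kill escalation.

    `kill_member(member)` returns True on success; `evict_node(node)`
    evicts the node hosting the member. Both are illustrative callbacks,
    not an actual Oracle Clusterware API.
    """
    if kill_member(member):
        return "member killed"
    evict_node(member["node"])              # escalate: evict the hosting node
    return "node evicted (escalation)"

# Example: the member kill fails, so the hosting node is evicted.
result = kill_member_with_escalation(
    {"name": "DB Inst. 2", "node": 2},
    kill_member=lambda m: False,
    evict_node=lambda n: print(f"evicting node {n}"),
)
assert result == "node evicted (escalation)"
```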
More Information
25. More Information
• My Oracle Support Notes:
– ID 294430.1 - CSS Timeout Computation in Oracle Clusterware
– ID 395878.1 - Heartbeat/Voting/Quorum Related Timeout Configuration for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing, Panic and Reboot
• http://www.oracle.com/goto/clusterware
– Oracle Clusterware 11g Release 2 Technical Overview
• http://www.oracle.com/goto/asm
• http://www.oracle.com/goto/rac