Failover Cluster Troubleshooting 10.08.2011 Hakan YÜKSEL hakan.yuksel@turkiyefinans.com.tr http://yukselis.wordpress.com
Agenda
Concepts, Requirements, Architecture, Log Management, ...
Quorum Model
Troubleshooting
Questions and Answers
Storage: You must use shared storage that is compatible with Windows Server 2008 R2
Network adapters and cable (for network communication): The network hardware, like other components in the failover cluster solution, must be marked as "Certified for Windows Server 2008 R2." If you use iSCSI, your network adapters should be dedicated to either network communication or iSCSI, not both
Account for administering the cluster: When you first create a cluster or add servers to it, you must be logged on to the domain with an account that has administrator rights and permissions on all servers in that cluster. The account does not need to be a Domain Admins account—it can be a Domain Users account that is in the Administrators group on each clustered server. In addition, if the account is not a Domain Admins account, the account (or the group that the account is a member of) must be delegated Create Computer Objects and Read All Properties permissions in the domain
Clustering can be enabled on servers running Standard Edition
SCSI-3 commands: Persistent Reservations (PRs) required
Basic GPT and MBR disks supported
Multipath IO (MPIO) recommended
Cluster Validate
Warns when the requirements are not met
Runs all checks on the servers and storage that make up the cluster
Must be run after every change
Create a new cluster
Add a node, disk, or network
Update system software (drivers, firmware, service packs, MPIO)
Change any component in your solution
It’s the very first thing you do! http://technet.microsoft.com/en-us/library/cc732035(WS.10).aspx#BKMK_understanding_tests
Quorum and Majority Node Set
Windows Server 2008 introduces a new quorum model (Node and Disk Majority). The quorum disk is used a little differently this time: it counts as one vote alongside the node votes.
Majority Node Set (MNS) is a democratic system. In the quorum disk model there is only one vote, and whichever node owns it owns the cluster; in MNS, the majority owns the cluster. For example, if a split-brain scenario occurs in a 5-node cluster, each node checks how many nodes in total it can communicate with. If a node can communicate with two other nodes, those 3 nodes form a majority of the 5 and take ownership of the cluster. The other two nodes recognize that they are in the minority and assume that the other 3 nodes can communicate with each other.
In a split-brain scenario in a 2003 cluster, applications ran actively on whichever node owned the quorum disk; whether clients could reach that node made no difference.
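The five-node split-brain example can be sketched as a simple majority check. This is a minimal illustration, not the cluster service's actual algorithm; the function name `has_majority` is hypothetical, and the count of reachable nodes is assumed to include the node doing the counting.

```python
def has_majority(reachable_nodes: int, total_nodes: int) -> bool:
    """Return True if a partition holds more than half of all node votes.

    reachable_nodes includes the node itself (assumption for this sketch).
    """
    return reachable_nodes > total_nodes // 2


if __name__ == "__main__":
    # Five-node cluster split into a group of three and a group of two.
    total = 5
    for partition in (3, 2):
        outcome = "keeps the cluster" if has_majority(partition, total) else "stands down"
        print(f"{partition} of {total} nodes: {outcome}")
```

Note that an even split (for example 2 of 4) gives neither side a majority, which is one reason a witness vote is useful in clusters with an even number of nodes.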

Webcast - Failover Cluster Troubleshooting

  • 28. Nodes (1 each), Disk Witness (1 max), File Share Witness (1 max)
  • 29. 4 Quorum Types: Node Majority; Node and File Share Majority; Disk Only (not recommended); Node and Disk Majority
  • 31. No Majority: Disk Only is not recommended, because of the disk subsystem’s single point of failure
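The vote counting behind the majority-based quorum types can be sketched as follows. This is an illustrative model only: the class `QuorumConfig` and its method names are hypothetical, the witness is assumed to be either a disk or a file share (one vote at most), and the Disk Only model is not covered because it does not use a majority of votes.

```python
from dataclasses import dataclass


@dataclass
class QuorumConfig:
    """Hypothetical model of a majority-based quorum configuration."""
    nodes: int             # each node carries one vote
    witness: bool = False  # disk or file share witness adds at most one vote

    def total_votes(self) -> int:
        return self.nodes + (1 if self.witness else 0)

    def votes_needed(self) -> int:
        # A strict majority: more than half of all votes.
        return self.total_votes() // 2 + 1

    def tolerated_losses(self) -> int:
        # Votes that can disappear while the cluster still holds quorum.
        return self.total_votes() - self.votes_needed()


if __name__ == "__main__":
    for cfg in (QuorumConfig(nodes=4), QuorumConfig(nodes=4, witness=True)):
        print(cfg, cfg.total_votes(), cfg.votes_needed(), cfg.tolerated_losses())
```

With four nodes and no witness, only one vote can be lost; adding a witness makes the total odd, so the same cluster survives the loss of two votes.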
  • 33. The File Share Witness does not hold a copy of the cluster database (clusdb); it keeps only cluster versioning information in a witness.log file. When the computer is started, the Cluster Disk Driver (Clusdisk.sys) reads the following local registry key to obtain a list of the signatures of the shared disks under cluster management: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusDisk\Parameters\Signatures. Recommendation: the private network should carry heartbeat traffic only; the public network should be mixed.
  • 35. The Preferred Owners list decides which node a group moves to.
  • 36. The Possible Owners list decides which nodes a resource can and cannot move to.
  • 37. All resources must have the same owners.
  • 38. "Affect the group": if the resource fails, the group fails over.
  • 39. For disks, "affect the group" is selected by default.
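The interplay of the two owner lists can be sketched as a small selection function. This is a simplified illustration under stated assumptions, not the cluster service's real placement logic: the function name `choose_failover_target` and the node names are hypothetical, and the fallback to the first eligible node is an assumption made for the example.

```python
from typing import Iterable, Optional, Sequence


def choose_failover_target(
    preferred_owners: Sequence[str],
    possible_owners: Iterable[str],
    online_nodes: Sequence[str],
    current_owner: str,
) -> Optional[str]:
    """Pick a node for a failed group.

    Possible owners act as a filter (where the group may run);
    preferred owners act as an ordering (where it should go first).
    """
    allowed = set(possible_owners)
    candidates = [n for n in online_nodes if n in allowed and n != current_owner]
    for node in preferred_owners:
        if node in candidates:
            return node
    # No preferred owner is available: fall back to any eligible node.
    return candidates[0] if candidates else None


if __name__ == "__main__":
    target = choose_failover_target(
        preferred_owners=["NODE2", "NODE3"],
        possible_owners=["NODE1", "NODE2", "NODE3"],
        online_nodes=["NODE1", "NODE3"],
        current_owner="NODE1",
    )
    print(target)
```

If resources in the same group had different possible owners, the candidate list could be empty for some of them, which is why the slides ask for identical owners across all resources.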
  • 43. If "Turn on maintenance mode for this disk" is selected on a disk, the IsAlive and LooksAlive checks are not performed: the cluster does not check the disk's status or access the disk (for example, by listing a directory on it), and the cluster service assumes the disk is always online. The Resource Hosting Subsystem (RHS) conducts periodic health checks of all cluster resources to ensure they are functioning properly. This is accomplished by executing IsAlive and LooksAlive processes, which are specific to the type of resource.
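The effect of maintenance mode on health checking can be sketched as below. This is a deliberately simplified model, not how RHS is implemented: the `Resource` class and `poll` function are hypothetical, and running both checks in a single pass is an assumption made for brevity (the real checks run on their own schedules).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Resource:
    """Hypothetical cluster resource with its two health checks."""
    name: str
    looks_alive: Callable[[], bool]  # resource-specific check
    is_alive: Callable[[], bool]     # resource-specific check
    maintenance: bool = False


def poll(resource: Resource) -> str:
    """Return the state the cluster would record after one health poll."""
    if resource.maintenance:
        # In maintenance mode neither check runs; the resource is assumed online.
        return "online (assumed)"
    if resource.looks_alive() and resource.is_alive():
        return "online"
    return "failed"


if __name__ == "__main__":
    broken_disk = Resource("Disk Q:", looks_alive=lambda: False, is_alive=lambda: False)
    print(poll(broken_disk))
    broken_disk.maintenance = True
    print(poll(broken_disk))
```

The second poll reports the disk as online even though both checks would fail, which is exactly the behavior to keep in mind when a disk is left in maintenance mode by mistake.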
  • 44. Failover Process: When 2 nodes cannot reach each other, they try to access the quorum disk; this is called the arbitration process. Clusdisk.sys manages access so that the two nodes cannot both access the disks. With the MNS architecture, quorum information is maintained through registry replication. These files can be accessed under \Windows\system32\config. When the cluster starts, the clusdb file is loaded from the registry and the cluster begins operating. This configuration file contains the information about which disks can be accessed.
  • 45. Cluster Components
      · OBJECT MANAGER (OM, clussvc.exe): holds the current configuration.
      · HOST MANAGER (HM): handles adding and removing hosts and detecting failed nodes, working together with the other modules. Once the cluster is up, it talks over port 3343 to whichever node answers.
      · MEMBERSHIP MANAGER (MM): writes locally under HKLM (clussvc) and then forwards the data to the Object Manager, which loads it into RAM. It records join and evict events and handles information sharing.
      · GLOBAL UPDATE MANAGER (GUM): responsible for replicating all changes and for all updates. For example, during a backup it notifies the other nodes that VSS is running, which prevents changes from being made on those nodes.
      · RESOURCE CONTROL MANAGER (RCM): works with rhs.exe and is responsible for dependencies. The most important module :P
      · TOPOLOGY MANAGER
      · NETWORK MANAGER (NM) / INTERFACE MANAGER (IM): NIC up / fail.
      · DATABASE MANAGER (DM): responsible for replication, which it performs through the Global Update Manager. The DM keeps the log. The clusdb registry hive is loaded.
      · QUORUM MANAGER: determines whether quorum has been formed and which quorum model is in use. It is responsible for choosing the correct replica. It can talk to the RCM: when quorum cannot be formed, it asks the RCM, in effect, "we are about to form quorum; can you give us a vote? Are we one short?"
      · SECURITY MANAGER: encryption and Kerberos relationships.
  • 46. Microsoft Failover Cluster Virtual Adapter: In clustered environments Microsoft creates an interface named "Microsoft Failover Cluster Virtual Adapter". It is a hidden interface that represents the NetFT (Network Fault Tolerant) driver, carries communication between cluster nodes, and provides redundancy for the heartbeat. This interface binds to an existing interface; traffic from SMB to the SAN is carried over this adapter. NetFT is visible in ipconfig /all and assigns itself an APIPA address (169.254.1.2). No data is actually transferred over this IP; because it is bound to a physical adapter, the utilization shows up in Task Manager.
  • 47. Failover Cluster Installation Steps
      · Failover Cluster Prerequisites: Establish a Network Naming Convention; TCP/IP Network Configuration (Public Network, Storage Network, Heartbeat Network)
      · Procedures
      · Prepare the Failover Cluster: Create a Domain User Account; Add Nodes to an Active Directory Domain; Expose Storage to Cluster Nodes; Install the Failover Cluster Feature; Run Cluster Validation
      · Create and Configure the Failover Cluster: Create a Cluster; Set Cluster Network Properties and Apply Naming Convention
      · Create a Highly Available Service -> Create a Highly Available iSCSI Target: Configuring Windows Firewall for Microsoft iSCSI Software Target; Installing the Microsoft iSCSI Software Target; Create the Failover iSCSI Target Resource Group; Create an iSCSI Target in the Microsoft iSCSI Target MMC; Create and Configure Virtual Disks; Connect Initiators
      · Testing Your Failover Cluster Configuration
      · Server Core Installation Option of Windows Server 2008 Step-by-Step Guide: http://technet2.microsoft.com/windowsserver2008/en/library/47a23a74-e13c-46de-8d30-ad0afb1eaffc1033.mspx?mfr=true
  • 48. Troubleshooting: Reviewing cluster events; reviewing hardware events; using the Validate a Configuration Wizard; reviewing storage/SAN events. Troubleshooting methodologies for cluster issues, whether in Windows 2003 or Windows 2008, are fairly similar. Most typical cluster support issues fall under the following categories:
      · Cluster Service fails to start.
      · Cluster resources are in a failed state or fail to come online.
      · Determining the root cause of a cluster failure.
      · Initial configuration of the cluster.
    The Win 2003 legacy CLUSTER.LOG text file no longer exists. In Win 2008 the cluster log is handled by Event Tracing for Windows (ETW). This is the same logging infrastructure that handles events for other components you are already familiar with, such as the System or Application event logs you view in Event Viewer.
      · Command line: C:\>cluster log /gen
      · PowerShell: C:\PS> Get-ClusterLog
      · ForceQuorum: net start clussvc /forcequorum (or /fq)
  • 50. Cluster Events: Recent Cluster Events shows the events from the last 24 hours. Monitoring cluster events: fully featured Failover Cluster Management Packs. Cluster logging level: Set-ClusterLog -Level 3
  • 51. Configuring Debug Logging: Logging is enabled by default. Log files are stored as .ETL in %WinDir%\System32\winevt\Logs\Microsoft-Windows-FailoverClustering. The default log size is 100 MB (Set-ClusterLog -Size 100). The default log level is 3 (Set-ClusterLog -Level 3). Up to three log files are kept, which means log history can be kept for up to three reboots. The number of logs can be modified via the registry: HKLM\Software\Microsoft\Windows\CurrentVersion\WINEVT\Channels\Microsoft-Windows-FailoverClustering/Diagnostic\FileMax. Changing the default can have a performance impact.
  • 52. Extended PowerShell Commands: http://blogs.technet.com/b/josebda/archive/2010/09/19/mapping-cluster-exe-commands-to-windows-powershell-cmdlets-for-failover-clusters-extended-edition.aspx
  • 53. Problems Connecting to Cluster Nodes: WMI is used by the 'Create Cluster Wizard', 'Validate a Configuration Wizard', and 'Add Node Wizard', so any of the following messages and warnings could be due to WMI issues:
      · "RPC Server Unavailable" error.
      · Access is Denied.
      · The computer 'Node1' could not be reached.
      · Failed to retrieve the maximum number of nodes for '{0}'.
      · The computer 'Node1.contoso.com' does not have the Failover Clustering feature installed. Use Server Manager to install the feature on this computer. (Note: first confirm that you have installed the Failover Clustering feature on this node.)
    Troubleshooting steps:
      1) Ensure it is not a DNS issue
      2) Check that WMI is running on the node (wbemtest)
      3) Check your firewall settings
      4) Reboot the node
      5) Rebuild a corrupt WMI repository: in the Services console, manually stop the WMI service to ensure that dependent services are stopped; start the WMI service again; launch an elevated CMD or PowerShell prompt; run winmgmt /salvagerepository
      6) Patch WMI for performance improvements (974930)
  • 56. The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. Windows Server 2003: http://support.microsoft.com/kb/250355#appliesto
  • 57. Cluster Log Error Meanings
      · status 170: "The requested resource is in use." This could be related to Persistent Reservation problems; it can also be caused by MPIO, fibre/HBA drivers, or some type of lower-level file system driver or software such as anti-virus, quota management, or an open file agent for backup software:
        00000c94.000008d4::<date and time>.585 INFO Physical Disk <Disk Q:>: [DiskArb] Issuing Reserve on signature 33af636f.
        00000c94.000008d4::<date and time>.616 ERR Physical Disk <Disk Q:>: [DiskArb] Reserve completed, status 170.
        00000c94.000008d4::<date and time>.616 INFO Physical Disk <Disk Q:>: [DiskArb] Arbitrate returned status 170.
      · status 5: usually a permissions-related problem. In this case the Cluster Service Account (CSA) username/password were not synchronized between the nodes. It can also happen if the cluster loses its Secure Channel connection to the DC, which the CSA needs in order to be authenticated. Another situation in which it can occur is when one of the domain Group Policy Objects (GPO) or one of the Local Policy Objects is missing a User Rights Assignment that the CSA needs to function properly:
        000014a0.00001460::::<date and time>.629 WARN [JOIN] JoinVersion data for sponsor <Cluster Name> is invalid, status 5.
        000014a0.000017d0::::<date and time>.629 WARN [JOIN] Unable to get join version data from sponsor 10.7.47.100 using NTLM package, status 5.
      · status 1117: ERROR_IO_DEVICE (the request could not be performed because of an I/O device error), seen when Event ID 1123 occurs:
        000015a0.000014a8::<date and time>.511 WARN IP Address <IP Address resource name>: IP Interface 4 (address 10.101.160.65) failed LooksAlive check, status 1117, address 0x10119e0, instance 0xf74d6fb8.
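When scanning a generated cluster log for these codes, a small helper can pull out the severity and status from each line. The sketch below is an illustration only: the function name `explain_status`, the regular expression, and the short descriptions in `STATUS_MEANINGS` are assumptions based on the three sample codes above, and real log lines may use other layouts.

```python
import re
from typing import Optional, Tuple

# Short descriptions for the three status codes discussed above.
STATUS_MEANINGS = {
    5: "access denied (usually a permissions problem)",
    170: "the requested resource is in use",
    1117: "I/O device error (ERROR_IO_DEVICE)",
}

# Severity keyword followed, later on the same line, by "status <number>".
STATUS_RE = re.compile(r"\b(INFO|WARN|ERR)\b.*?\bstatus (\d+)")


def explain_status(line: str) -> Optional[Tuple[str, int, str]]:
    """Return (severity, status code, meaning), or None if no status is present."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, STATUS_MEANINGS.get(code, "unknown status")


if __name__ == "__main__":
    sample = (
        "00000c94.000008d4::<date and time>.616 ERR Physical Disk <Disk Q:>: "
        "[DiskArb] Reserve completed, status 170."
    )
    print(explain_status(sample))
```

Lines without a status value, such as the "Issuing Reserve" entry, return None and can simply be skipped.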

Editor's notes

  1. 1 min
  2. What is a quorum? To put it simply, a quorum is the cluster’s configuration database. The database resides in a file named \MSCS\quolog.log. The quorum is sometimes also referred to as the quorum log. It tells the cluster which node should be active.