3. Sun Confidential: Internal Only 3
Day 2 PM and Acknowledgements
• I have borrowed/stolen/copied* the following in this
presentation.
• Newisys decoder from Barry Wright
• HDT from Bernward Schwartz
• SGR from
http://panacea/twiki/bin/view/SGR/WebHome
4. Sun Confidential: Internal Only 4
SP Diags for V20/40z
• Not to be confused with “spdiag” Tool
• Bootable CD (nsv 2.2.0.6 or above required) or SP based
• Enable diagnostic boot in BIOS for bootable CD
• NSV installed on remote system and mounted
locally by NFS
5. Sun Confidential: Internal Only 5
SP Diags for V20/40z
● Install diags:
cp -r /mnt/cdrom/nsv_file /mnt/nsv/
cd /mnt/nsv/
unzip -a *.zip
chmod 777 /mnt/nsv/diags/NSV_version_number/scripts
chmod -R 755 /mnt/nsv/diags/NSV_version_number/mppc
Note: Ensure NFS is enabled on the server and that it can export the file system
sp add mount -r NFS_server_hostname:/directory_with_NSV_files -l /mnt
sp update diags -p /mnt/diags/DIAGS_version#
6. Sun Confidential: Internal Only 6
SP Diags for V20/40z
● diags start (for standalone)
● diags start -n (on-line nic,disk,mem)
● diags get state (confirm diags are loaded)
● diags get tests (list diagnostics tests)
● diags run tests -av
● diags run tests -av >/mnt/log/diags.log
● diags terminate
Ensure the diags, BIOS and driver versions are compatible
Diags will fail to run otherwise
7. Sun Confidential: Internal Only 7
SP Diags for V20/40z
• diags -h shows all syntax
• diags -a -v full test
• Bootable CD
> diags terminate -n
> diags start -n
> diags run tests -a -v >diags.out &
> tail -f diags.out
8. Sun Confidential: Internal Only 8
The “spdiag” Tool (Galaxy)
• SP based diagnostic
• Test i2c , voltage , fans , temp
• Stop IPMI first: /etc/init.d/ipmistack stop
• /usr/local/bin/spdiag 1 g4 i2ctst
• Reboot SP
9. Sun Confidential: Internal Only 9
PC Check
• Supplemental/Tools CD and now boot menu
• AMD based X2100,X2100M2,X2200M2 and all new
X4x40 platforms
• All Intel based platforms
• Monitor and keyboard
• Serial port
• Scripts , burn-in tests , loopback
10. Sun Confidential: Internal Only 10
PC Check
• Front Menu:
System Information menu
Advanced Diagnostics Tests
Immediate Burn-in Testing
Deferred Burn-in Testing
Create Diagnostic Partition
Show Results Summary
Print Results Report
11. Sun Confidential: Internal Only 11
PC Check
• Burn-in Testing:
> quick.tst - requires user input, no time-out
> noinput.tst – no user input, good first test
> full.tst – requires loopback & user input
• Command Line:
> Example: pccheck cpu.tst /BD
> pccheck /? - shows all flags
> pccheck suncsi.tst /IS /BD /KS /MH 30 /HMD 1m /HDD 1m /SD 5m
12. Sun Confidential: Internal Only 12
SUNvts
• What are you trying to test/replicate ?
• Local or bootable CD-ROM
• Galaxy 2.2 cd contains vts6.3
• GUI or command line
• Unsupported platforms:
> /opt/SUNWvts/lib/conf/platform.conf
> smbios | grep Product
> Boot with graphics head
> Edit tty boot: console=ttya,ttya-mode="9600,8,n,1,-"
14. Sun Confidential: Internal Only 14
HDT (Hardware Debug Tool)
• PLEASE USE WITH CAUTION !!!
• Will hang the host if OS running
• Reboot SP after use
http://panacea/twiki/bin/view/Products/Galaxydiag
• Additional tools:
• /usr/local/bin/collectHostStatus.sh
-nohdtl disables hdt test
• /usr/local/bin/collectDebugInfo.sh
15. Sun Confidential: Internal Only 15
Platform Specifics
• On G4:
> HDT uses some signals over the i2c bus => IPMI on the SP has
to be shut down. SP should be rebooted when done with hdt
diags.
> JTAG chain goes through all CPU modules => All slots must have
CPU or filler module inserted for HDT to work on G4
> Direct access to all CPU's, default is cpu 0
• Other Platforms:
> Only CPU0 in JTAG chain
> no i2c involved, only used for platform identification
From Bernward Schwarte presentation.
16. Sun Confidential: Internal Only 16
Getting Started & Cautions
• hdt or hdtl?
> Current hdt binary and some documentation at:
http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/SpBasedHdtDiag under Galaxy->Pre-OS-Diagnostics
> Copy to SP: scp hdt sunservice@<SPIP>:/coredump
> ssh sunservice@<SPIP> password: changeme
> cd /coredump (or check /usr/local/bin for the built in copy)
• Caution: All hdt commands stop CPU's. Some hdt commands will
reset/power-cycle system.
• All command line parameters are interpreted as hex values !
• ./hdt prints syntax of all available commands
• ./hdt -pd 0 18 0
• hdt leaves the CPU in HDT-mode when exiting; use the “-e” option to exit HDT-mode
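Because hdt reads every command line parameter as hexadecimal, it is easy to pass a decimal value by mistake. A minimal Python sketch of the conversion follows; the helper names are illustrative only and are not part of hdt.

```python
def hdt_arg_to_int(arg: str) -> int:
    """Interpret a parameter the way the slide says hdt does: as hex."""
    return int(arg, 16)


def int_to_hdt_arg(value: int) -> str:
    """Format a decimal value as the hex string to type on the hdt command line."""
    return format(value, "x")


# The arguments from the "./hdt -pd 0 18 0" example on this slide.
parsed = [hdt_arg_to_int(a) for a in "0 18 0".split()]
print(parsed)               # the middle argument is 0x18, not decimal 18
print(int_to_hdt_arg(24))   # decimal 24 has to be typed as "18"
```

In other words, a value taken from a decimal source should be converted before it is passed to hdt.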
17. Sun Confidential: Internal Only 17
Available Commands/Diagnostics
• Basics:
> Single HDT command: -h * Note: this is not -help
> Access io- and memory space: -mr, -mw, -ir, -iw
> Access CPU registers: -rd, -rr, -rw
> Single step: -hs
• Control:
> Reset : -xr [b c]
> Stop at reset: -xs [b c p]
> resource init: “-hi” : sets up HT routing and resources
> Power On/Off : -o [0 off, 1 cycle, 2 on]
> set CPU: -c G4 only
> Breakpoints: -bps -bpm -bpc
> exit: -e
18. Sun Confidential: Internal Only 18
Diagnostics
• Extended:
> PCI configuration space access -pr, -pw -pd, -ps
> “Dump” commands
– Machine check: -dm
– DIMM SPD: -dd
– CMOS: -dc
– SIO: -ds
– Flash: -df
> HT link testing: -a
– Powercycles, stops at reset vector, sets all HT links, warm
resets
19. Sun Confidential: Internal Only 19
HDT Not Working
> Depending on System state HDT can be non-functional
> To capture some system/error state:
– Reset system and stop at reset vector: hdt -xs b
– Init HT routing and PCI bridge enumeration: hdt -hi
– Dump Machine check and HT link status: hdt -dm -dl
hdtDiag: Galaxy/Thumper HDT Diagnostics, Version 0.7.0
-------------------------------------------------------
hdtDiag: Error, HDT command failed, no CFF cpu 0
hdtDiag: SysIdent: HDT access failing
hdtDiag: defaulting to G12X
20. Sun Confidential: Internal Only 20
HDT
• Check Versions 0.8.0 , 0.8.3 , 0.9.9, 1.3, 1.4.1 etc
• ./hdt -xs
• ./hdt -hi
• ./hdt -l -q or try ./hdt -l -a
• ./hdt -e
• Reboot SP
21. Sun Confidential: Internal Only 21
CSTH (Continuous System Telemetry Harness)
• Calls ipmitool to create a telemetry stream of:
> volt,temp,current,fans and PSU variables
● Collect data and submit for analysis to engineering:
● ./start-csth-ipmi <spname> <splogin> <sppasswd> [--interval <numsecs>]
● Example:
➢ ./start-csth-ipmi test-sp admin test.pass 60 &
➢ ./stop-csth-ipmi test-sp
23. Sun Confidential: Internal Only 23
HERD (Hardware Error Report Decode)
http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/HERD
• Hardware error report and decoding from mcelog or
via the command line with kernel 2.6.4 or above
• Installed as an RPM on top of SLES and Red Hat
• Provided by Sun Microsystems
• Will report errors to the messages file and the service
processor (if applicable)
• Same command line options as mcelog
• Must be run on the same host as the machine that
reported the errors when using the herd -e function.
24. Sun Confidential: Internal Only 24
HERD (Hardware Error Report Decode)
• Example from console / logs:
Mar 5 18:03:01 va64-x2200c-gmp03 herd: HARDWARE ERROR. This is *NOT* a software problem!
Mar 5 18:03:01 va64-x2200c-gmp03 herd: Please contact your hardware vendor
Mar 5 18:03:01 va64-x2200c-gmp03 herd: CPU 0 4 northbridge
Mar 5 18:03:01 va64-x2200c-gmp03 herd: TSC fcc73b11cf
Mar 5 18:03:01 va64-x2200c-gmp03 herd: ADDR 142110
Mar 5 18:03:01 va64-x2200c-gmp03 herd: Northbridge Chipkill ECC error
Mar 5 18:03:01 va64-x2200c-gmp03 herd: Chipkill ECC syndrome = 11ea
Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit46 = corrected ecc error
Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit57 = processor context corrupt
Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit61 = error uncorrected
Mar 5 18:03:01 va64-x2200c-gmp03 herd: bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic'
Mar 5 18:03:01 va64-x2200c-gmp03 herd: STATUS b675410011080a13 MCGSTATUS 0
• Example of running herd manually (pre herd install):
# herd -e 142110
000000142110: Cpu Node 0, DIMM 2
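The bit46 / bit57 / bit61 lines in the log are taken from the STATUS word, so they can be cross-checked by hand. The sketch below assumes the standard x86 machine check status layout plus the Opteron northbridge CECC/UECC bits. The names are illustrative, and this is not a description of how herd itself is implemented.

```python
# Bit positions assumed from the machine check status register layout.
STATUS_FLAGS = {
    63: "VAL   (error valid)",
    62: "OVER  (error overflow, a second error was seen)",
    61: "UC    (error uncorrected)",
    60: "EN    (error reporting enabled)",
    59: "MISCV (misc register valid)",
    58: "ADDRV (error address valid)",
    57: "PCC   (processor context corrupt)",
    46: "CECC  (corrected ECC error)",
    45: "UECC  (uncorrected ECC error)",
}


def set_flags(status: int) -> list:
    """Return (bit, description) for every known flag set in status."""
    return [(bit, name) for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


# STATUS value from the herd log on this slide.
flags = set_flags(0xB675410011080A13)
for bit, name in flags:
    print(f"bit{bit} = {name}")
```

For the logged value this reports bits 46, 57 and 61 among the set flags, matching the three lines herd printed.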
25. Sun Confidential: Internal Only 25
EDAC (Kernel 2.6.20.xx and above)
2 examples of edac not working & working (x2200):
Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
<< Multiple CE in quick succession or DIMM layout
Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: Failed to translate InputAddr to csrow for address 0xbb2c2fc0
Feb 25 06:50:57 va64-x2200c kernel: MC0: CE - no information available: k8_edac
Feb 25 06:50:57 va64-x2200c kernel: MC0: CE - no information available: k8_edac Error Overflow set
Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
^^ Failed to translate due to overflow bit set
This happens if more than one error has occurred before edac gets to it or if edac does not understand the DIMM layout.
Here is the correct format of edac's output:
Mar 4 10:43:42 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
<< Always CPU 0 (reporting error)
Mar 4 10:43:42 va64-x2200c kernel: MC0: CE page 0x100010, offset 0x10, grain 8, syndrome 0xa1e8, row 0, channel 0, label "": k8_edac
^^ This event tells you the actual offending CPU, which in this instance is CPU 0. (The label is not used by default, but Sun or the customer may populate it.)
Mar 4 10:43:42 va64-x2200c kernel: EDAC k8 MC0: extended error code: ECC error
<< Decode below:
MC0: CE error, page 0x100010 with offset 0x10 = address 0x100010010 (page number x 0x1000 + offset, assuming 4 KB pages). Grain = 8, which is Chipkill. Row 0, Channel 0 = CPU0, DIMM0
               Channel 0  Channel 1                  Channel 0  Channel 1
      ===================================    ===================================
Row>  csrow0 | DIMM_A0  | DIMM_B0  |         csrow2 | DIMM_A1  | DIMM_B1  |
      csrow1 | DIMM_A0  | DIMM_B0  |         csrow3 | DIMM_A1  | DIMM_B1  |
      ===================================    ===================================
If single rank DIMMs (1GB or less) then csrow1 and csrow3 are not used/available.
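The page/offset pair in a CE line can be turned back into a physical address with a shift and an OR. This is a minimal sketch, assuming that `page` is a page frame number and that pages are 4 KiB (the usual x86 convention). That is an assumption about k8_edac rather than something stated on the slide, and the function name is illustrative.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def edac_physical_address(page: int, offset: int) -> int:
    """Rebuild the physical address from an EDAC 'page' and 'offset' pair."""
    return (page << PAGE_SHIFT) | offset


# Values from the "MC0: CE page 0x100010, offset 0x10" line above.
addr = edac_physical_address(0x100010, 0x10)
print(hex(addr))
print(addr // 2**20, "MB")  # rough position in the memory map
```

The row and channel fields of the same log line, not the address alone, are what identify the DIMM in the table above.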
26. Sun Confidential: Internal Only 26
EDAC Continued:
•Example output from the SP (not created by edac):
1 | 02/23/2008 | 02:13:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
>>> Edac log should be here but does not show - Instead, you just see the BIOS scrubber results <<<
2 | 02/25/2008 | 16:27:55 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
3 | 02/25/2008 | 17:27:58 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
4 | 02/25/2008 | 18:28:00 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
5 | 02/25/2008 | 19:28:02 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
6 | 02/25/2008 | 20:28:04 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
7 | 02/25/2008 | 21:28:06 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
8 | 02/25/2008 | 22:28:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
9 | 02/25/2008 | 23:28:10 | Memory CPU0 DIMM0 | Correctable ECC | Asserted ... and so on ...
Do a cat of /proc/mc/0 for a row/column summary of the events that have occurred
It's edac or herd, not both!!! They both try to grab /dev/mce events and report. (rmmod k8_edac to remove)
And remember, the SEL log is your friend so always get an ipmi dump first before escalating or decoding.
27. Sun Confidential: Internal Only 27
The “mcelog”
• Linux kernels after 2.6.4 do not print recoverable
machine check errors
• Messages are saved in /var/log/mcelog
• mcelog reads errors from /dev/mcelog and then deletes
entries
• Typically run as a cron job:
> /usr/sbin/mcelog >> /var/log/mce
> *Note this is not collected by sysreport
• RedHat implemented as a daemon
• See RedHat advisory RHEA-2006-0134-7
28. Sun Confidential: Internal Only 28
MCAT (Machine Check Analysis Tool)
Event Source 62 - WMIxWDM
Processor Number : 0
Bank Number : 4
Time Stamp (0x): 01C856C4 58A8C10D
Error Status (0x): D4714000 E1080A13
Error Address (0x): 00000000 A047BF50
Error Misc. (0x): 00000000 00000000
Single bit errors:
Correctable ECC error
Error address valid in MCi_ADDR
Error reporting enabled
Second error
Error valid
Bus Error Code:
Participation processor: Local node responded to
the request (RES)
Time-out: Request did not time out
Memory transaction type: Generic read (RD)
I/O: DRAM memory access (MEM)
Cache level: Generic (LG)
North Bridge Error MC4:
Extended Error Code: 0x8 - ChipKill ECC Error
Error Code: 0x0A13
DRAM memory access (MEM) Generic read (RD),
on Generic (LG) cache
ChipKill Syndrome: 0xE1E2
Error address at 2564 MB
Takes input from a Windows Event Log entry and decodes the output:
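Several of the values MCAT prints can be reproduced directly from the raw Error Status and Error Address words. The sketch below assumes the family 0Fh northbridge status layout (error code in bits 15:0, extended error code in bits 19:16, ChipKill syndrome split across bits 31:24 and 54:47). The function name is illustrative and this is not MCAT's own code.

```python
def nb_status_fields(status: int) -> dict:
    """Extract a few northbridge (MC4) fields from a 64-bit status word."""
    return {
        "error_code": status & 0xFFFF,
        "ext_error_code": (status >> 16) & 0xF,
        # High syndrome byte sits in bits 31:24, low byte in bits 54:47.
        "chipkill_syndrome": (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF),
    }


# "Error Status (0x): D4714000 E1080A13" and
# "Error Address (0x): 00000000 A047BF50" from the output above.
status = (0xD4714000 << 32) | 0xE1080A13
address = (0x00000000 << 32) | 0xA047BF50

fields = nb_status_fields(status)
for name, value in fields.items():
    print(f"{name}: {value:#06x}")
print("error address at", address >> 20, "MB")
```

The same extraction applied to a status word from a Linux log gives the syndrome that herd or mcelog would print.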
29. Sun Confidential: Internal Only 29
MCAT Continued
• This can be gathered by running ipmitool fru:
FRU Device Description : p0.fru (ID 6)
Product Manufacturer : ADVANCED MICRO DEVICES
Product Name : DUAL CORE AMD OPTERON(TM) 275
Product Part Number : 0F21
Product Version : 02
FRU Device Description : p0.d0.fru (ID 8)
Product Manufacturer : MICRON TECHNOLOGY
Product Name : 1024MB DDR 400 (PC3200) ECC
Product Part Number : 18VDDF12872G-40BD3
Product Version : 0300
Product Serial : D7010058
FRU Device Description : p0.d1.fru (ID 9)
Product Manufacturer : MICRON TECHNOLOGY
Product Name : 1024MB DDR 400 (PC3200) ECC
Product Part Number : 18VDDF12872G-40BD3
Product Version : 0300
Product Serial : D7010056
FRU Device Description : p0.d2.fru (ID 10)
Product Manufacturer : MICRON TECHNOLOGY
Product Name : 1024MB DDR 400 (PC3200) ECC
Product Part Number : 18VDDF12872G-40BD3
Product Version : 0300
Product Serial : D701A6F4
FRU Device Description : p0.d3.fru (ID 11)
Product Manufacturer : MICRON TECHNOLOGY
Product Name : 1024MB DDR 400 (PC3200) ECC
Product Part Number : 18VDDF12872G-40BD3
Product Version : 0300
Product Serial : D701A6EE
FRU Output for this failing platform:
30. Sun Confidential: Internal Only 30
Manual Diagnosis
Processor Number: 0 = CPU 0 (if it said 4, it would be socket CPU4, not core 4).
Error address at 2564 MB (i.e. between 2 and 3 GBytes).
From the FRU information, each DIMM is 1 Gbyte.
The DIMMs are numbered from closest to the CPU outwards, based on mapping.
(DIMMs should be populated from outside inward but are mapped closest to CPU outwards).
The BIOS sets up memory from DIMM0/1 outwards.
Assuming "optimal defaults":
Our Opterons use a 128-bit wide data path. DIMM0 and DIMM1 are used in a pair.
These are single-rank DIMMs, but they are all identical, so "chipselect interleaving" is in effect.
The first 128KB are on DIMM0 and 1. The second 128KB are on DIMM2 and 3.
2564/128 = 20.03 ----> which is in DIMM0 and DIMM1 pair.
(Always replace Opteron platform DIMMs in pairs).
Windows reporting decode is performed as follows:
31. Sun Confidential: Internal Only 31
Manual Diagnosis
ChipKill Syndrome: 0xE1E2
Looking this up in table 26 of the AMD BIOS and Kernel Writer's Guide shows this is symbol 0x1a,
which, according to the text above table 26, maps to the upper 64 bits of the 128-bit data path.
DIMM0 from 00h-0fh provides the low 64 bits, DIMM1 from 10h-1fh provides the high 64 bits.
The check bits for the lower 64 bits are 20h-21h and the check bits for the upper 64 bits are 22h-23h.
Technical documentation including the AMD BIOS and Kernel Writers Guide is available from AMD via:
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00.html
Remember though to download the correct document for your processor revision:
Single/Dual core Opteron for x2100, x2200, x4100, x4200, x4500, x4600 etc. is document family 0fh.
Quad core Opteron for supported platforms is document family 10h.
Manual diagnosis continued:
34. Sun Confidential: Internal Only 34
Manual Diagnosis
ECC Syndrome Table (for completion) for 0Fh CPUs
(Single Error Correction, Double Error Detection):
             n=0  n=1  n=2  n=3  n=4  n=5  n=6  n=7
Bit (0+n)    ce   cb   d3   d5   d6   d9   da   dc
Bit (8+n)    23   25   26   29   2a   2c   31   34
Bit (16+n)   0e   0b   13   15   16   19   1a   1c
Bit (24+n)   e3   e5   e6   e9   ea   ec   f1   f4
Bit (32+n)   4f   4a   52   54   57   58   5b   5d
Bit (40+n)   a2   a4   a7   a8   ab   ad   b0   b5
Bit (48+n)   8f   8a   92   94   97   98   9b   9d
Bit (56+n)   62   64   67   68   6b   6d   70   75
Bit (64+n)   01   02   04   08   10   20   40   80
*Typically used for single DIMM configurations
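The table above is read in reverse when decoding: given a syndrome, find the row and column that contain it. A small Python sketch of that lookup follows, with the values copied from the table. The names are illustrative, and treating bits 64-71 as the ECC check bits is an assumption based on their single-bit syndromes.

```python
# One entry per table row: first bit number -> syndromes for n = 0..7.
SYNDROME_ROWS = {
    0:  "ce cb d3 d5 d6 d9 da dc",
    8:  "23 25 26 29 2a 2c 31 34",
    16: "0e 0b 13 15 16 19 1a 1c",
    24: "e3 e5 e6 e9 ea ec f1 f4",
    32: "4f 4a 52 54 57 58 5b 5d",
    40: "a2 a4 a7 a8 ab ad b0 b5",
    48: "8f 8a 92 94 97 98 9b 9d",
    56: "62 64 67 68 6b 6d 70 75",
    64: "01 02 04 08 10 20 40 80",
}

# Invert the table: syndrome value -> failing bit number.
SYNDROME_TO_BIT = {
    int(syndrome, 16): base + n
    for base, row in SYNDROME_ROWS.items()
    for n, syndrome in enumerate(row.split())
}


def failing_bit(syndrome: int):
    """Return the bit for a single-bit syndrome, or None if it is not in the table."""
    return SYNDROME_TO_BIT.get(syndrome)


print(failing_bit(0x13))  # found in the Bit (16+n) row, column n=2
print(failing_bit(0xFF))  # not a single-bit syndrome in this table
```

A syndrome that is missing from the table does not correspond to a single correctable bit under this scheme.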
35. Sun Confidential: Internal Only 35
Other Tools/Diags (Un-supported)
• Bonnie
> Benchmark to measure performance of filesystem
http://www.textuality.com/bonnie/
• Memtest86+
> Standalone bootable diagnostic
http://www.memtest.org/
http://www.memtest86.com/ Original version
• Other memory tool
http://people.redhat.com/dledford/memtest.html
http://sourceforge.net/
• Netperf or ttcp - google for them - network tools
36. Sun Confidential: Internal Only 36
SGR
• Situation appraisal – Recognise a problem
• Problem Analysis - Find True Cause
http://systems-tsc/twiki/pub/SGR/SgrtOnlineHelp/PA-guide.pdf
The Steps in FTC are:
* Define a Problem Statement
* Describe the problem with a Problem Specification
* Develop Possible Causes from either Experience or Differences and
Changes
* Identify the Most Probable Cause
* Test the Most Probable Cause against the Problem Specification
* Verify the Most Probable Cause
37. Sun Confidential: Internal Only 37
Newisys MCE Decoder v20/40z
What to gather from inventory get all -v
1. How many CPU's?
2. How many Dimms per CPU?
3. What is the part number of the Dimm?
NOTE:This is for V20/40z ONLY and only works on
Northbridge Errors
38. Sun Confidential: Internal Only 38
Details from CPU0 explained
• Here you see 4 identical Dimms on CPU0.
• The Dimm Manufacture part # is: 36VDDF25672G-40BD2
Name          Type    OEM               Manufacture Date  Hardware Revision  Part #
CPU 0 DIMM 0  memory  2cffffffffffffff  2005-04-16        0200               36VDDF25672G-40BD2
CPU 0 DIMM 1  memory  2cffffffffffffff  2005-04-16        0200               36VDDF25672G-40BD2
CPU 0 DIMM 2  memory  2cffffffffffffff  2005-03-19        0200               36VDDF25672G-40BD2
CPU 0 DIMM 3  memory  2cffffffffffffff  2005-03-19        0200               36VDDF25672G-40BD2
DDR 0 VRM     memvrm  S-SCI448          2005-05-27        A01                S01479
CPU 0 VRM     vrm     NA
39. Sun Confidential: Internal Only 39
Determine Type & Rank of Dimm
** Dimms can be single rank or dual rank. For a description of the differences see:
http://pts-platform.uk/twiki/bin/view/Products/ProdFAQv2040z or
http://en.wikipedia.org/wiki/DIMM#Ranking
Browse to the Qualified Memory page:
http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Stinger/StingerQualifiedMemory
Compare your DIMM Manufacture part number to the list: 36VDDF25672G-40BD2
This equates to a 2GB Micron Dual Rank DIMM:
Micron:
512MB: MT18VDDF6472G-40BG3 Die: G Single Rank SPD 1.0
1GB: MT18VDDF12872G-40BD3 Die: D Single Rank SPD 1.0
2GB: MT36VDDF25672G-40BD2 Die: D Dual Rank SPD 1.0
Now we are ready to populate the Memory Decode Tool
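The comparison against the qualified memory list can be sketched as a simple lookup. The table below contains only the three Micron entries shown on this slide. The function name and the handling of the missing "MT" prefix (the inventory output omits it) are illustrative assumptions.

```python
# Part number -> (size, rank), copied from the Micron entries on this slide.
QUALIFIED_MICRON = {
    "MT18VDDF6472G-40BG3": ("512MB", "Single Rank"),
    "MT18VDDF12872G-40BD3": ("1GB", "Single Rank"),
    "MT36VDDF25672G-40BD2": ("2GB", "Dual Rank"),
}


def dimm_type(part_number: str):
    """Look up size and rank, accepting part numbers with or without 'MT'."""
    key = part_number.strip().upper()
    if key not in QUALIFIED_MICRON:
        key = "MT" + key
    return QUALIFIED_MICRON.get(key)


# Part number reported by "inventory get all -v" in the example.
print(dimm_type("36VDDF25672G-40BD2"))
```

A result of None means the part is not among the entries copied here, and the full Qualified Memory page should be consulted.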
40. Sun Confidential: Internal Only 40
Warning! Decode Tool is Sun Internal
https://supportcenter.newisys.com/edbug/edbug_int.pl?auth=dfqsdftqw11p4jdvhasovygm82cbcfrk
This link cannot be shared with customers.
It is internal for Sun use only.
The link has the account and password in it.
42. Sun Confidential: Internal Only 42
Information for Decode Tool
Enter the CPU that has the machine check: (From the Error)
0, 1, 2, or 3
Enter the platform type:
2100 = V20z
4300 = V40z
Enter the machine check status: (From the Error)
Enter the machine check address: (From the Error)
Specify which CPUs have DIMMs: (From inventory get all -v)
Specify which DIMMs are populated on each CPU: (From Inventory get all -v)
Specify the DIMM type: (Rank from Qualified Memory Page)
BIOS defaults: Leave this at the default (Place a √ in DIMM interleaving, 128 bit
DIMM interface, and Chipkill ECC enabled. No √ in Node interleaving)
43. Sun Confidential: Internal Only 43
Result Output
Only one error is present
Error details:
K8_CPU-0 is reporting this corrected error:
DRAM chipkill ECC error found by scrubber
The DRAM error was at address '00000000 9C6A0B30' (2 GB range)
This error is related to DIMM 1 on K8_CPU-0
The ECC syndrome ('5E34'x) maps to a correctable error at data bit 66
Within the DIMM, this would be an error at physical bit 2
Processor was responding to another source of the transaction
Transaction was a read
Error classification:
Error type: DRAM ECC
Error severity: Corrected
Error enabled: yes
Error recovered: yes
Possible sympathy: no
Error address: '000000009C6A0B30'x
Address type: Physical
44. Sun Confidential: Internal Only 44
Anything Else........
• Newisys Machine Check (northbridge only)
• V20/40z only
• http://systems-tsc/twiki/pub/Products/ProdTroubleshootingV20z/V20z-V40z-Memory-DIMM
• Windows Debugging
> http://www.microsoft.com/whdc/devtools/debugging/default.mspx
• MCAT
• http://www.amd.com/us-en/assets/content_type/utilities/mcatsetup.exe
Machine Check Analysis Tool (MCAT) is a command line utility
that takes Windows System Event Log (.evt) file as an argument
and decodes the MCA Error logs into human readable format.
MCAT can alternatively take in MCE Error information as raw
register hexadecimal values as command line argument as well.