SlideShare una empresa de Scribd logo
1 de 26
Descargar para leer sin conexión
presented by
Server RAS and UEFI CPER
Spring 2017 UEFI Seminar and Plugfest
March 27 - 31, 2017
Presented by
Lucia, Mao (Intel)
Spike Yuan (Intel)
UEFI Plugfest – March 2017 www.uefi.org 1
Updated 2011-06-01
Agenda
• RAS Basic
• Server RAS Challenges
• SW Building Blocks
• Possible Solutions
• Call to Action
UEFI Plugfest – March 2017 www.uefi.org 2
RAS Basic
UEFI Plugfest – March 2017 www.uefi.org 3
• Reliability:
– System capability to detect errors, correct errors, and flag errors.
– Measured in FITs (Failure in Time).
– 1FIT = 1 Failure / 1 Billion hours (MTBF – Mean Time Between Failures
= 114K Years!)
• Availability
– System capability to stay operational even when error occur.
– Measured in terms of ‘down time within a time interval’
– Five 9’s (99.999% Up Time) => 22 seconds Down Time in One Month
• Serviceability
– System capability to report failures for “FRU Isolation” and ease of
repair. FRU – Field Replaceable Unit, e.g., DIMM, PCI Express* device
RAS101: Key Definitions
Service Failure
Sources of Fault
Fault, Error, and Failure
Operator
Mistake
Unstable
Environment
Marginal
Hardware
Incorrect
Design
Physical
Defect
Error
Service Failure:
When the delivered service deviates from the
specified service
Unobservable
State
Observable
State
Detected
Fatal
SN Error Source Definition Example
1 Transient Error Electrical Noise induced faults mainly
affecting links such as DDR Bus, or PCI
Express links.
Transient errors on links may alter the data,
Command or Address bits during read/write
operation. Reads won’t alter the DRAM
stored value, but ‘Writes’ may alter.
2 Soft Error Errors due to external high energy
particle strike, e.g., Alpha particles,
Neutrons. Soft errors could occur is any
known good system
Result in affecting storage structures such as
DRAM cell (SBE or MBE), L1/L2/L3 caches.
3 Hard (Device)
Failure
Device failure due to marginality of the
device or degradation over time.
Failure of entire device such as DRAM,
memory buffer chip, or CPU chip
Sources of Fault/Error
• SBE: Single-bit Error. MBE: Multi-bit Error
An Example - Intel® Xeon®
Processor Fault Classification
MCA: Machine Check Architecture
DUE: Detectable but Uncorrected Error
UCR: Uncorrected Recoverable
Faults
Detected
(e.g., MCA)
Corrected Uncorrected
Catastrophic
(DUE)
Fatal
(DUE)
Recoverable
(UCR)
Uncorrected
No Action
(UCNA)
SW Recoverable
Action Optional
(SRAO)
SW Recoverable
Action Required
(SRAR)
Undetected
Benign Critical
Server RAS Challenges
UEFI Plugfest – March 2017 www.uefi.org 8
Apps
OS
FW
HW
Apps/Service
Reconfigured
Fault Correction
HW/FW/SW
Fault DetectionFault Avoidance
Life of a Fault – Pillars of RAS
AppsRestored
Fault
(HW)
Fault
(SW)
Detection
Correct in
HW
Apps
Failure
Apps
Degradation
Handle in
OS
Correct in
FW
Avoidance
System Reliability Serviceability
System Availability
Fault (SW) Examples:
1. Programming bug
2. Configuration Error
3. Operator Error
Fault (HW) Examples:
1. Marginal Design
2. Unstable environment
3. High energy particle strike
4. Component failure or degradation
Fault Avoidance
Hidden: Apps, SW, FW, HW
Visible: None
Feature Example:
1. Micro-circuit level capabilities
Fault Detection in HW
Hidden: Apps, SW, FW
Visible: HW
Feature Examples:
1. CRC Check
2. Parity Detection
3. Patrol Scrub
Fault Correction in HW
Hidden: Apps, SW, FW
Visible: HW
Examples:
1. Cache and Buffer ECC
2. Memory SDDC, Mirroring
3. Links CRC Retry
Fault Correction in FW
Hidden: Apps, SW
Visible: HW, FW
Feature Example:
1. Memory Sparing
Fault Handling in SW Hidden: Apps
Visible: HW, FW, SW
Feature Examples:
1. PCI Express* Link Retry using AER (Advance
Error Reporting)
2. CMCI (Corrected Machine Check Interrupt)
based Predictive Failure Analysis
System Availability Delivered Through the Stack (HW, FW, SW)
RAS Enabling Framework
Fault Handling = RAS (Four Pillars)
Fault Tolerance
1. Avoidance
2. Detection
3. Correction
Fault Management
4. Reconfiguration
E.g., Failed DIMM Isolation
Diagnosability/Serviceability/Manageability
Minimizing Downtime
Error Logging
FW-based Fault Management
OS-based Fault Management
Silicon
FW (Silicon RefCode)
FW (OEM/IBV/BMC)
OS/VMM
Application
System Stack
Error Signaling/Polling
RAS Enabling Requires HW, FW/BIOS, OS/SW
E.g., Memory ECC
System Reliability
Extending the Uptime
Fault Avoidance, Detection, and
Correction in HW
Fault Correction Through FW
Fault Correction Through OS
Fault Correction at App Layer
Which of these is Cloud Cluster and HPC Cluster?
RAS Needs of Cloud and
HPC Clusters
Cloud Infrastructure HPC Infrastructure
Need Fault Handling Need Fault Handling
Check pointing is not used Check-pointing is actively used
Applications can tolerate single machine failure Applications can not tolerant single machine
failure
Fault Management Capabilities for improving
TCO
Fault Tolerant Capabilities for extending uptime
Example: Automated techniques for identify
failed component
Example: HW/FW based self-healing
techniques
RAS Needs of Mission Critical Segment
Prevent/minimize unplanned downtime
Source: Trend in IT Value Report, Standish Group International, 2008
System UP TimeCorrectable
Fault
Fatal
Unplanned System DOWN
Undesirable
Intel® Xeon® RAS features directly impact end-user’s bottom line!
What would an outage cost?
FW/SW Building Blocks
UEFI Plugfest – March 2017 www.uefi.org 13
Error Reporting Basics
• Error Reporting includes two functions:
– Logging
– Signaling
• Logging
– Through MCA Banks, PCIe AER Registers, and Memory Corrected Error Registers
• eMCA2 Mode – Enhanced Error reporting to support Firmware-First mode;
• Signaling of Corrected Errors
– CMCI (Corrected Machine Check Interrupt)
• Threshold based
• Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode.
– CSMI (Corrected SMI) for core/uncore (Part of the eMCA2 new Feature)
• Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode.
• No Threshold
– SMI (System Management Interrupt) for Memory errors
– MSI (Message Signaled Interrupt) or external signaling for PCI Express* errors
Error Reporting Basics
(Continued)
• Signaling of Uncorrected Recoverable Errors (e.g., UCNA)
– CMCI for core/uncore errors at the source
• Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode.
– MSMI (Machine Check SMI) for core/uncore errors at the source (Part of the eMCA2)
• Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode.
• MSMI trigger (same as SMI).
– MSI or external signaling for Severity1 PCIe AER nonfatal errors
• Signaling of Uncorrected Recoverable Errors (e.g., SRAO and SRAR)
– MCERR (Machine Check Error) for core/uncore errors
• External signaling – via CATERR_N pin (16 BCLK Pulse). Allows propagation to other
sockets.
• In-band signaling – MCE trigger (vector 18h). In-band SMI trigger if configured.
• Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode
– MSMI (Machine Check SMI) for core/uncore errors at the source(Part of the eMCA2)
• Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode.
• External signaling – via MSMI_N pin. Allows propagation to other sockets.
• In-band signaling – MSMI trigger (same as SMI).
UCNA: Uncorrected No Action
SRAO: SW Recoverable Action Optional
SRAR: SW Recoverable Action Required
Linux RAS Blocks
UEFI Plugfest – March 2017 www.uefi.org 16
Linux CMCI/MCA Handler
with eMCA2
UEFI Plugfest – March 2017 www.uefi.org 17
Machine Check Banks
Other HW Registers
Every Corrected
Error Occurred
SMI
Hardware
OS
CMCI Handler
Read
Error
Log
SMI HandlerBIOS
Main Memory
Enhance
Error
Log
CMCI
Optional
Read
Enhance
Error
Log
Read
Error
Log
eMCA Enabled
Machine Check Banks
Other HW Registers
Uncorrected Error
Occurred
SMI
Hardware
OS
MCA Handler
Read
Error
Log
SMI Handler
BIOS
Main Memory
Enhance
Error
Log
Read
Enhance
Error
Log
Read
Error
Log
eMCA Enabled
INT18
1
2
2
34
5a 5b
APEI Overview
Application Software, e.g., APEI based Tool
Kernel OSPM System Code
Device Driver ACPI Driver
ACPI Tables BIOS
Hardware Platform including Processors, Memory, and I/O
OS Independent code
OS Specific code
ACPI Table Interface
Existing
industry
standard
register
interface, e.g.,
MSR Rd/Wr
Boot-time:
Step 1a: BIOS/SMM presents
APEI tables towards OS. BIOS
checks if target processor
supports APEI feature prior to
presenting to OS
Run-time:
Step 2b: OS requests BIOS/SMM
to enable APEI, i.e., enable
machine-check bank non-zero
value write capability
Step 2c: OS writes Machine-
check banks values requested in
Step 2d: OS requests BIOS/SMM
to either inject MCERR or CMCI.
Note that actual trigger event
occurs within OS context.
Step 2f: Once testing is
completed, OS requests
disabling MCERR/CMCI signaling.
Run-time:
Step 2a: APEI based Error
Injection Tool requests
OS to inject EINJ based
error in selected
Machine-check bank
Step 2e: BIOS discovers
the banks updated by OS
and program SMI
triggering source
registers.
UEFI CPER Overview
• Common Platform Error Record
• CPER is also the format used to
describe platform hardware error by
various APEI tables, such as ERST,
BERT and HEST etc.
UEFI Plugfest – March 2017 www.uefi.org 19
Linux CPER implementation
• Legacy MCA way:
In arch/x86/kernel/cpu/mcheck/mce-apei.c -
an error report given to the kernel by APEI and pass it into the normal
Linux error logging code that invoke after finding an error in a machine
check bank. This code didn’t provide address information in the machine
check bank.
• eMCA2 Mode:
In drivers/acpi/acpi_extlog.c –
on recent generation servers with eMCA, this code picks up additional
information provided by BIOS associated with each error logged. All Linux
really looks for in the CPER record is the “handle” into the SMBIOS/DMI
table so that, for example, it can report which DIMM is associated with the
error.
UEFI Plugfest – March 2017 www.uefi.org 20
precisely map error vs.
specific component
UEFI Plugfest – March 2017 www.uefi.org 21
Solution
Platform UEFI
Boot Service
1. Platform Error logging
2. Platform Error report
3. Platform Specific
algorithms
UEFI Runtime
Service using
CPER
1. Classify error severity
2. Consolidate error sources
3. Standardize RAS policy
OS Module
1. OS Error log
2. OS/App Error
Recovery
3. Component offline or
replacement
1
2 3
CPER w/ eMCA2
UEFI Plugfest – March 2017 www.uefi.org 23
Machine Check Banks
Other HW Registers
Uncorrected Error
Occurred
SMI
Hardware
OS
UEFI CPER
Read
Error
Log
SMI Handler
BIOS
Main Memory
Enhance
Error
Log
Read
Enhance
Error
Log
Read
Error
Log
eMCA Enabled
INT18
1
2
2
34
5a 5b
Server RAS Policy
Call to Action
UEFI Plugfest – March 2017 www.uefi.org 24
Call to Action
• Standardize UEFI Error Reporting/Error
Handling/RAS Policy Protocol
• Centralize more Errors sources like PCIe
AER/MCA etc.
• Connect APEI/CPER with OS-based
policy like CPU/Memory/PCIe devices
hotplug;
• OS-guided RAS policy back to FW and
HW, like page offline.
UEFI Plugfest – March 2017 www.uefi.org 25
Thanks for attending the Spring
2017 UEFI Seminar and Plugfest
For more information on the
Unified EFI Forum and UEFI
Specifications, visit
http://www.uefi.org
presented by
UEFI Plugfest – March 2017 www.uefi.org 26

Más contenido relacionado

La actualidad más candente

PopcornSAR Specialized in AUTOSAR_Company profile
PopcornSAR Specialized in AUTOSAR_Company profilePopcornSAR Specialized in AUTOSAR_Company profile
PopcornSAR Specialized in AUTOSAR_Company profilePopcornSAR
 
Automotive embedded systems part5 v2
Automotive embedded systems part5 v2Automotive embedded systems part5 v2
Automotive embedded systems part5 v2Keroles karam khalil
 
Understanding and Improving Device Access Complexity
Understanding and Improving Device Access ComplexityUnderstanding and Improving Device Access Complexity
Understanding and Improving Device Access Complexityasimkadav
 
How to enable AMD IOMMU in coreboot?
How to enable AMD IOMMU in coreboot?How to enable AMD IOMMU in coreboot?
How to enable AMD IOMMU in coreboot?Piotr Król
 
BlackHat 2011 - Exploiting Siemens Simatic S7 PLCs (slides)
BlackHat 2011 - Exploiting Siemens Simatic S7 PLCs (slides)BlackHat 2011 - Exploiting Siemens Simatic S7 PLCs (slides)
BlackHat 2011 - Exploiting Siemens Simatic S7 PLCs (slides)Michael Smith
 
Fault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemFault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemanujos25
 
Embedded system-Introduction to development cycle and development tool
Embedded system-Introduction to development cycle and development  toolEmbedded system-Introduction to development cycle and development  tool
Embedded system-Introduction to development cycle and development toolPantech ProLabs India Pvt Ltd
 
A Comprehensive Implementation and Evaluation of Direct Interrupt Delivery
A Comprehensive Implementation and Evaluation of Direct Interrupt DeliveryA Comprehensive Implementation and Evaluation of Direct Interrupt Delivery
A Comprehensive Implementation and Evaluation of Direct Interrupt DeliveryCheng-Chun William Tu
 
Interrupt handling
Interrupt handlingInterrupt handling
Interrupt handlingmaverick2203
 
Time is ready for the Civil Infrastructure Platform
Time is ready for the Civil Infrastructure PlatformTime is ready for the Civil Infrastructure Platform
Time is ready for the Civil Infrastructure PlatformYoshitake Kobayashi
 
Automotive embedded systems part5 v1
Automotive embedded systems part5 v1Automotive embedded systems part5 v1
Automotive embedded systems part5 v1Keroles karam khalil
 
What is AUTOSAR MCAL? Learn about the software module architecture and device...
What is AUTOSAR MCAL? Learn about the software module architecture and device...What is AUTOSAR MCAL? Learn about the software module architecture and device...
What is AUTOSAR MCAL? Learn about the software module architecture and device...Embitel Technologies (I) PVT LTD
 
Xen Euro Par07
Xen Euro Par07Xen Euro Par07
Xen Euro Par07congvc
 
Autosar software component
Autosar software componentAutosar software component
Autosar software componentFarzad Sadeghi
 
Android power management
Android power managementAndroid power management
Android power managementJerrin George
 

La actualidad más candente (19)

PopcornSAR Specialized in AUTOSAR_Company profile
PopcornSAR Specialized in AUTOSAR_Company profilePopcornSAR Specialized in AUTOSAR_Company profile
PopcornSAR Specialized in AUTOSAR_Company profile
 
ARM AAE - System Issues
ARM AAE - System IssuesARM AAE - System Issues
ARM AAE - System Issues
 
Automotive embedded systems part5 v2
Automotive embedded systems part5 v2Automotive embedded systems part5 v2
Automotive embedded systems part5 v2
 
Understanding and Improving Device Access Complexity
Understanding and Improving Device Access ComplexityUnderstanding and Improving Device Access Complexity
Understanding and Improving Device Access Complexity
 
How to enable AMD IOMMU in coreboot?
How to enable AMD IOMMU in coreboot?How to enable AMD IOMMU in coreboot?
How to enable AMD IOMMU in coreboot?
 
BlackHat 2011 - Exploiting Siemens Simatic S7 PLCs (slides)
BlackHat 2011 - Exploiting Siemens Simatic S7 PLCs (slides)BlackHat 2011 - Exploiting Siemens Simatic S7 PLCs (slides)
BlackHat 2011 - Exploiting Siemens Simatic S7 PLCs (slides)
 
Fault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemFault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating system
 
Embedded system-Introduction to development cycle and development tool
Embedded system-Introduction to development cycle and development  toolEmbedded system-Introduction to development cycle and development  tool
Embedded system-Introduction to development cycle and development tool
 
A Comprehensive Implementation and Evaluation of Direct Interrupt Delivery
A Comprehensive Implementation and Evaluation of Direct Interrupt DeliveryA Comprehensive Implementation and Evaluation of Direct Interrupt Delivery
A Comprehensive Implementation and Evaluation of Direct Interrupt Delivery
 
Interrupt handling
Interrupt handlingInterrupt handling
Interrupt handling
 
Time is ready for the Civil Infrastructure Platform
Time is ready for the Civil Infrastructure PlatformTime is ready for the Civil Infrastructure Platform
Time is ready for the Civil Infrastructure Platform
 
Automotive embedded systems part5 v1
Automotive embedded systems part5 v1Automotive embedded systems part5 v1
Automotive embedded systems part5 v1
 
What is AUTOSAR MCAL? Learn about the software module architecture and device...
What is AUTOSAR MCAL? Learn about the software module architecture and device...What is AUTOSAR MCAL? Learn about the software module architecture and device...
What is AUTOSAR MCAL? Learn about the software module architecture and device...
 
EC6703 unit-4
EC6703 unit-4EC6703 unit-4
EC6703 unit-4
 
Xen Euro Par07
Xen Euro Par07Xen Euro Par07
Xen Euro Par07
 
F33 book-depend-pres-pt6
F33 book-depend-pres-pt6F33 book-depend-pres-pt6
F33 book-depend-pres-pt6
 
4213ijsea06
4213ijsea064213ijsea06
4213ijsea06
 
Autosar software component
Autosar software componentAutosar software component
Autosar software component
 
Android power management
Android power managementAndroid power management
Android power management
 

Similar a Spike yuan server ras and uefi cper final

Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203Linaro
 
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118Wei Fu
 
Reliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxReliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxSamsung Open Source Group
 
Low cost embedded system
Low cost embedded systemLow cost embedded system
Low cost embedded systemece svit
 
Advanced debugging on ARM Cortex devices such as STM32, Kinetis, LPC, etc.
Advanced debugging on ARM Cortex devices such as STM32, Kinetis, LPC, etc.Advanced debugging on ARM Cortex devices such as STM32, Kinetis, LPC, etc.
Advanced debugging on ARM Cortex devices such as STM32, Kinetis, LPC, etc.Atollic
 
Simulation of Signals with Field Signal Simulator
Simulation of Signals with Field Signal SimulatorSimulation of Signals with Field Signal Simulator
Simulation of Signals with Field Signal SimulatorIOSR Journals
 
OSLecture1.ppt
OSLecture1.pptOSLecture1.ppt
OSLecture1.pptAkkiiDerp
 
Introduction to embedded System.pptx
Introduction to embedded System.pptxIntroduction to embedded System.pptx
Introduction to embedded System.pptxPratik Gohel
 
Cpu And Memory Events
Cpu And Memory EventsCpu And Memory Events
Cpu And Memory EventsAero Plane
 
Troubleshooting & Tools
Troubleshooting & ToolsTroubleshooting & Tools
Troubleshooting & ToolsPrabu U
 
6 profiling tools
6 profiling tools6 profiling tools
6 profiling toolsvideos
 
AAME ARM Techcon2013 004v02 Debug and Optimization
AAME ARM Techcon2013 004v02 Debug and OptimizationAAME ARM Techcon2013 004v02 Debug and Optimization
AAME ARM Techcon2013 004v02 Debug and OptimizationAnh Dung NGUYEN
 

Similar a Spike yuan server ras and uefi cper final (20)

Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
 
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118
 
Reliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxReliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on Linux
 
Low cost embedded system
Low cost embedded systemLow cost embedded system
Low cost embedded system
 
Techno-Fest-15nov16
Techno-Fest-15nov16Techno-Fest-15nov16
Techno-Fest-15nov16
 
Os note
Os noteOs note
Os note
 
EE8691 – EMBEDDED SYSTEMS.pptx
EE8691 – EMBEDDED SYSTEMS.pptxEE8691 – EMBEDDED SYSTEMS.pptx
EE8691 – EMBEDDED SYSTEMS.pptx
 
Advanced debugging on ARM Cortex devices such as STM32, Kinetis, LPC, etc.
Advanced debugging on ARM Cortex devices such as STM32, Kinetis, LPC, etc.Advanced debugging on ARM Cortex devices such as STM32, Kinetis, LPC, etc.
Advanced debugging on ARM Cortex devices such as STM32, Kinetis, LPC, etc.
 
Simulation of Signals with Field Signal Simulator
Simulation of Signals with Field Signal SimulatorSimulation of Signals with Field Signal Simulator
Simulation of Signals with Field Signal Simulator
 
OSLecture1.ppt
OSLecture1.pptOSLecture1.ppt
OSLecture1.ppt
 
Introduction to embedded System.pptx
Introduction to embedded System.pptxIntroduction to embedded System.pptx
Introduction to embedded System.pptx
 
Cpu And Memory Events
Cpu And Memory EventsCpu And Memory Events
Cpu And Memory Events
 
Troubleshooting & Tools
Troubleshooting & ToolsTroubleshooting & Tools
Troubleshooting & Tools
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
6 profiling tools
6 profiling tools6 profiling tools
6 profiling tools
 
EC8791-U5-PPT.pptx
EC8791-U5-PPT.pptxEC8791-U5-PPT.pptx
EC8791-U5-PPT.pptx
 
Os introduction
Os introductionOs introduction
Os introduction
 
Os introduction
Os introductionOs introduction
Os introduction
 
AAME ARM Techcon2013 004v02 Debug and Optimization
AAME ARM Techcon2013 004v02 Debug and OptimizationAAME ARM Techcon2013 004v02 Debug and Optimization
AAME ARM Techcon2013 004v02 Debug and Optimization
 
UNIT-III ES.ppt
UNIT-III ES.pptUNIT-III ES.ppt
UNIT-III ES.ppt
 

Último

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 

Último (20)

SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 

Spike yuan server ras and uefi cper final

  • 1. presented by Server RAS and UEFI CPER Spring 2017 UEFI Seminar and Plugfest March 27 - 31, 2017 Presented by Lucia, Mao (Intel) Spike Yuan (Intel) UEFI Plugfest – March 2017 www.uefi.org 1 Updated 2011-06-01
  • 2. Agenda • RAS Basic • Server RAS Challenges • SW Building Blocks • Possible Solutions • Call to Action UEFI Plugfest – March 2017 www.uefi.org 2
  • 3. RAS Basic UEFI Plugfest – March 2017 www.uefi.org 3
  • 4. • Reliability: – System capability to detect errors, correct errors, and flag errors. – Measured in FITs (Failure in Time). – 1FIT = 1 Failure / 1 Billion hours (MTBF – Mean Time Between Failures = 114K Years!) • Availability – System capability to stay operational even when error occur. – Measured in terms of ‘down time within a time interval’ – Five 9’s (99.999% Up Time) => 22 seconds Down Time in One Month • Serviceability – System capability to report failures for “FRU Isolation” and ease of repair. FRU – Field Replaceable Unit, e.g., DIMM, PCI Express* device RAS101: Key Definitions
  • 5. Service Failure Sources of Fault Fault, Error, and Failure Operator Mistake Unstable Environment Marginal Hardware Incorrect Design Physical Defect Error Service Failure: When the delivered service deviates from the specified service Unobservable State Observable State Detected Fatal
  • 6. SN Error Source Definition Example 1 Transient Error Electrical Noise induced faults mainly affecting links such as DDR Bus, or PCI Express links. Transient errors on links may alter the data, Command or Address bits during read/write operation. Reads won’t alter the DRAM stored value, but ‘Writes’ may alter. 2 Soft Error Errors due to external high energy particle strike, e.g., Alpha particles, Neutrons. Soft errors could occur is any known good system Result in affecting storage structures such as DRAM cell (SBE or MBE), L1/L2/L3 caches. 3 Hard (Device) Failure Device failure due to marginality of the device or degradation over time. Failure of entire device such as DRAM, memory buffer chip, or CPU chip Sources of Fault/Error • SBE: Single-bit Error. MBE: Multi-bit Error
  • 7. An Example - Intel® Xeon® Processor Fault Classification MCA: Machine Check Architecture DUE: Detectable but Uncorrected Error UCR: Uncorrected Recoverable Faults Detected (e.g., MCA) Corrected Uncorrected Catastrophic (DUE) Fatal (DUE) Recoverable (UCR) Uncorrected No Action (UCNA) SW Recoverable Action Optional (SRAO) SW Recoverable Action Required (SRAR) Undetected Benign Critical
  • 8. Server RAS Challenges UEFI Plugfest – March 2017 www.uefi.org 8
  • 9. Apps OS FW HW Apps/Service Reconfigured Fault Correction HW/FW/SW Fault DetectionFault Avoidance Life of a Fault – Pillars of RAS AppsRestored Fault (HW) Fault (SW) Detection Correct in HW Apps Failure Apps Degradation Handle in OS Correct in FW Avoidance System Reliability Serviceability System Availability Fault (SW) Examples: 1. Programming bug 2. Configuration Error 3. Operator Error Fault (HW) Examples: 1. Marginal Design 2. Unstable environment 3. High energy particle strike 4. Component failure or degradation Fault Avoidance Hidden: Apps, SW, FW, HW Visible: None Feature Example: 1. Micro-circuit level capabilities Fault Detection in HW Hidden: Apps, SW, FW Visible: HW Feature Examples: 1. CRC Check 2. Parity Detection 3. Patrol Scrub Fault Correction in HW Hidden: Apps, SW, FW Visible: HW Examples: 1. Cache and Buffer ECC 2. Memory SDDC, Mirroring 3. Links CRC Retry Fault Correction in FW Hidden: Apps, SW Visible: HW, FW Feature Example: 1. Memory Sparing Fault Handling in SW Hidden: Apps Visible: HW, FW, SW Feature Examples: 1. PCI Express* Link Retry using AER (Advance Error Reporting) 2. CMCI (Corrected Machine Check Interrupt) based Predictive Failure Analysis System Availability Delivered Through the Stack (HW, FW, SW)
  • 10. RAS Enabling Framework Fault Handling = RAS (Four Pillars) Fault Tolerance 1. Avoidance 2. Detection 3. Correction Fault Management 4. Reconfiguration E.g., Failed DIMM Isolation Diagnosability/Serviceability/Manageability Minimizing Downtime Error Logging FW-based Fault Management OS-based Fault Management Silicon FW (Silicon RefCode) FW (OEM/IBV/BMC) OS/VMM Application System Stack Error Signaling/Polling RAS Enabling Requires HW, FW/BIOS, OS/SW E.g., Memory ECC System Reliability Extending the Uptime Fault Avoidance, Detection, and Correction in HW Fault Correction Through FW Fault Correction Through OS Fault Correction at App Layer
  • 11. Which of these is Cloud Cluster and HPC Cluster? RAS Needs of Cloud and HPC Clusters Cloud Infrastructure HPC Infrastructure Need Fault Handling Need Fault Handling Check pointing is not used Check-pointing is actively used Applications can tolerate single machine failure Applications can not tolerant single machine failure Fault Management Capabilities for improving TCO Fault Tolerant Capabilities for extending uptime Example: Automated techniques for identify failed component Example: HW/FW based self-healing techniques
  • 12. RAS Needs of Mission Critical Segment Prevent/minimize unplanned downtime Source: Trend in IT Value Report, Standish Group International, 2008 System UP TimeCorrectable Fault Fatal Unplanned System DOWN Undesirable Intel® Xeon® RAS features directly impact end-user’s bottom line! What would an outage cost?
  • 13. FW/SW Building Blocks UEFI Plugfest – March 2017 www.uefi.org 13
  • 14. Error Reporting Basics • Error Reporting includes two functions: – Logging – Signaling • Logging – Through MCA Banks, PCIe AER Registers, and Memory Corrected Error Registers • eMCA2 Mode – Enhanced Error reporting to support Firmware-First mode; • Signaling of Corrected Errors – CMCI (Corrected Machine Check Interrupt) • Threshold based • Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode. – CSMI (Corrected SMI) for core/uncore (Part of the eMCA2 new Feature) • Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode. • No Threshold – SMI (System Management Interrupt) for Memory errors – MSI (Message Signaled Interrupt) or external signaling for PCI Express* errors
  • 15. Error Reporting Basics (Continued) • Signaling of Uncorrected Recoverable Errors (e.g., UCNA) – CMCI for core/uncore errors at the source • Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode. – MSMI (Machine Check SMI) for core/uncore errors at the source (Part of the eMCA2) • Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode. • MSMI trigger (same as SMI). – MSI or external signaling for Severity1 PCIe AER nonfatal errors • Signaling of Uncorrected Recoverable Errors (e.g., SRAO and SRAR) – MCERR (Machine Check Error) for core/uncore errors • External signaling – via CATERR_N pin (16 BCLK Pulse). Allows propagation to other sockets. • In-band signaling – MCE trigger (vector 18h). In-band SMI trigger if configured. • Enabled only in IA32-legacy MCA mode. Disabled in eMCA2 mode – MSMI (Machine Check SMI) for core/uncore errors at the source(Part of the eMCA2) • Enabled only in eMCA2 mode. Disabled in IA32-legacy MCA mode. • External signaling – via MSMI_N pin. Allows propagation to other sockets. • In-band signaling – MSMI trigger (same as SMI). UCNA: Uncorrected No Action SRAO: SW Recoverable Action Optional SRAR: SW Recoverable Action Required
  • 16. Linux RAS Blocks UEFI Plugfest – March 2017 www.uefi.org 16
  • 17. Linux CMCI/MCA Handler with eMCA2 UEFI Plugfest – March 2017 www.uefi.org 17 Machine Check Banks Other HW Registers Every Corrected Error Occurred SMI Hardware OS CMCI Handler Read Error Log SMI HandlerBIOS Main Memory Enhance Error Log CMCI Optional Read Enhance Error Log Read Error Log eMCA Enabled Machine Check Banks Other HW Registers Uncorrected Error Occurred SMI Hardware OS MCA Handler Read Error Log SMI Handler BIOS Main Memory Enhance Error Log Read Enhance Error Log Read Error Log eMCA Enabled INT18 1 2 2 34 5a 5b
  • 18. APEI Overview Application Software, e.g., APEI based Tool Kernel OSPM System Code Device Driver ACPI Driver ACPI Tables BIOS Hardware Platform including Processors, Memory, and I/O OS Independent code OS Specific code ACPI Table Interface Existing industry standard register interface, e.g., MSR Rd/Wr Boot-time: Step 1a: BIOS/SMM presents APEI tables towards OS. BIOS checks if target processor supports APEI feature prior to presenting to OS Run-time: Step 2b: OS requests BIOS/SMM to enable APEI, i.e., enable machine-check bank non-zero value write capability Step 2c: OS writes Machine- check banks values requested in Step 2d: OS requests BIOS/SMM to either inject MCERR or CMCI. Note that actual trigger event occurs within OS context. Step 2f: Once testing is completed, OS requests disabling MCERR/CMCI signaling. Run-time: Step 2a: APEI based Error Injection Tool requests OS to inject EINJ based error in selected Machine-check bank Step 2e: BIOS discovers the banks updated by OS and program SMI triggering source registers.
  • 19. UEFI CPER Overview • Common Platform Error Record • CPER is also the format used to describe platform hardware error by various APEI tables, such as ERST, BERT and HEST etc. UEFI Plugfest – March 2017 www.uefi.org 19
  • 20. Linux CPER implementation • Legacy MCA way: In arch/x86/kernel/cpu/mcheck/mce-apei.c - an error report given to the kernel by APEI and pass it into the normal Linux error logging code that invoke after finding an error in a machine check bank. This code didn’t provide address information in the machine check bank. • eMCA2 Mode: In drivers/acpi/acpi_extlog.c – on recent generation servers with eMCA, this code picks up additional information provided by BIOS associated with each error logged. All Linux really looks for in the CPER record is the “handle” into the SMBIOS/DMI table so that, for example, it can report which DIMM is associated with the error. UEFI Plugfest – March 2017 www.uefi.org 20
  • 21. precisely map error vs. specific component UEFI Plugfest – March 2017 www.uefi.org 21
  • 22. Solution Platform UEFI Boot Service 1. Platform Error logging 2. Platform Error report 3. Platform Specific algorithms UEFI Runtime Service using CPER 1. Classify error severity 2. Consolidate error sources 3. Standardize RAS policy OS Module 1. OS Error log 2. OS/App Error Recovery 3. Component offline or replacement 1 2 3
  • 23. CPER w/ eMCA2 UEFI Plugfest – March 2017 www.uefi.org 23 Machine Check Banks Other HW Registers Uncorrected Error Occurred SMI Hardware OS UEFI CPER Read Error Log SMI Handler BIOS Main Memory Enhance Error Log Read Enhance Error Log Read Error Log eMCA Enabled INT18 1 2 2 34 5a 5b Server RAS Policy
  • 24. Call to Action UEFI Plugfest – March 2017 www.uefi.org 24
  • 25. Call to Action • Standardize UEFI Error Reporting/Error Handling/RAS Policy Protocol • Centralize more Errors sources like PCIe AER/MCA etc. • Connect APEI/CPER with OS-based policy like CPU/Memory/PCIe devices hotplug; • OS-guided RAS policy back to FW and HW, like page offline. UEFI Plugfest – March 2017 www.uefi.org 25
  • 26. Thanks for attending the Spring 2017 UEFI Seminar and Plugfest For more information on the Unified EFI Forum and UEFI Specifications, visit http://www.uefi.org presented by UEFI Plugfest – March 2017 www.uefi.org 26