SlideShare una empresa de Scribd logo
1 de 14
Descargar para leer sin conexión
Detecting and
           Correcting Transient
            Hardware Errors
            John Byrne (john.l.byrne@hp.com), Norman P. Jouppi,
            Laura Ramirez, Parthasarathy Ranganathan, Bruce J. Walker
            HP Labs

            Nidhi Aggarwal, Kewal K. Saluja, James E. Smith
            University of Wisconsin – Madison




© 2009 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice
Availability/Reliability Spectrum
    High Availability;    Fault Tolerance; No lost
                                                     Fault Correction; keep
    Restart/Failover;     work; keep running,
                                                     Running with correct results
    Lose ongoing work     even with bad results




2      24 February 2009
Availability/Reliability Spectrum
    High Availability;             Fault Tolerance; No lost
                                                                Fault Correction; keep
    Restart VM;                    work; keep running,
                                                                Running with correct results
    Lose ongoing work              even with bad results

                      Ongoing FT Work:
                          Remus
                                                Checkpoint and restart on complete node failure
                          Kemari
                                              Output comparison; pick one if they don’t match
                          Marathon
       Lockstep
                          VMware
       Execution                               Backup takes over on complete node failure




3      24 February 2009
Availability/Reliability Spectrum
    High Availability;             Fault Tolerance; No lost
                                                                Fault Correction; keep
    Restart VM;                    work; keep running,
                                                                Running with correct results
    Lose ongoing work              even with bad results

                      Ongoing FT Work:
                          Remus
                                                Checkpoint and restart on complete node failure
                          Kemari
                                              Output comparison; pick one if they don’t match
                          Marathon
       Lockstep
                          VMware
       Execution                               Backup takes over on complete node failure

     Theme: Deal with Complete Node Failure


    No one is detecting or correcting transient processor failures



4      24 February 2009
Transient Hardware Errors
    International Technology Roadmap for Semiconductors
•
    has predicted significant reliability problems
    Intel study in 2005 indicated 100-fold increase in
•
    transient faults in scaling from 180nm to 16nm

Errors can:
a. Crash the OS;
b. Corrupt data;
c. Cause execution to take a different path;


Goal: Detect and Correct Transient Errors


5   24 February 2009
Detect and Correct Transient Errors
      Lockstep VMs with ongoing checkpoints.
1.
      Tee input to both VMs
2.
      Compare output from VMs and re-execute on
3.
      miscompare
      Compare checkpoints and re-execute on
4.
      miscompare
      Log input, interrupts and non-deterministic
5.
      instructions to allow completely accurate re-
      execution;



6    24 February 2009
Lockstep VMs
• Create 2 identical images at VM start time;
• Ensure response to each VMexit is identical;
    − Force them to be identical if necessary (rdtsc)
• Deliver              interrupt at the identical instruction
    − At each VMexit, deliver interrupts if they are pending;
    − Use the count-down PMU counter to force a
      synchronization point if no VMexits happen and deliver
      interrupts at that point.
• Log VMexit return values as necessary (for replay);
• Log when interrupts are delivered (for replay)


7   24 February 2009
I/O
• Input        is sent to both VMs (network and disk);
    − Input is logged as being part of a specific checkpoint so
      on replay inputs can come from the log;
• Output               is compared and if equal, is sent out;
    − If not equal, re-execute from the last good checkpoint
    − Blktap driver modified to allow disk output to be
      compared
    − Network backend driver modified to allow network
      output to be compared;
    − Output is counted so on replay the correct number of re-
      created outputs can be discarded and not output twice.

8   24 February 2009
Checkpointing and Comparing
    Incremental, periodic checkpoints of each VM
•
    − At exactly the same instruction;
    − Utilize COW or copy at checkpoint time;
    − Mark the checkpoint event in the input and output streams
    − Immediate continue execution after checkpoint done;
    After checkpoint X is done, compare the incremental
•
    checkpoints
    − If equal, delete one of the checkpoint
    − If equal, delete checkpoint X-1 + any logs for x-1;
    − If not equal, then do a replay from checkpoint X-1;
  At checkpoint event, tell input to start a new log;
•
• At checkpoint event, tell output to record the output count
  and start a new count for the next checkpoint


9    24 February 2009
Replay from Checkpoint X
• Restore                the registers and memory image
• Tell  input i/o to replay input from log for
     checkpoint X
• Tell  output that we are replaying from checkpoint X
     and it then throws away as many i/o’s as were
     done since checkpoint X.




10    24 February 2009
Initial Limitations
• Uniprocessor            VMs
• HVM            guests
• Single           Node implementation




11   24 February 2009
Performance Data to Date
• Implementation      to date includes lockstep VMs and
     disk i/o input funneling and output checking
     (network i/o is implemented but not yet tested)
• Bonnie   i/o benchmark didn’t show any
     degradation;
• SpecCPU     benchmark suite showed 2-5%
     degradation.




12    24 February 2009
Plans
• Implement             the checkpoint and replay code
• Work            on performance of lockstep and checkpoint
• Investigate           UP, HVM and single-site restrictions.




13   24 February 2009
XS Oracle 2009 Error Detection

Más contenido relacionado

La actualidad más candente

Linux On V Mware ESXi
Linux On V Mware ESXiLinux On V Mware ESXi
Linux On V Mware ESXi
Masafumi Ohta
 
Hyper-V Best Practices & Tips and Tricks
Hyper-V Best Practices & Tips and TricksHyper-V Best Practices & Tips and Tricks
Hyper-V Best Practices & Tips and Tricks
Amit Gatenyo
 
Презентация RDS & App-V, VDI
Презентация RDS & App-V, VDIПрезентация RDS & App-V, VDI
Презентация RDS & App-V, VDI
Виталий Стародубцев
 
i//:squared Business Continuity Event
i//:squared Business Continuity Eventi//:squared Business Continuity Event
i//:squared Business Continuity Event
Jonathan Allmayer
 

La actualidad más candente (20)

XS Japan 2008 Services English
XS Japan 2008 Services EnglishXS Japan 2008 Services English
XS Japan 2008 Services English
 
XS Boston 2008 Project Status
XS Boston 2008 Project StatusXS Boston 2008 Project Status
XS Boston 2008 Project Status
 
XS Boston 2008 Self IO Emulation
XS Boston 2008 Self IO EmulationXS Boston 2008 Self IO Emulation
XS Boston 2008 Self IO Emulation
 
Ian Prattlinuxworld Xen Aug2008
Ian Prattlinuxworld Xen Aug2008Ian Prattlinuxworld Xen Aug2008
Ian Prattlinuxworld Xen Aug2008
 
XS Oracle 2009 PVOps
XS Oracle 2009 PVOpsXS Oracle 2009 PVOps
XS Oracle 2009 PVOps
 
XS 2008 Boston VTPM
XS 2008 Boston VTPMXS 2008 Boston VTPM
XS 2008 Boston VTPM
 
Nakajima hvm-be final
Nakajima hvm-be finalNakajima hvm-be final
Nakajima hvm-be final
 
XS Boston 2008 OpenSolaris
XS Boston 2008 OpenSolarisXS Boston 2008 OpenSolaris
XS Boston 2008 OpenSolaris
 
XS Japan 2008 Xen Mgmt English
XS Japan 2008 Xen Mgmt EnglishXS Japan 2008 Xen Mgmt English
XS Japan 2008 Xen Mgmt English
 
XS Boston 2008 Network Topology
XS Boston 2008 Network TopologyXS Boston 2008 Network Topology
XS Boston 2008 Network Topology
 
How to Optimize Microsoft Hyper-V Failover Cluster and Double Performance
How to Optimize Microsoft Hyper-V Failover Cluster and Double PerformanceHow to Optimize Microsoft Hyper-V Failover Cluster and Double Performance
How to Optimize Microsoft Hyper-V Failover Cluster and Double Performance
 
Ina Pratt Fosdem Feb2008
Ina Pratt Fosdem Feb2008Ina Pratt Fosdem Feb2008
Ina Pratt Fosdem Feb2008
 
Linux On V Mware ESXi
Linux On V Mware ESXiLinux On V Mware ESXi
Linux On V Mware ESXi
 
XS Oracle 2009 Just Run It
XS Oracle 2009 Just Run ItXS Oracle 2009 Just Run It
XS Oracle 2009 Just Run It
 
Hyper V And Scvmm Best Practis
Hyper V And Scvmm Best PractisHyper V And Scvmm Best Practis
Hyper V And Scvmm Best Practis
 
Hyper-V Best Practices & Tips and Tricks
Hyper-V Best Practices & Tips and TricksHyper-V Best Practices & Tips and Tricks
Hyper-V Best Practices & Tips and Tricks
 
Презентация RDS & App-V, VDI
Презентация RDS & App-V, VDIПрезентация RDS & App-V, VDI
Презентация RDS & App-V, VDI
 
Windows Server "10": что нового в виртуализации
Windows Server "10": что нового в виртуализацииWindows Server "10": что нового в виртуализации
Windows Server "10": что нового в виртуализации
 
i//:squared Business Continuity Event
i//:squared Business Continuity Eventi//:squared Business Continuity Event
i//:squared Business Continuity Event
 
Building vSphere Perf Monitoring Tools
Building vSphere Perf Monitoring ToolsBuilding vSphere Perf Monitoring Tools
Building vSphere Perf Monitoring Tools
 

Destacado (7)

XS Oracle 2009 Networking 10gig
XS Oracle 2009 Networking 10gigXS Oracle 2009 Networking 10gig
XS Oracle 2009 Networking 10gig
 
Xen Directions HXEN Slides
Xen Directions HXEN SlidesXen Directions HXEN Slides
Xen Directions HXEN Slides
 
Ese Neno Da Rua
Ese Neno Da RuaEse Neno Da Rua
Ese Neno Da Rua
 
Apollotheme presentation
Apollotheme presentationApollotheme presentation
Apollotheme presentation
 
XS Boston 2008 Guest Spinning
XS Boston 2008 Guest SpinningXS Boston 2008 Guest Spinning
XS Boston 2008 Guest Spinning
 
Javaiostream
JavaiostreamJavaiostream
Javaiostream
 
XS Boston 2008 Paravirt Ops in Linux IA64
XS Boston 2008 Paravirt Ops in Linux IA64XS Boston 2008 Paravirt Ops in Linux IA64
XS Boston 2008 Paravirt Ops in Linux IA64
 

Similar a XS Oracle 2009 Error Detection

Network Connectivity Test Script
Network Connectivity Test ScriptNetwork Connectivity Test Script
Network Connectivity Test Script
Mersedeh Arvaneh
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Jignesh Shah
 
The Verification Methodology Landscape
The Verification Methodology LandscapeThe Verification Methodology Landscape
The Verification Methodology Landscape
DVClub
 
Everything you ever wanted to know about deployment but were afraid to ask
Everything you ever wanted to know about deployment but were afraid to askEverything you ever wanted to know about deployment but were afraid to ask
Everything you ever wanted to know about deployment but were afraid to ask
lauraxthomson
 
Designing and Deploying Internet-Scale Services
Designing and Deploying Internet-Scale ServicesDesigning and Deploying Internet-Scale Services
Designing and Deploying Internet-Scale Services
bigqiang zou
 
20080501 software verification_sharygina_lecture01
20080501 software verification_sharygina_lecture0120080501 software verification_sharygina_lecture01
20080501 software verification_sharygina_lecture01
Computer Science Club
 
Continuous delivery continuous integration 0.3
Continuous delivery continuous integration 0.3Continuous delivery continuous integration 0.3
Continuous delivery continuous integration 0.3
Alex Tregubov
 

Similar a XS Oracle 2009 Error Detection (20)

Network Connectivity Test Script
Network Connectivity Test ScriptNetwork Connectivity Test Script
Network Connectivity Test Script
 
A05
A05A05
A05
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
 
The Verification Methodology Landscape
The Verification Methodology LandscapeThe Verification Methodology Landscape
The Verification Methodology Landscape
 
Everything you ever wanted to know about deployment but were afraid to ask
Everything you ever wanted to know about deployment but were afraid to askEverything you ever wanted to know about deployment but were afraid to ask
Everything you ever wanted to know about deployment but were afraid to ask
 
SPA 2009 - Acceptance Testing AJAX Web Applications through the GUI
SPA 2009 - Acceptance Testing AJAX Web Applications through the GUISPA 2009 - Acceptance Testing AJAX Web Applications through the GUI
SPA 2009 - Acceptance Testing AJAX Web Applications through the GUI
 
Ruby и TestComplete
Ruby и TestCompleteRuby и TestComplete
Ruby и TestComplete
 
Designing and Deploying Internet-Scale Services
Designing and Deploying Internet-Scale ServicesDesigning and Deploying Internet-Scale Services
Designing and Deploying Internet-Scale Services
 
20080501 software verification_sharygina_lecture01
20080501 software verification_sharygina_lecture0120080501 software verification_sharygina_lecture01
20080501 software verification_sharygina_lecture01
 
When Web Services Go Bad
When Web Services Go BadWhen Web Services Go Bad
When Web Services Go Bad
 
Technical meeting automated testing with vs2010
Technical meeting automated testing with vs2010Technical meeting automated testing with vs2010
Technical meeting automated testing with vs2010
 
Verify Your Kubernetes Clusters with Upstream e2e tests
Verify Your Kubernetes Clusters with Upstream e2e testsVerify Your Kubernetes Clusters with Upstream e2e tests
Verify Your Kubernetes Clusters with Upstream e2e tests
 
How Not To Be Caught Flat-footed With Unpredictable FME Results
How Not To Be Caught Flat-footed With Unpredictable FME ResultsHow Not To Be Caught Flat-footed With Unpredictable FME Results
How Not To Be Caught Flat-footed With Unpredictable FME Results
 
Lessons Learned in Software Development: QA Infrastructure – Maintaining Rob...
Lessons Learned in Software Development: QA Infrastructure – Maintaining Rob...Lessons Learned in Software Development: QA Infrastructure – Maintaining Rob...
Lessons Learned in Software Development: QA Infrastructure – Maintaining Rob...
 
Ug. marketplace testing
Ug. marketplace testingUg. marketplace testing
Ug. marketplace testing
 
Wndows Phone 7 Marketplace testing
Wndows Phone 7 Marketplace testingWndows Phone 7 Marketplace testing
Wndows Phone 7 Marketplace testing
 
Viavi_TeraVM Core Emulator.pptx
Viavi_TeraVM Core Emulator.pptxViavi_TeraVM Core Emulator.pptx
Viavi_TeraVM Core Emulator.pptx
 
Reliability Patterns for Large-Scale Automated Tests
Reliability Patterns for Large-Scale Automated TestsReliability Patterns for Large-Scale Automated Tests
Reliability Patterns for Large-Scale Automated Tests
 
Continuous delivery continuous integration 0.3
Continuous delivery continuous integration 0.3Continuous delivery continuous integration 0.3
Continuous delivery continuous integration 0.3
 
Yield improvement of an eeprom for automotive applications
Yield improvement of an eeprom for automotive applicationsYield improvement of an eeprom for automotive applications
Yield improvement of an eeprom for automotive applications
 

Más de The Linux Foundation

Más de The Linux Foundation (20)

ELC2019: Static Partitioning Made Simple
ELC2019: Static Partitioning Made SimpleELC2019: Static Partitioning Made Simple
ELC2019: Static Partitioning Made Simple
 
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
XPDDS19: How TrenchBoot is Enabling Measured Launch for Open-Source Platform ...
 
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
XPDDS19 Keynote: Xen in Automotive - Artem Mygaiev, Director, Technology Solu...
 
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
XPDDS19 Keynote: Xen Project Weather Report 2019 - Lars Kurth, Director of Op...
 
XPDDS19 Keynote: Unikraft Weather Report
XPDDS19 Keynote:  Unikraft Weather ReportXPDDS19 Keynote:  Unikraft Weather Report
XPDDS19 Keynote: Unikraft Weather Report
 
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
XPDDS19 Keynote: Secret-free Hypervisor: Now and Future - Wei Liu, Software E...
 
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, XilinxXPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
XPDDS19 Keynote: Xen Dom0-less - Stefano Stabellini, Principal Engineer, Xilinx
 
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
XPDDS19 Keynote: Patch Review for Non-maintainers - George Dunlap, Citrix Sys...
 
XPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
XPDDS19: Memories of a VM Funk - Mihai Donțu, BitdefenderXPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
XPDDS19: Memories of a VM Funk - Mihai Donțu, Bitdefender
 
OSSJP/ALS19: The Road to Safety Certification: Overcoming Community Challeng...
OSSJP/ALS19:  The Road to Safety Certification: Overcoming Community Challeng...OSSJP/ALS19:  The Road to Safety Certification: Overcoming Community Challeng...
OSSJP/ALS19: The Road to Safety Certification: Overcoming Community Challeng...
 
OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
 OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making... OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
OSSJP/ALS19: The Road to Safety Certification: How the Xen Project is Making...
 
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, CitrixXPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
XPDDS19: Speculative Sidechannels and Mitigations - Andrew Cooper, Citrix
 
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltdXPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
XPDDS19: Keeping Coherency on Arm: Reborn - Julien Grall, Arm ltd
 
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
XPDDS19: QEMU PV Backend 'qdevification'... What Does it Mean? - Paul Durrant...
 
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&DXPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
XPDDS19: Status of PCI Emulation in Xen - Roger Pau Monné, Citrix Systems R&D
 
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM SystemsXPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
XPDDS19: [ARM] OP-TEE Mediator in Xen - Volodymyr Babchuk, EPAM Systems
 
XPDDS19: Bringing Xen to the Masses: The Story of Building a Community-driven...
XPDDS19: Bringing Xen to the Masses: The Story of Building a Community-driven...XPDDS19: Bringing Xen to the Masses: The Story of Building a Community-driven...
XPDDS19: Bringing Xen to the Masses: The Story of Building a Community-driven...
 
XPDDS19: Will Robots Automate Your Job Away? Streamlining Xen Project Contrib...
XPDDS19: Will Robots Automate Your Job Away? Streamlining Xen Project Contrib...XPDDS19: Will Robots Automate Your Job Away? Streamlining Xen Project Contrib...
XPDDS19: Will Robots Automate Your Job Away? Streamlining Xen Project Contrib...
 
XPDDS19: Client Virtualization Toolstack in Go - Nick Rosbrook & Brendan Kerr...
XPDDS19: Client Virtualization Toolstack in Go - Nick Rosbrook & Brendan Kerr...XPDDS19: Client Virtualization Toolstack in Go - Nick Rosbrook & Brendan Kerr...
XPDDS19: Client Virtualization Toolstack in Go - Nick Rosbrook & Brendan Kerr...
 
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSEXPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
XPDDS19: Core Scheduling in Xen - Jürgen Groß, SUSE
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

XS Oracle 2009 Error Detection

  • 1. Detecting and Correcting Transient Hardware Errors John Byrne (john.l.byrne@hp.com), Norman P. Jouppi, Laura Ramirez, Parthasarathy Ranganathan, Bruce J. Walker HP Labs Nidhi Aggarwal, Kewal K. Saluja, James E. Smith University of Wisconsin – Madison © 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
  • 2. Availability/Reliability Spectrum High Availability; Fault Tolerance; No lost Fault Correction; keep Restart/Failover; work; keep running, Running with correct results Lose ongoing work even with bad results 2 24 February 2009
  • 3. Availability/Reliability Spectrum High Availability; Fault Tolerance; No lost Fault Correction; keep Restart VM; work; keep running, Running with correct results Lose ongoing work even with bad results Ongoing FT Work: Remus Checkpoint and restart on complete node failure Kemari Output comparison; pick one if they don’t match Marathon Lockstep VMware Execution Backup takes over on complete node failure 3 24 February 2009
  • 4. Availability/Reliability Spectrum High Availability; Fault Tolerance; No lost Fault Correction; keep Restart VM; work; keep running, Running with correct results Lose ongoing work even with bad results Ongoing FT Work: Remus Checkpoint and restart on complete node failure Kemari Output comparison; pick one if they don’t match Marathon Lockstep VMware Execution Backup takes over on complete node failure Theme: Deal with Complete Node Failure No one is detecting or correcting transient processor failures 4 24 February 2009
  • 5. Transient Hardware Errors International Technology Roadmap for Semiconductors • has predicted significant reliability problems Intel study in 2005 indicated 100-fold increase in • transient faults in scaling from 180nm to 16nm Errors can: a. Crash the OS; b. Corrupt data; c. Cause execution to take a different path; Goal: Detect and Correct Transient Errors 5 24 February 2009
  • 6. Detect and Correct Transient Errors Lockstep VMs with ongoing checkpoints. 1. Tee input to both VMs 2. Compare output from VMs and re-execute on 3. miscompare Compare checkpoints and re-execute on 4. miscompare Log input, interrupts and non-deterministic 5. instructions to allow completely accurate re- execution; 6 24 February 2009
  • 7. Lockstep VMs • Create 2 identical images at VM start time; • Ensure response to each VMexit is identical; − Force them to be identical if necessary (rdtsc) • Deliver interrupt at the identical instruction − At each VMexit, deliver interrupts if they are pending; − Use the count-down PMU counter to force a synchronization point if no VMexits happen and deliver interrupts at that point. • Log VMexit return values as necessary (for replay); • Log when interrupts are delivered (for replay) 7 24 February 2009
  • 8. I/O • Input is sent to both VMs (network and disk); − Input is logged as being part of a specific checkpoint so on replay inputs can come from the log; • Output is compared and if equal, is sent out; − If not equal, re-execute from the last good checkpoint − Blktap driver modified to allow disk output to be compared − Network backend driver modified to allow network output to be compared; − Output is counted so on replay the correct number of re- created outputs can be discarded and not output twice. 8 24 February 2009
  • 9. Checkpointing and Comparing Incremental, periodic checkpoints of each VM • − At exactly the same instruction; − Utilize COW or copy at checkpoint time; − Mark the checkpoint event in the input and output streams − Immediate continue execution after checkpoint done; After checkpoint X is done, compare the incremental • checkpoints − If equal, delete one of the checkpoint − If equal, delete checkpoint X-1 + any logs for x-1; − If not equal, then do a replay from checkpoint X-1; At checkpoint event, tell input to start a new log; • • At checkpoint event, tell output to record the output count and start a new count for the next checkpoint 9 24 February 2009
  • 10. Replay from Checkpoint X • Restore the registers and memory image • Tell input i/o to replay input from log for checkpoint X • Tell output that we are replaying from checkpoint X and it then throws away as many i/o’s as were done since checkpoint X. 10 24 February 2009
  • 11. Initial Limitations • Uniprocessor VMs • HVM guests • Single Node implementation 11 24 February 2009
  • 12. Performance Data to Date • Implementation to date includes lockstep VMs and disk i/o input funneling and output checking (network i/o is implemented but not yet tested) • Bonnie i/o benchmark didn’t show any degradation; • SpecCPU benchmark suite showed 2-5% degradation. 12 24 February 2009
  • 13. Plans • Implement the checkpoint and replay code • Work on performance of lockstep and checkpoint • Investigate UP, HVM and single-site restrictions. 13 24 February 2009