Problem Reporting and Analysis Linux on System z -
How to survive a Linux Critical Situation

    Sven Schuetz
    Linux on System z Development and Service
    sven@de.ibm.com







Agenda



 Introduction
 How to help us to help you
 Systems monitoring
 How to dump a Linux on System z
 Real Customer cases







Introductory remarks



 Problem analysis looks straightforward on the charts, but it might have taken weeks to get done.
      – A problem does not necessarily show up at its place of origin
 The more information is available, the sooner the problem can be solved; gathering and submitting additional information again and again usually introduces delays.
 This presentation can only introduce some tools and how they can be used; comprehensive documentation of their capabilities is found in each tool's own documentation.
 Do not forget to update your systems







Describe the problem



 Get as much information as possible about the circumstances:
      – What is the problem?
      – When did it happen? (date and time; important for digging into logs)
      – Where did it happen? One or more systems, production or test environment?
      – Is this a first time occurrence?
      – If it occurred before: how frequently does it occur?
      – Is there any pattern?
      – Was anything changed recently?
      – Is the problem reproducible?

 Write down as much information as possible about the problem!






Describe the environment



 Machine Setup
   – Machine type (z196, z10, z9, ...)
   – Storage Server (ESS800, DS8000, other vendors' models)
   – Storage attachment (FICON, ESCON, FCP, how many channels)
   – Network (OSA (type, mode), HiperSockets) ...
 Infrastructure setup
   – Clients
   – Other computer systems
   – Network topologies
   – Disk configuration
 Middleware setup
   – Databases, web servers, SAP, TSM, ... (including version information)



Troubleshooting First-Aid Kit (1/2)



 Install packages required for debugging
   – s390-tools / s390-utils
      • dbginfo.sh
   – sysstat
      • sadc/sar
      • iostat
   – procps
      • vmstat, top, ps
   – net-tools
      • netstat
   – dump tools: crash / lcrash
      • lcrash (lkcdutils) available with SLES9 and SLES10
      • crash available on SLES11
      • crash in all RHEL distributions
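
 As a hedged sketch, installing the sysstat package (the package name is the same on both distribution families; the other tools listed above typically come with the base install or with s390-tools):

      root@larsson:~> zypper install sysstat    # SLES 11
      root@larsson:~> yum install sysstat       # RHEL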




Troubleshooting First-Aid Kit (2/2)



 Collect dbginfo.sh output
    – Proactively in healthy system
    – When problems occur – then compare with healthy system
 Collect system data
    – Always archive syslog (/var/log/messages)
    – Start sadc (System Activity Data Collection) service when appropriate (please
      include disk statistics)
    – Collect z/VM MONWRITE Data if running under z/VM when appropriate
 When the system hangs
    – Take a dump
       • Include System.map, Kerntypes (if available) and vmlinux file
    – See “Using the dump tools” book on
      http://download.boulder.ibm.com/ibmdl/pub/software/dw/linux390/docu/l26ddt02.pdf
 Enable extended tracing in /sys/kernel/debug/s390dbf for the affected subsystem (a sketch follows below)
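
 For illustration, assuming the qeth driver is under suspicion, the trace level of its debug area could be raised (levels range from 0 to 6; the area name qeth_trace is an assumption that depends on driver and kernel version):

      root@larsson:~> echo 6 > /sys/kernel/debug/s390dbf/qeth_trace/level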



dbginfo Script (1/2)



 dbginfo.sh is a script that collects various system-related files for debugging purposes. It generates a tar archive which can be attached to PMRs / Bugzilla entries
 Part of the s390-tools package in SUSE and recent Red Hat distributions
   – dbginfo.sh gets continuously improved by service and development
   – Can be downloaded directly from the developerWorks website:
     http://www.ibm.com/developerworks/linux/linux390/s390-tools.html
 It is similar to Red Hat's sosreport or Novell's supportconfig

      root@larsson:~> dbginfo.sh
      Create target directory /tmp/DBGINFO-2011-01-15-22-06-20-t6345057
      Change to target directory /tmp/DBGINFO-2011-01-15-22-06-20-t6345057
      [...]



dbginfo Script (2/2)

 Linux Information:
   – /proc/[version, cpu, meminfo, slabinfo, modules, partitions, devices ...]
   – System z specific device driver information: /proc/s390dbf (RHEL 4 only) or /sys/kernel/debug/s390dbf
   – Kernel messages /var/log/messages
   – Reads configuration files in directory /etc/ [ccwgroup.conf, modules.conf, fstab]
   – Uses several commands: ps, dmesg
   – Query setup scripts
      • lscss, lsdasd, lsqeth, lszfcp, lstape
   – And much more
 z/VM information:
   – Release and service level: q cplevel
   – Network setup: q [lan, nic, vswitch, v osa]
   – Storage setup: q [set, v dasd, v fcp, q pav ...]
   – Configuration/memory setup: q [stor, v stor, xstore, cpus ...]
   – When the system runs as a z/VM guest, ensure that the guest has the appropriate privilege class authorities to issue the commands



SADC/SAR
 Capture Linux performance data with sadc/sar
   – CPU utilization
   – Disk I/O overview and on device level
   – Network I/O and errors on device level
   – Memory usage/Swapping
   – ... and much more
   – Reports statistics data over time and creates average values for each item
 SADC example (for more see man sadc)
   – System Activity Data Collector (sadc) --> data gatherer
   – /usr/lib64/sa/sadc [options] [interval [count]] [binary outfile]
   – /usr/lib64/sa/sadc 10 20 sadc_outfile







SADC/SAR (cont'd)

   – /usr/lib64/sa/sadc -d 10 sadc_outfile
   – -d option: statistics for disk
   – Should be started as a service during system start (see the sketch below)
 SAR example (for more see man sar)
   – System Activity Report (sar) command --> reporting tool
   – sar -A
   – -A option: reports all the collected statistics
   – sar -A -f sadc_outfile >sar_outfile
 Please include the binary sadc data and the sar -A output when submitting SADC/SAR information to IBM support
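
 A minimal sketch of starting sadc as a boot-time service on the distributions covered here, assuming the standard sysstat init script:

      root@larsson:~> chkconfig sysstat on
      root@larsson:~> service sysstat start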






CPU utilization

Per-CPU values: watch out for
   – system time (kernel time)
   – iowait time (slow I/O subsystem)
   – steal time (time taken by other guests)
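
A hedged example of reporting such per-CPU values from previously collected sadc data (-P ALL prints one line per CPU):

      root@larsson:~> sar -u -P ALL -f sadc_outfile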






Disk I/O rates




 read/write operations
   – per I/O device
   – tps: transactions
   – rd/wr_secs: sectors
 Is your I/O balanced? Maybe you should stripe your LVs
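
For example, per-device figures like these can be reported from the binary sadc data, assuming sadc was started with -d as shown earlier:

      root@larsson:~> sar -d -f sadc_outfile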






Linux on System z dump tools

 DASD dump tool
     – Writes dump directly on DASD partition
     – Uses s390 standalone dump format
     – ECKD and FBA DASDs supported
     – Single volume and multiple volume (for large systems) dump possible
     – Works in z/VM and in LPAR
 SCSI dump tool
     – Writes dump into filesystem
     – Uses lkcd dump format
     – Works in z/VM and in LPAR
 VMDUMP
     – Writes dump to vm spool space (VM reader)
     – z/VM specific dump format, dump must be converted
     – Only available when running under z/VM
 Tape dump tool
     – Writes dump directly on ESCON/FICON tape device
     – Uses s390 standalone dump format



DASD dump tool – general usage

1. Format and partition the dump device
      root@larsson:~> dasdfmt -f /dev/dasd<x> -b 4096
      root@larsson:~> fdasd /dev/dasd<x>

2. Prepare the dump device in Linux
      root@larsson:~> zipl -d /dev/dasd<x1>

3. Stop all CPUs
4. Store status
5. IPL the dump device
6. Copy the dump to Linux

      root@larsson:~> zgetdump /dev/dasd<x1> > dump_file




DASD dump under z/VM

 Prepare the dump device under Linux, if possible in a 64-bit environment:
      root@larsson:~> zipl -d /dev/dasd<x1>

 After a Linux crash, issue these commands on the 3270 console:
      #cp cpu all stop
      #cp cpu 0 store status
      #cp i <dasd_devno>

 Wait until the dump is saved on the device:
      00: zIPL v1.6.0 dump tool (64 bit)
      00: Dumping 64 bit OS
      00: 00000087 / 00000700 MB 0
      ...
      00: Dump successful

 On older distributions, completion is indicated only by a disabled wait PSW
 Attach the dump device to a Linux system with the dump tools installed
 Store the dump to the Linux file system from the dump device (e.g. zgetdump)



DASD dump on LPAR (1/2)

[figure]

DASD dump on LPAR (2/2)

[figure]



Multi volume dump

 zipl can now dump to multiple DASDs, so it is possible to dump system images that are larger than a single DASD.
   – You can specify up to 32 ECKD DASD partitions for a multi-volume dump
 What are dumps good for?
   – Full snapshot of system state, taken at any point in time (e.g. after a system has crashed, or of a running system)
   – Can be used to analyse system state beyond the messages written to the syslog
   – Internal data structures not exported anywhere else
   – Obtain messages which have not been written to the syslog due to a crash



Multi volume dump (cont'd)
 How to prepare a set of ECKD DASD devices for a multi-volume dump? (64-bit systems only)
   – We use two DASDs in this example:
      root@larsson:~> dasdfmt -f /dev/dasdc -b 4096
      root@larsson:~> dasdfmt -f /dev/dasdd -b 4096

   – Create the partitions with fdasd. The sum of the partition sizes must be sufficiently large (the memory size + 10 MB):
      root@larsson:~> fdasd /dev/dasdc
      root@larsson:~> fdasd /dev/dasdd

   – Create a file called sample_dump_conf containing the device nodes (e.g. /dev/dasdc1) of the two partitions, separated by one or more line feed characters (see the sketch after this list)
   – Prepare the volumes using the zipl command:
      root@larsson:~> zipl -M sample_dump_conf
      [...]
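
 For illustration, the configuration file simply lists the two partition device nodes, one per line:

      root@larsson:~> cat sample_dump_conf
      /dev/dasdc1
      /dev/dasdd1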




Multi volume dump (cont'd)

 To obtain a dump with the multi-volume DASD dump tool, perform the following steps:
   – Stop all CPUs, store status on the IPL CPU
   – IPL the dump tool using one of the prepared volumes, either 4711 or 4712
   – After the dump tool is IPLed, you will see messages that indicate the progress of the dump. Then you can IPL Linux again
      #cp cpu all stop
      #cp cpu 0 store status
      #cp ipl 4711

 Copying a multi-volume dump to a file
   – Use zgetdump without any option to copy the dump parts to a file:
      root@larsson:~> zgetdump /dev/dasdc > mv_dump_file





Multi volume dump (cont'd)

 Display information about the involved volumes:

      root@larsson:~> zgetdump -d /dev/dasdc
      '/dev/dasdc' is part of Version 1 multi-volume dump, which is
      spread along the following DASD volumes:
      0.0.4711 (online, valid)
      0.0.4712 (online, valid)
      [...]

 Display information about the dump itself:
      root@larsson:~> zgetdump -i /dev/dasdc
      Dump device: /dev/dasdc
      >>> Dump header information <<<
      Dump created on: Fri Aug  7 15:12:41 2009  [...]
      Multi-volume dump: Disk 1 (of 2)
      Reading dump contents from
      0.0.4711.................................
      Dump ended on:   Fri Aug  7 15:12:52 2009
      Dump End Marker found: this dump is valid.




SCSI dump tool – general usage

1. Create a partition with PC-BIOS disk layout (fdisk)
2. Format the partition with an ext2 or ext3 filesystem
3. Install the dump tool:
   – mount and prepare the disk:
        root@larsson:~> mount /dev/sda1 /dumps
        root@larsson:~> zipl -D /dev/sda1 -t /dumps

   – Optional: /etc/zipl.conf:
        dumptofs=/dev/sda1
        target=/dumps

4. Stop all CPUs
5. Store status
6. IPL the dump device

      The dump tool creates dumps directly in the filesystem
      SCSI dump is supported for LPARs and, as of z/VM 5.4, under z/VM



SCSI dump under z/VM


 SCSI dump from z/VM is supported as of z/VM 5.4
 Issue SCSI dump

      #cp cpu all stop
      #cp cpu 0 store status
      #cp set dumpdev portname 47120763 00ce93a7 lun 47120000 00000000 bootprog 0
      #cp ipl 4b49 dump


 To access the dump, mount the dump partition







SCSI dump on LPAR


 Select the CPC image for the LPAR to dump
 Go to the Load panel
 Issue SCSI dump
   – FCP device
   – WWPN
   – LUN







VMDUMP

 The only method to dump NSSes or DCSSes under z/VM
 Works nondisruptively
 Create dump:
        #cp vmdump to cmsguest
 Receive dump:
   – Store the dump from the reader into a CMS dump file:
        #cp dumpload
   – Transfer the dump from CMS to Linux, e.g. via FTP
   – NEW: vmur device driver:
       root@larsson:~> vmur rec <spoolid> vmdump
 Linux tool to convert a vmdump to lkcd format:
       root@larsson:~> vmconvert vmdump linux.dump
 Problem: the dump process is relatively slow




How to obtain information about a dump

 Display information about the involved volume:

      root@larsson:~> zgetdump -d /dev/dasdb
      '/dev/dasdb' is Version 0 dump device.
      Dump size limit: none

 Display information about the dump itself:
      root@larsson:~> zgetdump -i /dev/dasdb1
      Dump device: /dev/dasdb1

      Dump created on: Thu Oct  8 15:44:49 2009

      Magic number:   0xa8190173618f23fd
      Version number: 3
      Header size:    4096
      Page size:      4096
      Dumped memory:  1073741824
      Dumped pages:   262144
      Real memory:    1073741824
      cpu id:         0xff00012320978000
      System Arch:    s390x (ESAME)
      Build Arch:     s390x (ESAME)
      >>> End of Dump header <<<

      Dump ended on:  Thu Oct  8 15:45:01 2009
      Dump End Marker found: this dump is valid.



How to obtain information about a dump (cont'd)
 To display information about the dump itself (dump header) and check whether the dump is valid, use lcrash with the options '-i' and '-d':
      root@larsson:~> lcrash -i -d /dev/dasdb1
             Dump Type: s390 standalone dump
                Machine: s390x (ESAME)
                 CPU ID: 0xff00012320978000

           Memory Start: 0x0
             Memory End: 0x40000000
            Memory Size: 1073741824

           Time of dump: Thu Oct  8 15:44:49 2009
        Number of pages: 262144
       Kernel page size: 4096
         Version number: 3
           Magic number: 0xa8190173618f23fd
       Dump header size: 4096
             Dump level: 0x4
             Build arch: s390x (ESAME)
       Time of dump end: Thu Oct  8 15:45:01 2009

      End Marker found! Dump is valid!



Automatic dump on panic (SLES 10/11, RHEL 5/6): dumpconf

 The dumpconf tool configures a dump device that is used for automatic
  dump in case of a kernel panic.
   – The command can be installed as a service script under /etc/init.d/dumpconf or can be called manually.
   – Start service: # service dumpconf start
   – It reads the configuration file /etc/sysconfig/dumpconf.
   – Example configuration for a CCW dump device (DASD) and re-IPL after dump:

       ON_PANIC=dump_reipl  
       DUMP_TYPE=ccw
       DEVICE=0.0.4711
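
   – To verify the resulting panic action, a sketch assuming the init-script interface mentioned above:

        root@larsson:~> service dumpconf status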







Automatic dump on panic (SLES 10/11, RHEL 5/6): dumpconf (cont'd)
      – Example configuration for FCP dump device (SCSI disk):
           ON_PANIC=dump 
           DUMP_TYPE=fcp
           DEVICE=0.0.4714
           WWPN=0x5005076303004712 
           LUN=0x4047401300000000
           BOOTPROG=0
           BR_LBA=0

      – Example configuration for re-IPL without taking a dump, if a kernel
        panic occurs:
           ON_PANIC=reipl
      – Example of executing a CP command, and rebooting from device 4711
        if a kernel panic occurs:
           ON_PANIC=vmcmd    
           VMCMD_1="MSG <vmguest> Starting VMDUMP" 
           VMCMD_2="VMDUMP"
           VMCMD_3="IPL 4711"



Get the dump and send it to the service organization

 DASD/Tape:
   – Store the dump to the Linux file system from the dump device:
        root@larsson:~> zgetdump /dev/<device node> > dump_file
   – Alternative: lcrash (compression possible)
        root@larsson:~> lcrash -d /dev/dasdxx -s <dir>

 SCSI:
   – Get the dump from the filesystem
 Additional files needed for dump analysis:
   – SUSE (lcrash tool): /boot/System.map-xxx and /boot/Kerntypes-xxx
   – Red Hat & SUSE (crash tool): vmlinux file (kernel with debug info) contained in the debug kernel rpms:
      • Red Hat: kernel-debuginfo-xxx.rpm and kernel-debuginfo-common-xxx.rpm
      • SUSE: kernel-default-debuginfo-xxx.rpm






Handling large dumps

 Compress the dump and split it into parts of 1 GB
      root@larsson:~> zgetdump /dev/dasdc1 | gzip | split -b 1G
 Several compressed files such as xaa, xab, xac, ... are created
 Create md5 sums of the compressed files
      root@larsson:~> md5sum xa* > dump.md5
 Upload all parts together with the md5 information

 Verification of the parts by the receiver
      root@larsson:~> md5sum -c dump.md5
      xaa: OK
      [...]

 Merge the parts and uncompress the dump
      root@larsson:~> cat xa* | gunzip -c > dump




Transferring dumps

 Transferring single-volume dumps with ssh

        root@larsson:~> zgetdump /dev/dasdc1 | ssh user@host "cat > dump_file_on_target_host"

 Transferring multi-volume dumps with ssh

        root@larsson:~> zgetdump /dev/dasdc | ssh user@host "cat > multi_volume_dump_file_on_target_host"

 Transferring a dump with ftp
   – Establish an ftp session with the target host, log in and set the transfer mode to binary
   – Send the dump to the host

        ftp> put |"zgetdump /dev/dasdc1" <dump_file_on_target_host>



Dump tool summary

Tool                 DASD                      Tape                      SCSI                      VMDUMP
                     (stand-alone tool)        (stand-alone tool)        (stand-alone tool)

Environment          VM & LPAR                 VM & LPAR                 VM & LPAR                 VM

Preparation          zipl -d /dev/<dump_dev>   zipl -d /dev/<dump_dev>   mkdir /dumps/mydumps      ---
                                                                         zipl -D /dev/sda1 ...

Creation             Stop CPU & Store status, then ipl <dump_dev_CUU>                              vmdump

Dump medium          ECKD or FBA               Tape cartridges           Linux file system         VM reader
                                                                         on a SCSI disk

Copy to              zgetdump /dev/<dump_dev>  zgetdump /dev/<dump_dev>  ---                       DUMPLOAD, ftp ...,
filesystem           > dump_file               > dump_file                                         vmconvert ...

Viewing              lcrash or crash           lcrash or crash           lcrash or crash           lcrash or crash

See "Using the dump tools" book on
http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html




                                               Customer Cases







Availability: Guest spontaneously reboots

 Configuration:
   – Oracle RAC server or other HA solution under z/VM
 Problem Description:
   – Occasionally guests spontaneously reboot without any notification or console message
 Tools used for problem determination:
   – cp instruction trace of the (re)IPL code
   – Crash dump taken after the trace was hit

[figure: HA cluster – two Linux guests under z/VM, each running an Oracle RAC server, communicating with each other and with the shared Oracle RAC database]



Availability: Guest spontaneously reboots - Steps to find root
cause

 Question: Who rebooted the system?
 Step 1
   – Find out address of (re)ipl code in the system map
   – Use this address to set instruction trace

      cd /boot
      grep machine_restart System.map-2.6.16.60-0.54.5-default
      000000000010c364 T machine_restart
      00000000001171c8 t do_machine_restart
      0000000000603200 D _machine_restart







Availability: Guest spontaneously reboots - Steps to find root
cause (cont'd)

 Step 2
   – Set a CP instruction trace on the reboot address
   – The system is halted at that address when a reboot is triggered

      CP CPU ALL TR IN R 10C364.4
      HCPTRI1027I An active trace set has turned RUN off

      CP Q TR

      NAME  INITIAL     (ACTIVE)
        1     INSTR   PSWA  0010C364-0010C367
              TERM    NOPRINT  NORUN SIM
              SKIP 00000  PASS 00000 STOP 00000  STEP 00000
              CMD  NONE

       -> 000000000010C364'  STMF    EBCFF0780024 >> 000000003A557D48
          CC 2



Availability: Guest spontaneously reboots - Steps to find root
cause (cont'd)

 Step 3
   – Take a dump when the (re)ipl code is hit
      cp cpu all stop
      cp store status
      Store complete.

      cp i 4fc6
      Tracing active at IPL
      HCPGSP2630I The virtual machine is placed in CP mode due to a
      SIGP stop and store status from CPU 00.
      zIPL v1.6.3-0.24.5 dump tool (64bit)
      Dumping 64 bit OS
      00000128 / 00001024 MB
      ......
      00001024 / 00001024 MB
      Dump successful
      HCPIR450W CP entered, disabled wait PSW 00020000 80000000
      00000000 00000000



Availability: Guest spontaneously reboots - Steps to find root
cause (cont'd)

 Step 4
   – Save the dump in a file
      zgetdump /dev/dasdb1 > dump_file
      Dump device: /dev/dasdb1

      >>> Dump header information <<<
      Dump created on: Wed Oct 27 12:00:40 2010
      Magic number:    0xa8190173618f23fd
      Version number:  4
      Header size:     4096
      Page size:       4096
      Dumped memory:   1073741824
      Dumped pages:    262144
      Real memory:     1073741824
      cpu id:          0xff00012320948000
      System Arch:     s390x (ESAME)
      Build Arch:      s390x (ESAME)
      >>> End of Dump header <<<

      Reading dump content ................................
      Dump ended on:   Wed Oct 27 12:00:52 2010
      Dump End Marker found: this dump is valid.



Availability: Guest spontaneously reboots - Steps to find root
cause (cont'd)

 Step 5
   – Use (l)crash to find out which process triggered the reboot

       STACK:
       0 start_kernel+950 [0x6a690e]
       1 _stext+32 [0x100020]
      ================================================================
      TASK HAS CPU (1): 0x3f720650 (oprocd.bin):
       LOWCORE INFO:
        -psw      : 0x0704200180000000 0x000000000010c36a
        -function : machine_restart+6
        -prefix   : 0x3f438000
        -cpu timer: 0x7fffffff 0xff9e6c00
        -clock cmp: 0x00c6ca69 0x22337760
        -general registers:

      <snip>                          /var/opt/oracle/product/crs/bin/oprocd.bin

       STACK:
       0 __handle_sysrq+248 [0x361240]
       1 write_sysrq_trigger+98 [0x2be796]
       2 sys_write+392 [0x225a68]
       3 sysc_noemu+16 [0x1179a8]



Availability: Guest spontaneously reboots (cont'd)



 Problem Origin:
      – HA component erroneously detected a system hang
        • hangcheck_timer module did not receive timer IRQ
        • z/VM 'time bomb' switch
        • TSA monitor
 z/VM cannot guarantee 'real-time' behavior if overloaded
      – Longest 'hang' observed: 37 seconds(!)
 Solution:
      – Offload HA workload from overloaded z/VM
        • e.g. use separate z/VM
        • or: run large Oracle RAC guests in LPAR







Network: network connection is too slow

 Configuration:
   – z/VSE running CICS, connection to DB2 in Linux on System z
   – HiperSockets connection from Linux to z/VSE
   – But also applies to HiperSockets connections between Linux and z/OS
 Problem Description:
   – When CICS transactions were monitored, some transactions take a
     couple of seconds instead of milliseconds
 Tools used for problem determination:
   – dbginfo.sh
   – s390 debug feature
   – sadc/sar
   – CICS transaction monitor





Network: network connection is too slow (cont'd)


 s390 debug feature
   – Check for qeth errors:
      cat /sys/kernel/debug/s390dbf/qeth_qerr
      00 01282632346:099575 2 - 00 0000000180b20218  71 6f 75 74 65 72 72 00 | qouterr.
      00 01282632346:099575 2 - 00 0000000180b20298  20 46 31 35 3d 31 30 00 |  F15=10.
      00 01282632346:099576 2 - 00 0000000180b20318  20 46 31 34 3d 30 30 00 |  F14=00.
      00 01282632346:099576 2 - 00 0000000180b20390  20 71 65 72 72 3d 41 46 |  qerr=AF
      00 01282632346:099576 2 - 00 0000000180b20408  20 73 65 72 72 3d 32 00 |  serr=2.


 dbginfo file
   – Check for buffer count:
      cat /sys/devices/qeth/0.0.1e00/buffer_count
      16

 Problem Origin:
   – Too few inbound buffers




Network: network connection is too slow (cont'd)



 Solution:
      – Increase inbound buffer count (default: 16, max 128)
      – Check actual buffer count with 'lsqeth -p'
      – Set the inbound buffer count in the appropriate config file:
        • SUSE SLES10:
           - in /etc/sysconfig/hardware/hwcfg-qeth-bus-ccw-0.0.F200
           - add QETH_OPTIONS="buffer_count=128"
        • SUSE SLES11:
           - in /etc/udev/rules.d/51-qeth-0.0.f200.rules add ACTION=="add",
             SUBSYSTEM=="ccwgroup", KERNEL=="0.0.f200",
             ATTR{buffer_count}="128"
        • Red Hat:
           - in /etc/sysconfig/network-scripts/ifcfg-eth0
           - add OPTIONS="buffer_count=128"





High Disk response times


 Configuration:
   – z10, HDS Storage Server (Hyper PAV enabled)
   – z/VM, Linux with Oracle Database
   – VM controlled Minidisks attached to Linux, LVM on top
 Problem description:
   – I/O throughput not matching expectations
   – Oracle Database shows poor performance because of that
   – One LVM volume showing significant stress
 Tools used for problem determination:
   – dbginfo.sh
   – sadc/sar
   – z/VM Monitor data



High Disk response times (cont'd)
Observation in Linux – watch response time, throughput and utilization:

dm-9           0.00         0.00     49.75        0.00 19790.50   0.00   795.56   17.89   15.79      2.01 100.00


 Conclusion

 PAV not being utilized
 No Hyper PAV support in SLES10 SP2
 Static PAV not possible with current setup (VM controlled minidisks)
 Need to look for other ways for more parallel I/O
   – Link same minidisk multiple times to a guest
   – Use smaller minidisks and increase striping in Linux


High Disk response times (cont'd)
Initial and proposed setup
[figure: each setup maps a physical disk to logical disk(s) in z/VM and logical disk(s) in Linux]
1. Initial setup
2. Link minidisks to the guest multiple times (multipath in Linux)
3. Smaller disks, more stripes (striping via LVM / device mapper)



High Disk response times (cont'd)
New Observation in Linux

 Response times stay equal
 Throughput equal
 No PAV being used!!







High Disk response times (cont'd)
Solution: check PAV setup in VM







High Disk response times (cont'd)
Finally:
     dasdaen         133.33     0.00  278.11    0.00 12298.51     0.00    88.44     0.79    2.83   1.48  41.29
     dasdcbt         161.19     0.00  248.26    0.00 12260.70     0.00    98.77     0.91    3.47   1.88  46.77
     dasdfwc         149.75     0.00  266.17    0.00 12374.13     0.00    92.98     1.88    7.07   2.54  67.66
     dasdael         162.19     0.00  250.25    0.00 12483.58     0.00    99.77     1.90    7.57   2.86  71.64
     dasddyz         134.83     0.00  277.61    0.00 12431.84     0.00    89.56     0.75    2.71   1.68  46.77
     dasdaem         151.24     0.00  266.17    0.00 12595.02     0.00    94.64     2.01    7.61   2.82  75.12
     dasdcbr         169.65     0.00  242.79    0.00 12386.07     0.00   102.03     1.72    7.05   2.83  68.66
     dasdfwd         162.69     0.00  249.25    0.00 12348.26     0.00    99.08     1.92    7.70   2.83  70.65
     dasddyy         157.21     0.00  259.70    0.00 12409.95     0.00    95.57     2.58    9.96   3.05  79.10
     dasddyx         174.63     0.00  237.81    0.00 12374.13     0.00   104.07     1.76    7.38   2.93  69.65
     dasdcbs         144.78     0.00  272.14    0.00 12264.68     0.00    90.14     2.53    9.31   2.89  78.61
     dasda             0.00     0.00    0.00    1.00     0.00     3.98     8.00     0.01   10.00   5.00   0.50
     dasdq             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
     dasdss            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
     dasdadx           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.
     dasdawh           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
     dasdamk           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
     dasdaek         160.70     0.00  255.22    0.00 12382.09     0.00    97.03     2.27    8.95   2.88  73.63
     dasdcbq         148.76     0.00  265.67    0.00 12372.14     0.00    93.14     2.14    8.01   2.85  75.62
     dasddyw         162.19     0.00  254.23    0.00 12384.08     0.00    97.42     2.12    8.40   2.90  73.63
     dasdfwe         146.27     0.00  271.64    0.00 12419.90     0.00    91.44     2.63    9.71   2.80  76.12
     dasdfwf         162.19     0.00  249.75    0.00 12455.72     0.00    99.75     0.71    2.83   1.79  44.78
     dasdb             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
     dm-0              0.00     0.00 1646.77    0.00 49494.53     0.00    60.11     5.08    3.04   0.36  59.70
     dm-1              0.00     0.00 1665.17    0.00 49482.59     0.00    59.43    15.00    9.04   0.56  93.53
     dm-2              0.00     0.00 1660.70    0.00 49432.84     0.00    59.53    13.46    8.11   0.55  90.55
     dm-3              0.00     0.00 1647.26    0.00 49490.55     0.00    60.09    12.05    7.32   0.53  87.56
     dm-4              0.00     0.00 1646.77    0.00 49494.53     0.00    60.11     5.08    3.04   0.36  59.70
     dm-5              0.00     0.00 1665.17    0.00 49482.59     0.00    59.43    15.00    9.04   0.56  93.53
     dm-6              0.00     0.00 1660.70    0.00 49432.84     0.00    59.53    13.46    8.11   0.55  90.55
     dm-7              0.00     0.00 1647.26    0.00 49490.55     0.00    60.09    12.06    7.32   0.53  87.56
     dm-8              0.00     0.00    0.00    1.99     0.00     7.96     8.00     0.00    0.00   0.00   0.00
     dm-9              0.00     0.00  497.51    0.00 197900.50    0.00   795.56     7.89   15.79   2.01 100.00




Bonding throughput not matching expectations



 Configuration:
   – SLES10 system, connected via OSA card and using the bonding driver
 Problem Description:
   – Bonding reported as only working at 100 Mbps
   – FTP also slow
 Tools used for problem determination:
   – dbginfo.sh, netperf
 Problem Origin:
   – ethtool cannot determine the line speed correctly because qeth does not report it
 Solution:
   – Ignore the 100 Mbps message – or upgrade to SLES11

        bonding: bond1: Warning: failed to get speed and duplex
        from eth0, assumed to be 100Mb/sec and Full



Availability: Unable to mount file system after LVM changes


 Configuration:
   – Linux HA cluster with two nodes
   – Accessing the same DASDs, which are exported via OCFS2
 Problem Description:
   – Added one node to the cluster, brought the Logical Volume online
   – Unable to mount the filesystem from any node after that
 Tools used for problem determination:
   – dbginfo.sh

[figure: Linux 1 and Linux 2 sharing an OCFS2 filesystem on a Logical Volume over dasda b c d e f]



Availability: Unable to mount file system after LVM changes
(cont'd)

 Problem Origin:
   – LVM metadata was overwritten when adding the 3rd node
   – e.g. superblock not found
 Solution:
   – Extract the metadata from a running node (/etc/lvm/backup) and write it to disk again (a sketch follows below)

[figure: a third node running {pv|vg|lv}create against the shared OCFS2 Logical Volume, leaving the disk order scrambled: dasdf e c a b d]
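
A minimal sketch of such a recovery using standard LVM2 commands, assuming a volume group named vg01 and a backup file taken from the surviving node (the PV UUID and device name are illustrative):

      root@larsson:~> pvcreate --uuid <pv-uuid-from-backup> --restorefile /etc/lvm/backup/vg01 /dev/dasdc1
      root@larsson:~> vgcfgrestore -f /etc/lvm/backup/vg01 vg01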





Kernel panic: Low address protection


 Configuration:
      – z10 only
      – High work load
      – Likelihood increases with the number of multithreaded applications running
 Problem Description:
      – Concurrent access to pages to be removed from the page table
 Tools used for problem determination:
      – crash/lcrash
 Problem Origin:
      – Race condition in memory management in firmware
 Solution:
      – Upgrade to latest firmware!
      – Upgrade to latest kernels – fix to be integrated in all supported distributions



Más contenido relacionado

La actualidad más candente

Installing tivoli system automation for high availability of db2 udb bcu on a...
Installing tivoli system automation for high availability of db2 udb bcu on a...Installing tivoli system automation for high availability of db2 udb bcu on a...
Installing tivoli system automation for high availability of db2 udb bcu on a...Banking at Ho Chi Minh city
 
Performance characteristics of traditional v ms vs docker containers (dockerc...
Performance characteristics of traditional v ms vs docker containers (dockerc...Performance characteristics of traditional v ms vs docker containers (dockerc...
Performance characteristics of traditional v ms vs docker containers (dockerc...Boden Russell
 
Docker on Power Systems
Docker on Power SystemsDocker on Power Systems
Docker on Power SystemsCesar Maciel
 
Asiabsdcon14 lavigne
Asiabsdcon14 lavigneAsiabsdcon14 lavigne
Asiabsdcon14 lavigneDru Lavigne
 
Evoluation of Linux Container Virtualization
Evoluation of Linux Container VirtualizationEvoluation of Linux Container Virtualization
Evoluation of Linux Container VirtualizationImesh Gunaratne
 
GlusterFS Update and OpenStack Integration
GlusterFS Update and OpenStack IntegrationGlusterFS Update and OpenStack Integration
GlusterFS Update and OpenStack IntegrationEtsuji Nakai
 
Ceph Day Chicago - Brining Ceph Storage to the Enterprise
Ceph Day Chicago - Brining Ceph Storage to the Enterprise Ceph Day Chicago - Brining Ceph Storage to the Enterprise
Ceph Day Chicago - Brining Ceph Storage to the Enterprise Ceph Community
 
Lavigne bsdmag-jan2012
Lavigne bsdmag-jan2012Lavigne bsdmag-jan2012
Lavigne bsdmag-jan2012Dru Lavigne
 
Real-time soultion
Real-time soultionReal-time soultion
Real-time soultionNylon
 
bfarm-v2
bfarm-v2bfarm-v2
bfarm-v2Zeus G
 
Lavigne bsdmag apr13
Lavigne bsdmag apr13Lavigne bsdmag apr13
Lavigne bsdmag apr13Dru Lavigne
 

La actualidad más candente (20)

Installing tivoli system automation for high availability of db2 udb bcu on a...
Installing tivoli system automation for high availability of db2 udb bcu on a...Installing tivoli system automation for high availability of db2 udb bcu on a...
Installing tivoli system automation for high availability of db2 udb bcu on a...
 
Performance characteristics of traditional v ms vs docker containers (dockerc...
Performance characteristics of traditional v ms vs docker containers (dockerc...Performance characteristics of traditional v ms vs docker containers (dockerc...
Performance characteristics of traditional v ms vs docker containers (dockerc...
 
Docker on Power Systems
Docker on Power SystemsDocker on Power Systems
Docker on Power Systems
 
Ilf2013
Ilf2013Ilf2013
Ilf2013
 
Scale9x sun
Scale9x sunScale9x sun
Scale9x sun
 
Asiabsdcon14 lavigne
Asiabsdcon14 lavigneAsiabsdcon14 lavigne
Asiabsdcon14 lavigne
 
Evoluation of Linux Container Virtualization
Evoluation of Linux Container VirtualizationEvoluation of Linux Container Virtualization
Evoluation of Linux Container Virtualization
 
GlusterFS Update and OpenStack Integration
GlusterFS Update and OpenStack IntegrationGlusterFS Update and OpenStack Integration
GlusterFS Update and OpenStack Integration
 
Tlf2014
Tlf2014Tlf2014
Tlf2014
 
Asiabsdcon14
Asiabsdcon14Asiabsdcon14
Asiabsdcon14
 
Olf2013
Olf2013Olf2013
Olf2013
 
Handout2o
Handout2oHandout2o
Handout2o
 
Barrelfish OS
Barrelfish OS Barrelfish OS
Barrelfish OS
 
Ceph Day Chicago - Brining Ceph Storage to the Enterprise
Ceph Day Chicago - Brining Ceph Storage to the Enterprise Ceph Day Chicago - Brining Ceph Storage to the Enterprise
Ceph Day Chicago - Brining Ceph Storage to the Enterprise
 
Lavigne bsdmag-jan2012
Lavigne bsdmag-jan2012Lavigne bsdmag-jan2012
Lavigne bsdmag-jan2012
 
Real-time soultion
Real-time soultionReal-time soultion
Real-time soultion
 
bfarm-v2
bfarm-v2bfarm-v2
bfarm-v2
 
Asiabsdcon15
Asiabsdcon15Asiabsdcon15
Asiabsdcon15
 
Lavigne bsdmag apr13
Lavigne bsdmag apr13Lavigne bsdmag apr13
Lavigne bsdmag apr13
 
Fsoss12
Fsoss12Fsoss12
Fsoss12
 

Destacado (6)

Virtualization Into Cloud
Virtualization Into CloudVirtualization Into Cloud
Virtualization Into Cloud
 
Infrastructure Matters 2014 IBM systems and servers
Infrastructure Matters 2014 IBM systems and serversInfrastructure Matters 2014 IBM systems and servers
Infrastructure Matters 2014 IBM systems and servers
 
Current & Future Linux on System z Technology
Current & Future Linux on System z TechnologyCurrent & Future Linux on System z Technology
Current & Future Linux on System z Technology
 
IBM Smarter Systems : Solving Your Critical It Challenges: Executive Summit
IBM Smarter Systems  : Solving Your Critical It Challenges: Executive SummitIBM Smarter Systems  : Solving Your Critical It Challenges: Executive Summit
IBM Smarter Systems : Solving Your Critical It Challenges: Executive Summit
 
What's New in RHEL 6 for Linux on System z?
What's New in RHEL 6 for Linux on System z?What's New in RHEL 6 for Linux on System z?
What's New in RHEL 6 for Linux on System z?
 
Ibm blade center
Ibm blade centerIbm blade center
Ibm blade center
 

Similar a Problem Reporting and Analysis Linux on System z -How to survive a Linux Critical Situation

Andresen 8 21 02
Andresen 8 21 02Andresen 8 21 02
Andresen 8 21 02FNian
 
Securing Your Linux System
Securing Your Linux SystemSecuring Your Linux System
Securing Your Linux SystemNovell
 
Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Vincent Batts
 
operating system
operating systemoperating system
operating systemIbbad shah
 
Unix and Linux Common Boot Disk Disaster Recovery Tools by Dusan Baljevic
Unix and Linux Common Boot Disk Disaster Recovery Tools by Dusan BaljevicUnix and Linux Common Boot Disk Disaster Recovery Tools by Dusan Baljevic
Unix and Linux Common Boot Disk Disaster Recovery Tools by Dusan BaljevicCircling Cycle
 
Linux containers – next gen virtualization for cloud (atl summit) ar4 3 - copy
Linux containers – next gen virtualization for cloud (atl summit) ar4 3 - copyLinux containers – next gen virtualization for cloud (atl summit) ar4 3 - copy
Linux containers – next gen virtualization for cloud (atl summit) ar4 3 - copyBoden Russell
 
We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster We4IT Group
 
Linux Desktop Automation
Linux Desktop AutomationLinux Desktop Automation
Linux Desktop AutomationRui Lapa
 
Lamp ppt
Lamp pptLamp ppt
Lamp pptReka
 
Linux Disaster Recovery Made Easy
Linux Disaster Recovery Made EasyLinux Disaster Recovery Made Easy
Linux Disaster Recovery Made EasyNovell
 

Similar a Problem Reporting and Analysis Linux on System z -How to survive a Linux Critical Situation (20)

Ibm tivoli security and system z redp4355
Ibm tivoli security and system z redp4355Ibm tivoli security and system z redp4355
Ibm tivoli security and system z redp4355
 
Andresen 8 21 02
Andresen 8 21 02Andresen 8 21 02
Andresen 8 21 02
 
Problem Determination with Linux on System z
Problem Determination with Linux on System zProblem Determination with Linux on System z
Problem Determination with Linux on System z
 
Securing Your Linux System
Securing Your Linux SystemSecuring Your Linux System
Securing Your Linux System
 
Introduction to Linux
Introduction to LinuxIntroduction to Linux
Introduction to Linux
 
Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]
 
operating system
operating systemoperating system
operating system
 
Linux on System z the Toolchain in a Nutshell
Linux on System z the Toolchain in a NutshellLinux on System z the Toolchain in a Nutshell
Linux on System z the Toolchain in a Nutshell
 
Introduction to Linux
Introduction to LinuxIntroduction to Linux
Introduction to Linux
 
Unix and Linux Common Boot Disk Disaster Recovery Tools by Dusan Baljevic
Unix and Linux Common Boot Disk Disaster Recovery Tools by Dusan BaljevicUnix and Linux Common Boot Disk Disaster Recovery Tools by Dusan Baljevic
Unix and Linux Common Boot Disk Disaster Recovery Tools by Dusan Baljevic
 
Lamp ppt
Lamp pptLamp ppt
Lamp ppt
 
Linux containers – next gen virtualization for cloud (atl summit) ar4 3 - copy
Linux containers – next gen virtualization for cloud (atl summit) ar4 3 - copyLinux containers – next gen virtualization for cloud (atl summit) ar4 3 - copy
Linux containers – next gen virtualization for cloud (atl summit) ar4 3 - copy
 
.ppt
.ppt.ppt
.ppt
 
Linux on System z disk I/O performance
Linux on System z disk I/O performanceLinux on System z disk I/O performance
Linux on System z disk I/O performance
 
We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster
 
Linux Desktop Automation
Linux Desktop AutomationLinux Desktop Automation
Linux Desktop Automation
 
Lamp ppt
Lamp pptLamp ppt
Lamp ppt
 
linux
linuxlinux
linux
 
Linux Disaster Recovery Made Easy
Linux Disaster Recovery Made EasyLinux Disaster Recovery Made Easy
Linux Disaster Recovery Made Easy
 
File Systems
File SystemsFile Systems
File Systems
 

Más de IBM India Smarter Computing

TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...IBM India Smarter Computing
 
A Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceA Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceIBM India Smarter Computing
 
IBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM India Smarter Computing
 

Más de IBM India Smarter Computing (20)

All-flash Needs End to End Storage Efficiency
All-flash Needs End to End Storage EfficiencyAll-flash Needs End to End Storage Efficiency
All-flash Needs End to End Storage Efficiency
 
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
 
IBM FlashSystem 840 Product Guide
IBM FlashSystem 840 Product GuideIBM FlashSystem 840 Product Guide
IBM FlashSystem 840 Product Guide
 
IBM System x3250 M5
IBM System x3250 M5IBM System x3250 M5
IBM System x3250 M5
 
IBM NeXtScale nx360 M4
IBM NeXtScale nx360 M4IBM NeXtScale nx360 M4
IBM NeXtScale nx360 M4
 
IBM System x3650 M4 HD
IBM System x3650 M4 HDIBM System x3650 M4 HD
IBM System x3650 M4 HD
 
IBM System x3300 M4
IBM System x3300 M4IBM System x3300 M4
IBM System x3300 M4
 
IBM System x iDataPlex dx360 M4
IBM System x iDataPlex dx360 M4IBM System x iDataPlex dx360 M4
IBM System x iDataPlex dx360 M4
 
IBM System x3500 M4
IBM System x3500 M4IBM System x3500 M4
IBM System x3500 M4
 
IBM System x3550 M4
IBM System x3550 M4IBM System x3550 M4
IBM System x3550 M4
 
IBM System x3650 M4
IBM System x3650 M4IBM System x3650 M4
IBM System x3650 M4
 
IBM System x3500 M3
IBM System x3500 M3IBM System x3500 M3
IBM System x3500 M3
 
IBM System x3400 M3
IBM System x3400 M3IBM System x3400 M3
IBM System x3400 M3
 
IBM System x3250 M3
IBM System x3250 M3IBM System x3250 M3
IBM System x3250 M3
 
IBM System x3200 M3
IBM System x3200 M3IBM System x3200 M3
IBM System x3200 M3
 
IBM PowerVC Introduction and Configuration
IBM PowerVC Introduction and ConfigurationIBM PowerVC Introduction and Configuration
IBM PowerVC Introduction and Configuration
 
A Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceA Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization Performance
 
IBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architecture
 
X6: The sixth generation of EXA Technology
X6: The sixth generation of EXA TechnologyX6: The sixth generation of EXA Technology
X6: The sixth generation of EXA Technology
 
Stephen Leonard IBM Big Data and cloud
Stephen Leonard IBM Big Data and cloudStephen Leonard IBM Big Data and cloud
Stephen Leonard IBM Big Data and cloud
 

Problem Reporting and Analysis Linux on System z -How to survive a Linux Critical Situation

  • 7. IBM Live Virtual Class – Linux on System z Trouble Shooting First-Aid Kit (2/2)  Collect dbginfo.sh output – Proactively in healthy system – When problems occur – then compare with healthy system  Collect system data – Always archive syslog (/var/log/messages) – Start sadc (System Activity Data Collection) service when appropriate (please include disk statistics) – Collect z/VM MONWRITE data when running under z/VM, when appropriate  When the system hangs – Take a dump • Include System.map, Kerntypes (if available) and vmlinux file – See “Using the dump tools” book on http://download.boulder.ibm.com/ibmdl/pub/software/dw/linux390/docu/l26ddt02.pdf  Enable extended tracing in /sys/kernel/debug/s390dbf for the subsystem in question 7 © 2011 IBM Corporation
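The first-aid kit above can be staged ahead of time. A minimal sketch (bash), assuming a hypothetical /var/log/debug-baseline directory and illustrative sadc sampling values; only dbginfo.sh, /var/log/messages and the sadc path are taken from the slides:

  mkdir -p /var/log/debug-baseline
  dbginfo.sh                                    # baseline archive of a healthy system
  # archive the syslog so messages survive log rotation
  cp /var/log/messages /var/log/debug-baseline/messages.$(date +%F)
  # 10-second samples including disk statistics (-d), one hour's worth, in the background
  /usr/lib64/sa/sadc -d 10 360 /var/log/debug-baseline/sadc.$(date +%F) &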
  • 8. IBM Live Virtual Class – Linux on System z dbginfo Script (1/2)  dbginfo.sh is a script to collect various system-related files for debugging purposes. It generates a tar archive which can be attached to PMRs / Bugzilla entries  Part of the s390-tools package in SUSE and recent Red Hat distributions – dbginfo.sh gets continuously improved by service and development. Can be downloaded directly at the developerWorks website: http://www.ibm.com/developerworks/linux/linux390/s390-tools.html  It is similar to the Red Hat tool sosreport or supportconfig from Novell root@larsson:~> dbginfo.sh Create target directory /tmp/DBGINFO-2011-01-15-22-06-20-t6345057 Change to target directory /tmp/DBGINFO-2011-01-15-22-06-20-t6345057 [...] 8 © 2011 IBM Corporation
  • 9. IBM Live Virtual Class – Linux on System z dbginfo Script (2/2)  Linux Information: – /proc/[version, cpu, meminfo, slabinfo, modules, partitions, devices ...] – System z specific device driver information: /proc/s390dbf (RHEL 4 only) or /sys/kernel/debug/s390dbf – Kernel messages /var/log/messages – Reads configuration files in directory /etc/ [ccwgroup.conf, modules.conf, fstab] – Uses several commands: ps, dmesg – Query setup scripts • lscss, lsdasd, lsqeth, lszfcp, lstape – And much more  z/VM information: – Release and service Level: q cplevel – Network setup: q [lan, nic, vswitch, v osa] – Storage setup: q [set, v dasd, v fcp, q pav ...] – Configuration/memory setup: q [stor, v stor, xstore, cpus...] – When the system runs as z/VM guest, ensure that the guest has the appropriate privilege class authorities to issue the commands 9 © 2011 IBM Corporation
  • 10. IBM Live Virtual Class – Linux on System z SADC/SAR  Capture Linux performance data with sadc/sar – CPU utilization – Disk I/O overview and on device level – Network I/O and errors on device level – Memory usage/Swapping – … and much more – Reports statistics data over time and creates average values for each item  SADC example (for more see man sadc) – System Activity Data Collector (sadc) --> data gatherer – /usr/lib64/sa/sadc [options] [interval [count]] [binary outfile] – /usr/lib64/sa/sadc 10 20 sadc_outfile 10 © 2011 IBM Corporation
  • 11. IBM Live Virtual Class – Linux on System z SADC/SAR (cont'd) – /usr/lib64/sa/sadc -d 10 sadc_outfile – -d option: statistics for disk – Should be started as a service during system start  SAR example (for more see man sar) – System Activity Report (sar) command --> reporting tool – sar -A – -A option: reports all the collected statistics – sar -A -f sadc_outfile > sar_outfile  Please include the binary sadc data and sar -A output when submitting SADC/SAR information to IBM support 11 © 2011 IBM Corporation
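Putting the two halves together, a hedged end-to-end example; the file names are illustrative, the commands and options are the ones shown above:

  /usr/lib64/sa/sadc -d 10 20 sadc_outfile   # gather: 20 samples, 10s apart, with disk stats
  sar -A -f sadc_outfile > sar_outfile       # report: all collected statistics
  # submit both the binary sadc_outfile and the textual sar_outfile to IBM support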
  • 12. IBM Live Virtual Class – Linux on System z CPU utilization – per-CPU values; watch out for: system time (kernel time), iowait time (slow I/O subsystem), steal time (time taken by other guests) 12 © 2011 IBM Corporation
  • 13. IBM Live Virtual Class – Linux on System z Disk I/O rates – read/write operations per I/O device (tps: transactions, rd/wr_secs: sectors); is your I/O balanced? Maybe you should stripe your LVs 13 © 2011 IBM Corporation
  • 14. IBM Live Virtual Class – Linux on System z Linux on System z dump tools  DASD dump tool – Writes dump directly on DASD partition – Uses s390 standalone dump format – ECKD and FBA DASDs supported – Single volume and multiple volume (for large systems) dump possible – Works in z/VM and in LPAR  SCSI dump tool – Writes dump into filesystem – Uses lkcd dump format – Works in z/VM and in LPAR  VMDUMP – Writes dump to VM spool space (VM reader) – z/VM specific dump format, dump must be converted – Only available when running under z/VM  Tape dump tool – Writes dump directly on ESCON/FICON tape device – Uses s390 standalone dump format 14 © 2011 IBM Corporation
  • 15. IBM Live Virtual Class – Linux on System z DASD dump tool – general usage 1. Format and partition dump device root@larsson:~> dasdfmt -f /dev/dasd<x> -b 4096 root@larsson:~> fdasd /dev/dasd<x> 2. Prepare dump device in Linux root@larsson:~> zipl -d /dev/dasd<x1> 3. Stop all CPUs 4. Store Status 5. IPL dump device 6. Copy dump to Linux root@larsson:~> zgetdump /dev/dasd<x1> > dump_file 15 © 2011 IBM Corporation
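Steps 1, 2 and 6 can be collected into one sketch; /dev/dasde is a hypothetical device node, and fdasd -a (auto-create a single partition) is an assumption for a non-interactive run:

  dasdfmt -f /dev/dasde -b 4096      # step 1: format with 4 KB blocks
  fdasd -a /dev/dasde                # step 1: create a single partition
  zipl -d /dev/dasde1                # step 2: install the standalone dump tool
  # steps 3-5 happen from the HMC/3270: stop all CPUs, store status, IPL the device
  zgetdump /dev/dasde1 > dump_file   # step 6: copy the dump into a file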
  • 16. IBM Live Virtual Class – Linux on System z DASD dump under z/VM  Prepare dump device under Linux, if possible in a 64-bit environment: root@larsson:~> zipl -d /dev/dasd<x1>  After Linux crash issue these commands on 3270 console: #cp cpu all stop #cp cpu 0 store status #cp i <dasd_devno>  Wait until dump is saved on device: 00: zIPL v1.6.0 dump tool (64 bit) 00: Dumping 64 bit OS 00: 00000087 / 00000700 MB [...] 00: Dump successful  Only disabled wait PSW on older distributions  Attach dump device to a Linux system with dump tools installed  Store dump to Linux file system from dump device (e.g. zgetdump) 16 © 2011 IBM Corporation
  • 17. IBM Live Virtual Class – Linux on System z DASD dump on LPAR (1/2) 17 © 2011 IBM Corporation
  • 18. IBM Live Virtual Class – Linux on System z DASD dump on LPAR (2/2) 18 © 2011 IBM Corporation
  • 19. IBM Live Virtual Class – Linux on System z Multi volume dump  zipl can now dump to multiple DASDs. It is now possible to dump system images which are larger than a single DASD. – You can specify up to 32 ECKD DASD partitions for a multi-volume dump  What are dumps good for? – Full snapshot of system state taken at any point in time (e.g. after a system has crashed, or of a running system) – Can be used to analyse system state beyond messages written to the syslog – Internal data structures not exported anywhere – Obtain messages which have not been written to the syslog due to a crash 19 © 2011 IBM Corporation
  • 20. IBM Live Virtual Class – Linux on System z Multi volume dump (cont'd)  How to prepare a set of ECKD DASD devices for a multi-volume dump? (64-bit systems only) – We use two DASDs in this example: root@larsson:~> dasdfmt -f /dev/dasdc -b 4096 root@larsson:~> dasdfmt -f /dev/dasdd -b 4096 – Create the partitions with fdasd. The sum of the partition sizes must be sufficiently large (the memory size + 10 MB): root@larsson:~> fdasd /dev/dasdc root@larsson:~> fdasd /dev/dasdd – Create a file called sample_dump_conf containing the device nodes (e.g. /dev/dasdc1) of the two partitions, separated by one or more line feed characters – Prepare the volumes using the zipl command. root@larsson:~> zipl -M sample_dump_conf [...] 20 © 2011 IBM Corporation
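A small sketch of the configuration-file step, using the partition names from the example above:

  cat > sample_dump_conf <<EOF
  /dev/dasdc1
  /dev/dasdd1
  EOF
  zipl -M sample_dump_conf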
  • 21. IBM Live Virtual Class – Linux on System z Multi volume dump (cont'd)  To obtain a dump with the multi-volume DASD dump tool, perform the following steps: – Stop all CPUs, store status on the IPL CPU. – IPL the dump tool using one of the prepared volumes, either 4711 or 4712. – After the dump tool is IPLed, you'll see a message that indicates the progress of the dump. Then you can IPL Linux again #cp cpu all stop #cp cpu 0 store status #cp ipl 4711  Copying a multi-volume dump to a file – Use zgetdump without any option to copy the dump parts to a file: root@larsson:~> zgetdump /dev/dasdc > mv_dump_file 21 © 2011 IBM Corporation
  • 22. IBM Live Virtual Class – Linux on System z Multi volume dump (cont'd)  Display information of the involved volumes: root@larsson:~> zgetdump -d /dev/dasdc '/dev/dasdc' is part of Version 1 multi-volume dump, which is spread along the following DASD volumes: 0.0.4711 (online, valid) 0.0.4712 (online, valid) [...]  Display information about the dump itself: root@larsson:~> zgetdump -i /dev/dasdc Dump device: /dev/dasdc >>> Dump header information <<< Dump created on: Fri Aug 7 15:12:41 2009 [...] Multi-volume dump: Disk 1 (of 2) Reading dump contents from 0.0.4711................................. Dump ended on: Fri Aug 7 15:12:52 2009 Dump End Marker found: this dump is valid. 22 © 2011 IBM Corporation
  • 23. IBM Live Virtual Class – Linux on System z SCSI dump tool – general usage 1. Create partition with PC-BIOS disk layout (fdisk) 2. Format partition with ext2 or ext3 filesystem 3. Install dump tool: – mount and prepare disk: root@larsson:~> mount /dev/sda1 /dumps root@larsson:~> zipl -D /dev/sda1 -t /dumps – Optional: /etc/zipl.conf: dumptofs=/dev/sda1 target=/dumps 4. Stop all CPUs 5. Store Status 6. IPL dump device  Dump tool creates dumps directly in the filesystem  SCSI dump supported in LPAR and, as of z/VM 5.4, under z/VM 23 © 2011 IBM Corporation
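A hedged consolidation of steps 1-3; /dev/sda1 and /dumps follow the slide, while the mke2fs invocation is an assumption for the ext3 formatting step:

  fdisk /dev/sda               # step 1: PC-BIOS style partition
  mke2fs -j /dev/sda1          # step 2: ext3 filesystem (assumption)
  mkdir -p /dumps
  mount /dev/sda1 /dumps       # step 3: mount the disk ...
  zipl -D /dev/sda1 -t /dumps  # ... and install the dump tool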
  • 24. IBM Live Virtual Class – Linux on System z SCSI dump under z/VM  SCSI dump from z/VM is supported as of z/VM 5.4  Issue SCSI dump #cp cpu all stop #cp cpu 0 store status #cp set dumpdev portname 47120763 00ce93a7 lun 47120000  00000000 bootprog 0 #cp ipl 4b49 dump  To access the dump, mount the dump partition 24 © 2011 IBM Corporation
  • 25. IBM Live Virtual Class – Linux on System z SCSI dump on LPAR  Select CPC image for LPAR to dump  Go to the Load panel  Issue SCSI dump – FCP device – WWPN – LUN 25 © 2011 IBM Corporation
  • 26. IBM Live Virtual Class – Linux on System z VMDUMP  The only method to dump NSSes or DCSSes under z/VM  Works nondisruptively  Create dump: #cp vmdump to cmsguest  Receive dump: – Store the dump from the reader into CMS dump file: #cp dumpload – Transfer the dump from CMS to Linux, e.g. via FTP – NEW: vmur device driver: root@larsson:~> vmur rec <spoolid> vmdump  Linux tool to convert vmdump to lkcd format: root@larsson:~> vmconvert vmdump linux.dump  Problem: dump process is relatively slow 26 © 2011 IBM Corporation
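A sketch of the receive-and-convert path using the vmur device driver; the spool ID 1234 is illustrative:

  vmur list                    # locate the spool ID of the VMDUMP file in the reader
  vmur rec 1234 vmdump         # receive it into the local file 'vmdump'
  vmconvert vmdump linux.dump  # convert to lkcd format for analysis with lcrash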
  • 27. IBM Live Virtual Class – Linux on System z How to obtain information about a dump  Display information of the involved volume: root@larsson:~> zgetdump -d /dev/dasdb '/dev/dasdb' is Version 0 dump device. Dump size limit: none  Display information about the dump itself: root@larsson:~> zgetdump -i /dev/dasdb1 Dump device: /dev/dasdb1 Dump created on: Thu Oct 8 15:44:49 2009 Magic number: 0xa8190173618f23fd Version number: 3 Header size: 4096 Page size: 4096 Dumped memory: 1073741824 Dumped pages: 262144 Real memory: 1073741824 cpu id: 0xff00012320978000 System Arch: s390x (ESAME) Build Arch: s390x (ESAME) >>> End of Dump header <<< Dump ended on: Thu Oct 8 15:45:01 2009 Dump End Marker found: this dump is valid. 27 © 2011 IBM Corporation
  • 28. IBM Live Virtual Class – Linux on System z How to obtain information about a dump (cont'd)  To display information about the dump itself (dump header) and check whether the dump is valid, use lcrash with options '-i' and '-d': root@larsson:~> lcrash -i -d /dev/dasdb1 Dump Type: s390 standalone dump Machine: s390x (ESAME) CPU ID: 0xff00012320978000 Memory Start: 0x0 Memory End: 0x40000000 Memory Size: 1073741824 Time of dump: Thu Oct 8 15:44:49 2009 Number of pages: 262144 Kernel page size: 4096 Version number: 3 Magic number: 0xa8190173618f23fd Dump header size: 4096 Dump level: 0x4 Build arch: s390x (ESAME) Time of dump end: Thu Oct 8 15:45:01 2009 End Marker found! Dump is valid! 28 © 2011 IBM Corporation
  • 29. IBM Live Virtual Class – Linux on System z Automatic dump on panic (SLES 10/11, RHEL 5/6): dumpconf  The dumpconf tool configures a dump device that is used for automatic dump in case of a kernel panic. – The command can be installed as service script under /etc/init.d/dumpconf or can be called manually. – Start service: # service dumpconf start – It reads the configuration file /etc/sysconfig/dumpconf. – Example configuration for CCW dump device (DASD) and reipl after dump: ON_PANIC=dump_reipl   DUMP_TYPE=ccw DEVICE=0.0.4711 29 © 2011 IBM Corporation
  • 30. IBM Live Virtual Class – Linux on System z Automatic dump on panic (SLES 10/11, RHEL 5): dumpconf (cont'd) – Example configuration for FCP dump device (SCSI disk): ON_PANIC=dump  DUMP_TYPE=fcp DEVICE=0.0.4714 WWPN=0x5005076303004712  LUN=0x4047401300000000 BOOTPROG=0 BR_LBA=0 – Example configuration for re-IPL without taking a dump, if a kernel panic occurs: ON_PANIC=reipl – Example of executing a CP command, and rebooting from device 4711 if a kernel panic occurs: ON_PANIC=vmcmd     VMCMD_1="MSG <vmguest> Starting VMDUMP"  VMCMD_2="VMDUMP" VMCMD_3="IPL 4711" 30 © 2011 IBM Corporation
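To make the configuration take effect at every boot, something along these lines should work on init-script based distributions; the chkconfig call is an assumption, only the service script itself is from the slides:

  chkconfig dumpconf on          # register the init script at boot (assumption)
  service dumpconf start         # activate the configuration now
  cat /etc/sysconfig/dumpconf    # verify what is active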
  • 31. IBM Live Virtual Class – Linux on System z Get dump and send it to service organization  DASD/Tape: – Store dump to Linux file system from dump device: root@larsson:~> zgetdump /dev/<device node> > dump_file – Alternative: lcrash (compression possible) root@larsson:~> lcrash -d /dev/dasdxx -s <dir>  SCSI: – Get dump from filesystem  Additional files needed for dump analysis: – SUSE (lcrash tool): /boot/System.map-xxx and /boot/Kerntypes-xxx – Red Hat & SUSE (crash tool): vmlinux file (kernel with debug info) contained in debug kernel rpms: • Red Hat: kernel-debuginfo-xxx.rpm and kernel-debuginfo-common-xxx.rpm • SUSE: kernel-default-debuginfo-xxx.rpm 31 © 2011 IBM Corporation
  • 32. IBM Live Virtual Class – Linux on System z Handling large dumps  Compress the dump and split it into parts of 1 GB root@larsson:~> zgetdump /dev/dasdc1 | gzip | split -b 1G  Several compressed files such as xaa, xab, xac, ... are created  Create md5 sums of the compressed files root@larsson:~> md5sum xa* > dump.md5  Upload all parts together with the md5 information  Verification of the parts for a receiver root@larsson:~> md5sum -c dump.md5 xaa: OK [...]  Merge the parts and uncompress the dump root@larsson:~> cat xa* | gunzip -c > dump 32 © 2011 IBM Corporation
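The same steps as one hedged wrapper, adding an explicit dump_part_ prefix (an assumption; the slide's split produced xaa, xab, ...):

  zgetdump /dev/dasdc1 | gzip | split -b 1G - dump_part_   # sender: compress and split
  md5sum dump_part_* > dump.md5                            # sender: checksums
  md5sum -c dump.md5                                       # receiver: verify
  cat dump_part_* | gunzip -c > dump                       # receiver: merge and uncompress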
  • 33. IBM Live Virtual Class – Linux on System z Transferring dumps  Transferring single volume dumps with ssh root@larsson:~>  zgetdump /dev/dasdc1 | ssh user@host "cat >  dump_file_on_target_host"   Transferring multi-volume dumps with ssh root@larsson:~>  zgetdump /dev/dasdc | ssh user@host "cat >  multi_volume_dump_file_on_target_host"  Transferring a dump with ftp – Establish an ftp session with the target host, login and set the transfer mode to binary – Send the dump to the host root@larsson:~>  ftp> put |"zgetdump /dev/dasdc1"  <dump_file_on_target_host> 33 © 2011 IBM Corporation
  • 34. IBM Live Virtual Class – Linux on System z Dump tool summary – DASD (standalone tool): environment VM & LPAR; preparation: zipl -d /dev/<dump_dev>; creation: stop CPU & store status, ipl <dump_dev_CUU>; dump medium: ECKD or FBA; copy to filesystem: zgetdump /dev/<dump_dev> > dump_file; viewing: lcrash or crash – Tape (standalone tool): environment VM & LPAR; preparation: zipl -d /dev/<dump_dev>; creation: stop CPU & store status, ipl <dump_dev_CUU>; dump medium: tape cartridges; copy to filesystem: zgetdump /dev/<dump_dev> > dump_file; viewing: lcrash or crash – SCSI (standalone tool): environment VM & LPAR; preparation: mkdir /dumps/mydumps, zipl -D /dev/sda1 ...; creation: stop CPU & store status, ipl <dump_dev_CUU>; dump medium: Linux file system on a SCSI disk; copy to filesystem: --- (already a file, transfer e.g. via ftp); viewing: lcrash or crash – VMDUMP: environment VM; preparation: ---; creation: vmdump; dump medium: VM reader; copy to filesystem: Dumpload, vmconvert ...; viewing: lcrash or crash See “Using the dump tools” book on http://www.ibm.com/developerworks/linux/linux390/documentation_dev.html 34 © 2011 IBM Corporation
  • 35. IBM Live Virtual Class – Linux on System z Customer Cases 35 © 2011 IBM Corporation
  • 36. IBM Live Virtual Class – Linux on System z Availability: Guest spontaneously reboots  Configuration: – HA cluster: Oracle RAC server or other HA solution under z/VM (two Linux guests, each running an Oracle RAC server, communicating with a shared Oracle RAC database)  Problem Description: – Occasionally guests spontaneously reboot without any notification or console message  Tools used for problem determination: – cp instruction trace of (re)IPL code – Crash dump taken after trace was hit 36 © 2011 IBM Corporation
  • 37. IBM Live Virtual Class – Linux on System z Availability: Guest spontaneously reboots - Steps to find root cause  Question: Who rebooted the system?  Step 1 – Find out address of (re)ipl code in the system map – Use this address to set instruction trace cd /boot grep machine_restart System.map-2.6.16.60-0.54.5-default 000000000010c364 T machine_restart 00000000001171c8 t do_machine_restart 0000000000603200 D _machine_restart 37 © 2011 IBM Corporation
  • 38. IBM Live Virtual Class – Linux on System z Availability: Guest spontaneously reboots - Steps to find root cause (cont'd)  Step 2 – Set CP instruction trace on the reboot address – System is halted at that address when a reboot is triggered CP CPU ALL TR IN R 10C364.4 HCPTRI1027I An active trace set has turned RUN off CP Q TR NAME INITIAL (ACTIVE) 1 INSTR PSWA 0010C364-0010C367 TERM NOPRINT NORUN SIM SKIP 00000 PASS 00000 STOP 00000 STEP 00000 CMD NONE -> 000000000010C364' STMF EBCFF0780024 >> 000000003A557D48 CC 2 38 © 2011 IBM Corporation
  • 39. IBM Live Virtual Class – Linux on System z Availability: Guest spontaneously reboots - Steps to find root cause (cont'd)  Step 3 – Take a dump when the (re)ipl code is hit cp cpu all stop cp store status Store complete. cp i 4fc6 Tracing active at IPL HCPGSP2630I The virtual machine is placed in CP mode due to a SIGP stop and store status from CPU 00. zIPL v1.6.3-0.24.5 dump tool (64bit) Dumping 64 bit OS 00000128 / 00001024 MB ...... 00001024 / 00001024 MB Dump successful HCPIR450W CP entered, disabled wait PSW 00020000 80000000 00000000 00000000 39 © 2011 IBM Corporation
  • 40. IBM Live Virtual Class – Linux on System z Availability: Guest spontaneously reboots - Steps to find root cause (cont'd)  Step 4 – Save dump in a file zgetdump /dev/dasdb1 > dump_file Dump device: /dev/dasdb1 >>> Dump header information <<< Dump created on: Wed Oct 27 12:00:40 2010 Magic number: 0xa8190173618f23fd Version number: 4 Header size: 4096 Page size: 4096 Dumped memory: 1073741824 Dumped pages: 262144 Real memory: 1073741824 cpu id: 0xff00012320948000 System Arch: s390x (ESAME) Build Arch: s390x (ESAME) >>> End of Dump header <<< Reading dump content ................................ Dump ended on: Wed Oct 27 12:00:52 2010 Dump End Marker found: this dump is valid. 40 © 2011 IBM Corporation
  • 41. IBM Live Virtual Class – Linux on System z Availability: Guest spontaneously reboots - Steps to find root cause (cont'd)  Step 5 – Use (l)crash to find out which process has triggered the reboot  STACK: 0 start_kernel+950 [0x6a690e] 1 _stext+32 [0x100020] ================================================================ TASK HAS CPU (1): 0x3f720650 (oprocd.bin):  LOWCORE INFO: -psw : 0x0704200180000000 0x000000000010c36a -function : machine_restart+6 -prefix : 0x3f438000 -cpu timer: 0x7fffffff 0xff9e6c00 -clock cmp: 0x00c6ca69 0x22337760 -general registers: <snip> /var/opt/oracle/product/crs/bin/oprocd.bin  STACK: 0 __handle_sysrq+248 [0x361240] 1 write_sysrq_trigger+98 [0x2be796] 2 sys_write+392 [0x225a68] 3 sysc_noemu+16 [0x1179a8] 41 © 2011 IBM Corporation
  • 42. IBM Live Virtual Class – Linux on System z Availability: Guest spontaneously reboots (cont'd)  Problem Origin: – HA component erroneously detected a system hang • hangcheck_timer module did not receive timer IRQ • z/VM 'time bomb' switch • TSA monitor  z/VM cannot guarantee 'real-time' behavior if overloaded – Longest 'hang' observed: 37 seconds(!)  Solution: – Offload HA workload from overloaded z/VM • e.g. use separate z/VM • or: run large Oracle RAC guests in LPAR 42 © 2011 IBM Corporation
  • 43. IBM Live Virtual Class – Linux on System z Network: network connection is too slow  Configuration: – z/VSE running CICS, connection to DB2 in Linux on System z – Hipersocket connection from Linux to z/VSE – But also applies to hipersocket connections between Linux and z/OS  Problem Description: – When CICS transactions were monitored, some transactions took a couple of seconds instead of milliseconds  Tools used for problem determination: – dbginfo.sh – s390 debug feature – sadc/sar – CICS transaction monitor 43 © 2011 IBM Corporation
  • 44. IBM Live Virtual Class – Linux on System z Network: network connection is too slow (cont'd)  s390 debug feature – Check for qeth errors: cat /sys/kernel/debug/s390dbf/qeth_qerr 00 01282632346:099575 2 - 00 0000000180b20218 71 6f 75 74 65 72 72 00 | qouterr. 00 01282632346:099575 2 - 00 0000000180b20298 20 46 31 35 3d 31 30 00 | F15=10. 00 01282632346:099576 2 - 00 0000000180b20318 20 46 31 34 3d 30 30 00 | F14=00. 00 01282632346:099576 2 - 00 0000000180b20390 20 71 65 72 72 3d 41 46 | qerr=AF 00 01282632346:099576 2 - 00 0000000180b20408 20 73 65 72 72 3d 32 00 | serr=2.  dbginfo file – Check for buffer count: cat /sys/devices/qeth/0.0.1e00/buffer_count 16  Problem Origin: – Too few inbound buffers 44 © 2011 IBM Corporation
  • 45. IBM Live Virtual Class – Linux on System z Network: network connection is too slow (cont'd)  Solution: – Increase inbound buffer count (default: 16, max 128) – Check actual buffer count with 'lsqeth -p' – Set the inbound buffer count in the appropriate config file: • SUSE SLES10: - in /etc/sysconfig/hardware/hwcfg-qeth-bus-ccw-0.0.F200 - add QETH_OPTIONS="buffer_count=128" • SUSE SLES11: - in /etc/udev/rules.d/51-qeth-0.0.f200.rules add ACTION=="add", SUBSYSTEM=="ccwgroup", KERNEL=="0.0.f200", ATTR{buffer_count}="128" • Red Hat: - in /etc/sysconfig/network-scripts/ifcfg-eth0 - add OPTIONS="buffer_count=128" 45 © 2011 IBM Corporation
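The buffer count can also be changed on a running system through sysfs, a hedged sketch; the device must be offline while the attribute is written, and 0.0.f200 is the example device from the slide:

  echo 0   > /sys/devices/qeth/0.0.f200/online        # take the interface offline
  echo 128 > /sys/devices/qeth/0.0.f200/buffer_count  # raise inbound buffers to the maximum
  echo 1   > /sys/devices/qeth/0.0.f200/online        # bring it back online
  lsqeth -p                                           # confirm the new buffer count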
  • 46. IBM Live Virtual Class – Linux on System z High Disk response times  Configuration: – z10, HDS Storage Server (Hyper PAV enabled) – z/VM, Linux with Oracle Database – VM controlled Minidisks attached to Linux, LVM on top  Problem description: – I/O throughput not matching expectations – Oracle Database shows poor performance because of that – One LVM volume showing significant stress  Tools used for problem determination: – dbginfo.sh – sadc/sar – z/VM Monitor data 46 © 2011 IBM Corporation
  • 47. IBM Live Virtual Class – Linux on System z High Disk response times (cont'd) Observation in Linux: dm-9 0.00 0.00 49.75 0.00 19790.50 0.00 795.56 17.89 15.79 2.01 100.00 (slide callouts mark the response time, throughput and utilization columns) Conclusion  PAV not being utilized  No Hyper PAV support in SLES10 SP2  Static PAV not possible with current setup (VM controlled minidisks)  Need to look for other ways for more parallel I/O – Link same minidisk multiple times to a guest – Use smaller minidisks and increase striping in Linux 47 © 2011 IBM Corporation
  • 48. IBM Live Virtual Class – Linux on System z High Disk response times (cont'd) Initial and proposed setup [diagram: physical disk → logical disk(s) in VM → logical disk(s) in Linux, combined by LVM / device mapper]: 1. Initial setup 2. Link minidisks to guest multiple times (multipath) 3. Smaller disks, more stripes (striping) 48 © 2011 IBM Corporation
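For proposal 3 above, a hedged LVM sketch; device nodes, volume names and the stripe size are illustrative, only the striping idea itself is from the slides:

  pvcreate /dev/dasdp1 /dev/dasdq1 /dev/dasdr1 /dev/dasds1
  vgcreate oradata_vg /dev/dasdp1 /dev/dasdq1 /dev/dasdr1 /dev/dasds1
  lvcreate -i 4 -I 64 -L 20G -n oradata_lv oradata_vg   # 4 stripes, 64 KB stripe size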
  • 49. IBM Live Virtual Class – Linux on System z High Disk response times (cont'd) New Observation in Linux  Response times stay equal  Throughput equal  No PAV being used!! 49 © 2011 IBM Corporation
  • 50. IBM Live Virtual Class – Linux on System z High Disk response times (cont'd) Solution: check PAV setup in VM 50 © 2011 IBM Corporation
  • 51. IBM Live Virtual Class – Linux on System z High Disk response times (cont'd) Finally:
dasdaen  133.33  0.00  278.11  0.00  12298.51  0.00   88.44  0.79   2.83  1.48  41.29
dasdcbt  161.19  0.00  248.26  0.00  12260.70  0.00   98.77  0.91   3.47  1.88  46.77
dasdfwc  149.75  0.00  266.17  0.00  12374.13  0.00   92.98  1.88   7.07  2.54  67.66
dasdael  162.19  0.00  250.25  0.00  12483.58  0.00   99.77  1.90   7.57  2.86  71.64
dasddyz  134.83  0.00  277.61  0.00  12431.84  0.00   89.56  0.75   2.71  1.68  46.77
dasdaem  151.24  0.00  266.17  0.00  12595.02  0.00   94.64  2.01   7.61  2.82  75.12
dasdcbr  169.65  0.00  242.79  0.00  12386.07  0.00  102.03  1.72   7.05  2.83  68.66
dasdfwd  162.69  0.00  249.25  0.00  12348.26  0.00   99.08  1.92   7.70  2.83  70.65
dasddyy  157.21  0.00  259.70  0.00  12409.95  0.00   95.57  2.58   9.96  3.05  79.10
dasddyx  174.63  0.00  237.81  0.00  12374.13  0.00  104.07  1.76   7.38  2.93  69.65
dasdcbs  144.78  0.00  272.14  0.00  12264.68  0.00   90.14  2.53   9.31  2.89  78.61
dasda      0.00  0.00    0.00  1.00      0.00  3.98    8.00  0.01  10.00  5.00   0.50
dasdq      0.00  0.00    0.00  0.00      0.00  0.00    0.00  0.00   0.00  0.00   0.00
dasdss     0.00  0.00    0.00  0.00      0.00  0.00    0.00  0.00   0.00  0.00   0.00
dasdadx    0.00  0.00    0.00  0.00      0.00  0.00    0.00  0.
dasdawh    0.00  0.00    0.00  0.00      0.00  0.00    0.00  0.00   0.00  0.00   0.00
dasdamk    0.00  0.00    0.00  0.00      0.00  0.00    0.00  0.00   0.00  0.00   0.00
dasdaek  160.70  0.00  255.22  0.00  12382.09  0.00   97.03  2.27   8.95  2.88  73.63
dasdcbq  148.76  0.00  265.67  0.00  12372.14  0.00   93.14  2.14   8.01  2.85  75.62
dasddyw  162.19  0.00  254.23  0.00  12384.08  0.00   97.42  2.12   8.40  2.90  73.63
dasdfwe  146.27  0.00  271.64  0.00  12419.90  0.00   91.44  2.63   9.71  2.80  76.12
dasdfwf  162.19  0.00  249.75  0.00  12455.72  0.00   99.75  0.71   2.83  1.79  44.78
dasdb      0.00  0.00    0.00  0.00      0.00  0.00    0.00  0.00   0.00  0.00   0.00
dm-0       0.00  0.00 1646.77  0.00  49494.53  0.00   60.11  5.08   3.04  0.36  59.70
dm-1       0.00  0.00 1665.17  0.00  49482.59  0.00   59.43 15.00   9.04  0.56  93.53
dm-2       0.00  0.00 1660.70  0.00  49432.84  0.00   59.53 13.46   8.11  0.55  90.55
dm-3       0.00  0.00 1647.26  0.00  49490.55  0.00   60.09 12.05   7.32  0.53  87.56
dm-4       0.00  0.00 1646.77  0.00  49494.53  0.00   60.11  5.08   3.04  0.36  59.70
dm-5       0.00  0.00 1665.17  0.00  49482.59  0.00   59.43 15.00   9.04  0.56  93.53
dm-6       0.00  0.00 1660.70  0.00  49432.84  0.00   59.53 13.46   8.11  0.55  90.55
dm-7       0.00  0.00 1647.26  0.00  49490.55  0.00   60.09 12.06   7.32  0.53  87.56
dm-8       0.00  0.00    0.00  1.99      0.00  7.96    8.00  0.00   0.00  0.00   0.00
dm-9       0.00  0.00  497.51  0.00 197900.50  0.00  795.56  7.89  15.79  2.01 100.00
51 © 2011 IBM Corporation
  • 52. IBM Live Virtual Class – Linux on System z Bonding throughput not matching expectations  Configuration: – SLES10 system, connected via OSA card and using bonding driver  Problem Description: – Bonding only working at 100 Mb/s – FTP also slow  Tools used for problem determination: – dbginfo.sh, netperf  Problem Origin: – ethtool cannot determine line speed correctly because qeth does not report it  Solution: – Ignore the 100 Mb/s message – upgrade to SLES11 Kernel message: bonding: bond1: Warning: failed to get speed and duplex from eth0, assumed to be 100Mb/sec and Full 52 © 2011 IBM Corporation
  • 53. IBM Live Virtual Class – Linux on System z Availability: Unable to mount file system after LVM changes  Configuration: – Linux HA cluster with two nodes – Accessing the same DASDs (dasda-f, combined into one logical volume), which are exported via OCFS2  Problem Description: – Added one node to cluster, brought logical volume online – Unable to mount the filesystem from any node after that  Tools used for problem determination: – dbginfo.sh 53 © 2011 IBM Corporation
  • 54. IBM Live Virtual Class – Linux on System z Availability: Unable to mount file system after LVM changes (cont'd)  Problem Origin: – LVM metadata was overwritten when the 3rd node was added ({pv|vg|lv}create run against disks already in use) – e.g. superblock not found  Solution: – Extract metadata from a running node (/etc/lvm/backup) and write it to disk again 54 © 2011 IBM Corporation
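The restore step could look like this, assuming LVM's automatic metadata backup in /etc/lvm/backup; vg_ocfs2 is a hypothetical volume group name:

  vgcfgrestore -f /etc/lvm/backup/vg_ocfs2 vg_ocfs2   # rewrite metadata from the backup
  vgchange -ay vg_ocfs2                               # reactivate the volume group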
  • 55. IBM Live Virtual Class – Linux on System z Kernel panic: Low address protection  Configuration: – z10 only – High workload – The more multithreaded applications are running, the more likely it occurs  Problem Description: – Concurrent access to pages to be removed from the page table  Tools used for problem determination: – crash/lcrash  Problem Origin: – Race condition in memory management in firmware  Solution: – Upgrade to latest firmware! – Upgrade to latest kernels – fix to be integrated in all supported distributions 55 © 2011 IBM Corporation