Linux on System z – performance update
       zLG12             Mario Held




©2012 IBM Corporation
Trademarks
      IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
      Business Machines Corp., registered in many jurisdictions worldwide. Other product and
      service names might be trademarks of IBM or other companies. A current list of IBM
      trademarks is available on the Web at “Copyright and trademark information” at
      www.ibm.com/legal/copytrade.shtml.


      Linux is a registered trademark of Linus Torvalds in the United States, other countries, or
      both.


      Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc.
      in the United States, other countries, or both.


      Other product and service names might be trademarks of IBM or other companies.




©2012 IBM Corporation                                                                               2
Agenda
         zEnterprise EC12 design
         Linux performance comparison zEC12 and
          z196
         Performance improvements in other areas
                   Java JRE 1.7.0




©2012 IBM Corporation                               3
zEC12 – Under the covers




              (Figure: the zEC12 under the covers – internal components)

©2012 IBM Corporation            4
zEC12 Continues the Mainframe Heritage
        (Chart data: clock frequency in MHz by machine generation)

        Year    Machine    Clock frequency
        1997    G4           300 MHz
        1998    G5           420 MHz
        1999    G6           550 MHz
        2000    z900         770 MHz
        2003    z990         1.2 GHz
        2005    z9 EC        1.7 GHz
        2008    z10 EC       4.4 GHz
        2010    z196         5.2 GHz
        2012    zEC12        5.5 GHz
©2012 IBM Corporation                                                               5
The evolution of mainframe generations




                                      (Figure: timeline of mainframe generations, ending with the zEC12)




©2012 IBM Corporation                         6
Processor Design Basics
       CPU (core)
               Cycle time
               Pipeline, execution order
               Branch prediction
               Hardware versus millicode
       Memory subsystem
               High speed buffers (caches)
                On chip, on book
                Private, shared
                Coherency required
               Buses
                Number
                Bandwidth
               Limits
                Distance + speed of light
                Space

       (Diagram: generic hierarchy example – Memory, a shared cache, and a private cache
        in front of each CPU: CPU, CPU, …, CPU)
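A rough way to see this hierarchy from software is a pointer-chasing loop over working sets of growing size. The sketch below is hypothetical (not part of the presentation – the sizes, iteration count and build command are arbitrary choices), but the measured nanoseconds per access typically step up at each cache-level boundary.

```c
/*
 * Hypothetical sketch: pointer-chasing latency test. As the working set
 * outgrows the L1, L2, L3 and L4 caches, nanoseconds per access step up.
 * Build e.g. with: gcc -O2 chase.c -o chase (older glibc may need -lrt).
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    for (size_t size = 32 * 1024; size <= 512UL * 1024 * 1024; size *= 2) {
        size_t n = size / sizeof(size_t);
        size_t *chain = malloc(n * sizeof(size_t));
        size_t *order = malloc(n * sizeof(size_t));
        if (!chain || !order) {
            free(chain);
            free(order);
            break;
        }

        /* random cyclic permutation so hardware prefetching cannot hide latency */
        for (size_t i = 0; i < n; i++)
            order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i < n; i++)
            chain[order[i]] = order[(i + 1) % n];

        size_t idx = 0;
        const size_t accesses = 20 * 1000 * 1000;
        double start = now_sec();
        for (size_t i = 0; i < accesses; i++)
            idx = chain[idx];                   /* dependent loads: latency bound */
        double elapsed = now_sec() - start;

        printf("%9zu KiB working set: %6.2f ns/access (checksum %zu)\n",
               size / 1024, elapsed / accesses * 1e9, idx);
        free(order);
        free(chain);
    }
    return 0;
}
```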



©2012 IBM Corporation                                                              7
zEC12 vs. z196 Hardware Comparison
       z196
             CPU
               5.2 GHz
               Out-of-Order execution
             Caches
               L1 private 64k instr, 128k data
               L2 private 1.5 MiB
               L3 shared 24 MiB per chip
               L4 shared 192 MiB per book
             (Diagram: per book – Memory, shared L4 cache, one shared L3 cache per chip,
              private L2 and L1 per CPU, CPU 1 … CPU 4 per chip)

       zEC12
             CPU
               5.5 GHz
               Improved Out-of-Order execution
             Caches
               L1 private 64k instr, 96k data
               L1+ 1 MiB (acts as second level data cache)
               L2 private 1 MiB (acts as second level instruction cache)
               L3 shared 48 MiB per chip
               L4 shared 2 x 192 MiB => 384 MiB per book
             (Diagram: per book – Memory, shared L4 cache, one shared L3 cache per chip,
              private L2, L1+ and L1 per CPU, CPU 1 … CPU 6 per chip)
©2012 IBM Corporation                                                                                        8
Agenda
       zEnterprise zEC12 design
       Linux performance comparison zEC12 and
        z196
       Performance improvements in other areas
                  Java JRE 1.7.0




©2012 IBM Corporation                             9
zEC12 vs z196 comparison Environment

           Hardware
                           zEC12                 2827-708 H66 with pre-GA
                                                  microcode, pre-GA hardware
                           z196                  2817-766 M66
                           (z10                  2097-726 E26)
           Linux distribution with recent kernel
                           SLES11 SP2: 3.0.13-0.27-default
                           Linux in LPAR
                           Shared processors
                           Other LPARs deactivated



©2012 IBM Corporation                                                          10
File server benchmark description
       Dbench 3
               Emulation of Netbench benchmark
               Generates file system load on the Linux VFS
               Issues the same I/O calls as the smbd server in Samba
                (without networking calls)
               Mixed file operations workload for each process: create,
                write, read, append, delete
               Measures throughput of transferred data
       Configuration
               2 GiB memory, mainly memory operations
               Scaling processors 1, 2, 4, 8, 16
               For each processor configuration scaling
                processes 1, 4, 8, 12, 16, 20, 26, 32, 40
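For orientation, a minimal sketch of the kind of per-process operation mix dbench drives is shown below. It is hypothetical (not the IBM-internal driver used for the measurements); the file name, the 30-second duration and the 64 KiB buffer are arbitrary, and most error handling is omitted.

```c
/*
 * Hypothetical sketch of a dbench-style worker process: repeat a mix of
 * create, write, append, read and delete operations on the Linux VFS
 * and report the data rate.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUFSZ (64 * 1024)

int main(void)
{
    static char buf[BUFSZ];
    memset(buf, 'x', sizeof(buf));
    long long bytes = 0;
    const time_t end = time(NULL) + 30;        /* run for 30 seconds */

    while (time(NULL) < end) {
        /* create + write */
        int fd = open("dbench_worker.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0600);
        if (fd < 0)
            return 1;
        bytes += write(fd, buf, sizeof(buf));
        close(fd);

        /* append */
        fd = open("dbench_worker.tmp", O_WRONLY | O_APPEND);
        bytes += write(fd, buf, sizeof(buf) / 2);
        close(fd);

        /* read everything back */
        fd = open("dbench_worker.tmp", O_RDONLY);
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            bytes += n;
        close(fd);

        /* delete */
        unlink("dbench_worker.tmp");
    }
    printf("throughput: %.1f MiB/s\n", bytes / 30.0 / (1024 * 1024));
    return 0;
}
```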



©2012 IBM Corporation                                                      11
Dbench3 (IBM internal driver)
        Throughput improves by 40 percent in this
         scaling experiment comparing z196 to z10

                                 (Chart: Dbench throughput versus number of processes (4, 8, 12, 16, 20, 26, 32, 40)
                                  for 4, 8 and 16 CPUs on z10 and z196)


©2012 IBM Corporation                                                                12
Dbench3
      Throughput improves by 38 to 68 percent in this
       scaling experiment comparing zEC12 to z196

                                     (Chart: Dbench throughput versus number of processes (1, 4, 8, 12, 16, 20, 26, 32, 40)
                                      for 1, 2, 4, 8 and 16 CPUs on z196 and zEC12)


©2012 IBM Corporation                                                                   13
Kernel benchmark description
         Lmbench 3
                 Suite of operating system micro-benchmarks
                 Focuses on interactions between the operating system
                  and the hardware architecture
                 Latency measurements for process handling and
                  communication
                 Latency measurements for basic system calls
                 Bandwidth measurements for memory and file access,
                  operations and movement
                 Configuration
                  2 GB memory
                  4 processors
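As an illustration of what a "simple syscall" latency number means, the sketch below times a tight loop around a cheap system call. It is not lmbench itself; getppid() and the iteration count are arbitrary choices.

```c
/*
 * Hypothetical sketch of a "simple syscall" latency measurement in the
 * spirit of lmbench: time a tight loop around a cheap system call.
 * Older glibc may need -lrt for clock_gettime().
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long iterations = 10 * 1000 * 1000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        getppid();                      /* cheap system call, result ignored */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec);
    printf("simple syscall: %.1f ns per call\n", ns / iterations);
    return 0;
}
```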




©2012 IBM Corporation                                                    14
Lmbench3
      Most benefits in L3 and L4 cache, overall +40%
     Measured operation                                                Deviation z196 to z10 in %
     simple syscall                                                                  -30
     simple read/write                                                                 0
     select of file descriptors                                                       35
     signal handler                                                                  -22
     process fork                                                                     25
     libc bcopy aligned L1 / L2 / L3 / L4 cache / main memory            0 / 20 / 100 / 300 / n/a
     libc bcopy unaligned L1 / L2 / L3 / L4 cache / main memory             15 / 0 / 0 / 40 / n/a
     memory bzero L1 / L2 / L3 / L4 cache / main memory                  35 / 90 / 300 / 800 / n/a
     memory partial read L1 / L2 / L3 / L4 cache / main memory           45 / 25 / 130 / 500 / n/a
     memory partial read/write L1 / L2 / L3 / L4 cache / main memory      15 / 15 / 10 / 120 / n/a
     memory partial write L1 / L2 / L3 / L4 cache / main memory           80 / 30 / 60 / 300 / n/a
     memory read L1 / L2 / L3 / L4 cache / main memory                    10 / 30 / 40 / 300 / n/a
     memory write L1 / L2 / L3 / L4 cache / main memory                   50 / 30 / 30 / 180 / n/a
     Mmap read L1 / L2 / L3 / L4 cache / main memory                      50 / 35 / 85 / 300 / n/a
     Mmap read open2close L1 / L2 / L3 / L4 cache / main memory           40 / 35 / 50 / 200 / n/a
     Read L1 / L2 / L3 / L4 cache / main memory                           20 / 40 / 90 / 300 / n/a
     Read open2close L1 / L2 / L3 / L4 cache / main memory                25 / 35 / 90 / 300 / n/a
      Unrolled bcopy unaligned L1 / L2 / L3 / L4 cache / main memory          100 / 75 / 75 / 200 / n/a
      Unrolled partial bcopy unaligned L1 / L2 / L3 / L4 cache / main memory   70 / 0 / 80 / 300 / n/a
     mappings                                                                         40

©2012 IBM Corporation                                                                                15
Lmbench3
       Benefits seen in almost all operations

      Measured operation                                                     Deviation zEC12 to z196 in %
      simple syscall                                                                         52
       simple read/write                                                                      46 / 43
      select of file descriptors                                                             32
      signal handler                                                                         55
      process fork                                                                           25
      libc bcopy aligned L1 / L2 / L3 / L4 cache / main memory                    0 / 12 / 25 / 10 / n/a
      libc bcopy unaligned L1 / L2 / L3 / L4 cache / main memory                   0 / 26 / 25 / 35 / n/a
      memory bzero L1 / L2 / L3 / L4 cache / main memory                          40 / 13 / 20 / 45 / n/a
      memory partial read L1 / L2 / L3 / L4 cache / main memory                 -10 / 25 / 45 / 105 / n/a
      memory partial read/write L1 / L2 / L3 / L4 cache / main memory            75 / 75 / 90 / 180 / n/a
      memory partial write L1 / L2 / L3 / L4 cache / main memory                 45 / 50 / 62 / 165 / n/a
      memory read L1 / L2 / L3 / L4 cache / main memory                           5 / 10 / 45 / 120 / n/a
      memory write L1 / L2 / L3 / L4 cache / main memory                        80 / 92 / 120 / 250 / n/a
      Mmap read L1 / L2 / L3 / L4 cache / main memory                             0 / 13 / 35 / 110 / n/a
      Mmap read open2close L1 / L2 / L3 / L4 cache / main memory                  23 / 18 / 19 / 55 / n/a
      Read L1 / L2 / L3 / L4 cache / main memory                                  60 / 30 / 35 / 50 / n/a
      Read open2close L1 / L2 / L3 / L4 cache / main memory                       27 / 30 / 35 / 60 / n/a
      Unrolled bcopy unaligned L1 / L2 / L3 / L4 cache / main memory              35 / 28 / 60 / 35 / n/a
      Unrolled partial bcopy unaligned L1 / L2 / L3 / L4 cache / main memory      35 / 13 / 45 / 20 / n/a
      mappings                                                                             34-41

©2012 IBM Corporation                                                                                       16
Java benchmark description
      Java server benchmark
              Evaluates the performance of server side Java
              Exercises
               Java Virtual Machine (JVM)
               Just-In-Time compiler (JIT)
               Garbage collection
               Multiple threads
               Simulates real-world applications including XML processing or
                floating point operations
              Can be used to measure performance of processors,
               memory hierarchy and scalability


      Configurations
              8 processors, 2 GiB memory, 1 JVM

©2012 IBM Corporation                                                           17
Java benchmark
        Business operation throughput improved by
         approximately 44%
                        IBM J9 JRE 1.6.0 SR9 64-bit
                        8 processors, 2 GiB memory, 1 JVM

                        (Chart: SLES11-SP1 results – throughput versus number of warehouses (1 to 20)
                         for z10 and z196)


©2012 IBM Corporation                                                                                         18
Java benchmark
       Business operation throughput improved by
        approximately 65%
             IBM J9 JRE 1.6.0 SR9 64-bit
             8 processors, 2 GiB memory, 1 JVM
       Results seen with a single LPAR active on the machine
       On a fully utilized machine we expect approximately 30%
               (Chart: SLES11-SP2 results – throughput versus number of warehouses (1 to 20)
                for z196 and zEC12)

©2012 IBM Corporation                                                                                             19
CPU-intense benchmark suite
     Stressing a system's processor, memory
      subsystem and compiler
     Workloads developed from real user applications
     Exercising integer and floating point in C, C++,
      and Fortran programs
     Can be used to evaluate compile options
     Can be used to optimize the compiler's code
      generation for a given target system
     Configuration
                1 processor, 2 GiB memory, executing one test
                 case at a time
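A hypothetical test case could look like the sketch below (it is not part of the suite used here). Building it once per target option and timing the resulting binaries is the kind of compile-option evaluation described above; the -march values in the comment are the ones quoted on the following result charts.

```c
/*
 * Hypothetical stand-in for one CPU-bound test case: a naive matrix
 * multiply that is sensitive to compiler code generation and cache
 * behaviour. To compare target options, build once per option, e.g.
 *     gcc -O3 -march=z10  kernel.c -o kernel_z10
 *     gcc -O3 -march=z196 kernel.c -o kernel_z196
 * and time both binaries.
 */
#include <stdio.h>

#define N 1024

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    /* deterministic input pattern */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (i + j) % 7;
            b[i][j] = (i * j) % 5;
        }

    /* the hot loop: integer indexing plus floating point multiply-add */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];

    printf("checksum: %f\n", c[N - 1][N - 1]);
    return 0;
}
```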


©2012 IBM Corporation                                            20
Single-threaded, compute-intense workload
     Linux: Internal driver (kernel 2.6.29) gcc 4.5, glibc 2.9.3
                 Integer suite improves by 76% (geometric mean)
                 Floating Point suite improves by 86% (geometric mean)

      (Charts: improvements in percent per test case – integer cases (testcase 1 to 12, 0 to 120 percent)
       and floating point cases (testcase 1 to 17, 0 to 160 percent), z196 (march=z196) versus z10 (march=z10))



©2012 IBM Corporation                                                                                                            21
Single-threaded, compute-intense workload
     SLES11 SP2 GA, gcc-4.3-62.198, glibc-2.11.3-17.31.1 using
      default machine optimization options as in gcc-4.3 s390x
                    Integer suite improves by 28% (geometric mean)
                    Floating Point suite improves by 31% (geometric mean)

      (Charts: improvements in percent per test case – integer cases (testcase 1 to 12, 0 to 45 percent)
       and floating point cases (testcase 1 to 17, -20 to 120 percent), zEC12 versus z196 (march=z9-109 mtune=z10))




©2012 IBM Corporation                                                                                                          22
Benchmark description – Network
       Network Benchmark which simulates several workloads
       Transactional Workloads
              2 types
               RR – A connection to the server is opened once for a 5 minute time frame
               CRR – A connection is opened and closed for every request/response
              4 sizes
               RR 1x1 – Simulating low latency keepalives
               RR 200x1000 – Simulating online transactions
               RR 200x32k – Simulating database query
               CRR 64x8k – Simulating website access
       Streaming Workloads – 2 types
              STRP/STRG – Simulating incoming/outgoing large file transfers
               (20mx20)
       All tests are done with 1, 10 and 50 simultaneous connections
        All of that across multiple connection types (different cards and
        MTU configurations)
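The RR cases boil down to a request/response ping-pong over a single connection. The sketch below is a hypothetical client, not the workload driver used here: the peer address and port are placeholders, the 200-byte request / 1000-byte response split is an assumed reading of "RR 200x1000", and the 5-minute interval mirrors the RR description above.

```c
/*
 * Hypothetical RR (request/response) client sketch. Peer address and
 * port are placeholders; request/response sizes follow the assumed
 * "RR 200x1000" pattern.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port = htons(12345) };   /* placeholder port */
    inet_pton(AF_INET, "10.0.0.1", &srv.sin_addr);           /* placeholder peer */
    if (sock < 0 || connect(sock, (struct sockaddr *)&srv, sizeof(srv)) < 0)
        return 1;

    char request[200], response[1000];
    memset(request, 'r', sizeof(request));
    long transactions = 0;
    const time_t end = time(NULL) + 300;                      /* 5 minute time frame */

    while (time(NULL) < end) {
        if (write(sock, request, sizeof(request)) != (ssize_t)sizeof(request))
            break;
        size_t got = 0;
        while (got < sizeof(response)) {                      /* wait for the full response */
            ssize_t n = read(sock, response + got, sizeof(response) - got);
            if (n <= 0)
                goto out;
            got += (size_t)n;
        }
        transactions++;
    }
out:
    close(sock);
    printf("%.1f transactions per second\n", transactions / 300.0);
    return 0;
}
```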


©2012 IBM Corporation                                                                      23
AWM Hipersockets MTU-32k IPv4
            LPAR-LPAR
             More transactions / throughput with 1, 10 and 50 connections
             More data transferred at 20 to 30 percent lower processor consumption

             (Charts: deviation zEC12 versus z196 in percent, roughly 0 to 60 percent –
              RR/CRR transactions per second for rr1x1, rr200x1000, rr200x32k and crr64x8k
              with 1, 10 and 50 connections, and STREAM throughput for strg and strp
              with 1, 10 and 50 connections)

©2012 IBM Corporation                                                                                                                                                                                        24
Benchmark description – Re-Aim 7
       Scalability benchmark Re-Aim-7
             Open Source equivalent to the AIM Multiuser benchmark
             Workload patterns describe system call ratios (patterns can be more IPC-, disk-
              or calculation-intensive)
             The benchmark then scales concurrent jobs until the overall throughput drops
               Starts with one job, continuously increases that number
               Overall throughput usually increases until #threads ≈ #CPUs
               Then threads are further increased until a drop in throughput occurs
               Scales up to thousands of concurrent threads stressing the same components
             Often a good check for non-scaling interfaces
               Some interfaces don't scale at all (1 Job throughput ≈ multiple jobs throughput,
                despite >1 CPUs)
               Some interfaces only scale in certain ranges (throughput suddenly drops earlier than
                expected)
             Measures the number of jobs per minute that a single thread and all threads together
              can achieve
       Our Setup
             2, 8, 16 CPUs, 4 GiB memory, scaling until overall performance drops
             Using a journaled file system on an xpram device (stresses the file system code without
              being I/O bound)
             Using fserver, new-db and compute workload patterns
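The scaling loop described above can be pictured with the sketch below. It is illustrative only, not Re-Aim-7: the synthetic "job", the 10-second measurement interval and the doubling of the process count are arbitrary simplifications (Re-Aim increases the job count in much finer steps).

```c
/*
 * Illustrative sketch of the Re-Aim scaling idea: run an increasing
 * number of concurrent worker processes, let each one count the "jobs"
 * it finishes in a fixed interval, and stop once the aggregate
 * jobs-per-minute figure drops.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define INTERVAL 10                         /* seconds per measurement */

static long run_worker(void)
{
    long jobs = 0;
    time_t end = time(NULL) + INTERVAL;
    while (time(NULL) < end) {
        volatile double x = 1.0;
        for (int i = 0; i < 100000; i++)    /* one synthetic "job" */
            x = x * 1.000001 + 0.5;
        jobs++;
    }
    return jobs;
}

static double measure(int nproc)
{
    int fds[2];
    if (pipe(fds) < 0)
        return 0.0;
    for (int p = 0; p < nproc; p++) {
        if (fork() == 0) {
            long jobs = run_worker();
            write(fds[1], &jobs, sizeof(jobs));   /* report result to parent */
            _exit(0);
        }
    }
    long total = 0;
    for (int p = 0; p < nproc; p++) {
        long jobs = 0;
        if (read(fds[0], &jobs, sizeof(jobs)) == (ssize_t)sizeof(jobs))
            total += jobs;
    }
    while (wait(NULL) > 0)
        ;
    close(fds[0]);
    close(fds[1]);
    return total * 60.0 / INTERVAL;               /* jobs per minute */
}

int main(void)
{
    double best = 0.0;
    for (int nproc = 1; nproc <= 4096; nproc *= 2) {
        double jpm = measure(nproc);
        printf("%4d processes: %.0f jobs/min\n", nproc, jpm);
        if (jpm < best)                           /* overall throughput dropped */
            break;
        best = jpm;
    }
    return 0;
}
```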
©2012 IBM Corporation                                                                                25
Re-Aim Fserver
      Higher throughput with 4, 8, and 16 CPUs (25 to 50
       percent) at 30 percent lower processor consumption

              (Chart: Reaim fserver profile, 16 CPUs – throughput in jobs per minute versus
               number of processes, scaling from 1 to several thousand, for z196 and zEC12)


©2012 IBM Corporation                                                                                                                                      26
Re-Aim Newdb
      Higher throughput with 4, 8, and 16 CPUs (42 to 66
       percent) at 35 percent lower processor consumption

              (Chart: Reaim NEWDB profile, 16 CPUs – throughput in jobs per minute versus
               number of processes, scaling from 1 to about 3300, for z196 and zEC12)


©2012 IBM Corporation                                                                                                                                                27
Re-Aim Compute
       Higher throughput with 4, 8, and 16 CPUs (25 to 45 percent)
       at 20 to 30 percent lower processor consumption
              (Chart: Reaim compute profile, 16 CPUs – throughput in jobs per minute versus
               number of processes, scaling from 1 to about 1800, for z196 and zEC12)


©2012 IBM Corporation                                                                                                                                                                   28
DB2 database workload
        Benchmark: complex database warehouse application running on DB2 V10.1
         Upgrading from z196 to zEC12 provides
              Improvements of throughput by 30.4 percent
              Reduction of processor load
         Another 50.2 percent performance improvement is seen when comparing
          z196 to z10

              (Chart: database warehouse performance, normalized transactional throughput
               per CPU – z196 at 100 percent, zEC12 at approximately 130 percent)

©2012 IBM Corporation                                                                              29
Agenda

         zEnterprise zEC12 design
         Linux performance comparison zEC12 and
          z196
         Performance improvements in other areas
                   Java JRE 1.7.0




©2012 IBM Corporation                               30
Java – JRE 1.6.0 SR9 vs. JRE 1.7.0 SR1
       Business operation throughput improved by 29%
                   2 GiB memory, 8 CPUs, 1 JVM, only the Java version substituted
                JRE 1.6.0 IBM J9 2.4 SR9 20110624_85526 (JIT enabled, AOT enabled)
                JRE 1.7.0 IBM J9 2.6 SR1 20120322_106209 (JIT enabled, AOT enabled)
       Similar improvements seen over the last years when upgrading to newer Java
        versions
                  Some software products are bundled with a particular Java version
                   In this case the software product needs an upgrade to benefit from the improved
                    performance
                                           (Chart: Java benchmark – throughput versus number of warehouses (1 to 20) on zEC12,
                                            JRE 1.6.0 IBM J9 SR9 versus JRE 1.7.0 IBM J9 SR1)


©2012 IBM Corporation                                                                                                     31
Summary
    Tremendous performance gains
           Performance improvements seen in close to all areas measured so far
           Often combined with reduced processor consumption
           More improvement than can be expected from the higher clock rate alone
            Clock rate is up from 5.2 GHz to 5.5 GHz, which is close to 6 percent higher
            New cache setup with much bigger caches
            Second-generation out-of-order execution
            Better branch prediction
    Some exemplary performance gains with Linux workloads
           About 30 to 67 percent for Java
           Up to 30 percent for the complex database workload
           Up to 31 percent for single-threaded, CPU-intense workloads
           About 38 to 68 percent when scaling processors and/or processes
    The performance team will measure more scenarios with intense disk
     and network access once an exclusive zEC12 GA measurement
     environment with the required I/O options becomes available
    No new zEC12 instructions are exploited yet because no machine-optimized
     GCC is available in a supported distribution yet

©2012 IBM Corporation                                                                  32
Questions

         Further information is located at
                   Linux on System z – Tuning hints and tips
                    http://www.ibm.com/developerworks/linux/linux390/perf/index.html

                   Live Virtual Classes for z/VM and Linux
                    http://www.vm.ibm.com/education/lvc/




                                                 Mario Held
                                                 Linux on System z
                                                 System Software
                                                 Performance Engineer

                                                 IBM Deutschland Research & Development
                                                 Schoenaicher Strasse 220
                                                 71032 Boeblingen, Germany
                                                 Phone +49 (0)7031–16–4257
                                                 Email mario.held@de.ibm.com




©2012 IBM Corporation                                                                                   33

Más contenido relacionado

La actualidad más candente

iMinds The Conference: Jan Lemeire
iMinds The Conference: Jan LemeireiMinds The Conference: Jan Lemeire
iMinds The Conference: Jan Lemeireimec
 
Avnet & Rorke Data - Open Compute Summit '13
Avnet & Rorke Data - Open Compute Summit '13Avnet & Rorke Data - Open Compute Summit '13
Avnet & Rorke Data - Open Compute Summit '13DaWane Wanek
 
Hardware assisted Virtualization in Embedded
Hardware assisted Virtualization in EmbeddedHardware assisted Virtualization in Embedded
Hardware assisted Virtualization in EmbeddedThe Linux Foundation
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsShinya Takamaeda-Y
 
Windows server 2008 r2
Windows server 2008 r2Windows server 2008 r2
Windows server 2008 r2PTS-Ian
 
Bottlenecks, Bottlenecks, and more Bottlenecks: Lessons Learned from 2 Years ...
Bottlenecks, Bottlenecks, and more Bottlenecks: Lessons Learned from 2 Years ...Bottlenecks, Bottlenecks, and more Bottlenecks: Lessons Learned from 2 Years ...
Bottlenecks, Bottlenecks, and more Bottlenecks: Lessons Learned from 2 Years ...Enkitec
 
MSI N480GTX Lightning Infokit
MSI N480GTX Lightning InfokitMSI N480GTX Lightning Infokit
MSI N480GTX Lightning InfokitMSI
 
Poser pro reference manual
Poser pro reference manualPoser pro reference manual
Poser pro reference manualSykrayo
 
Erlang factory slides
Erlang factory slidesErlang factory slides
Erlang factory slidesNoah Linden
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009Randall Hand
 

La actualidad más candente (17)

iMinds The Conference: Jan Lemeire
iMinds The Conference: Jan LemeireiMinds The Conference: Jan Lemeire
iMinds The Conference: Jan Lemeire
 
Seven deadly
Seven deadly Seven deadly
Seven deadly
 
Vigor Ex
Vigor ExVigor Ex
Vigor Ex
 
Avnet & Rorke Data - Open Compute Summit '13
Avnet & Rorke Data - Open Compute Summit '13Avnet & Rorke Data - Open Compute Summit '13
Avnet & Rorke Data - Open Compute Summit '13
 
Brochure NAS LG
Brochure NAS LGBrochure NAS LG
Brochure NAS LG
 
Hardware assisted Virtualization in Embedded
Hardware assisted Virtualization in EmbeddedHardware assisted Virtualization in Embedded
Hardware assisted Virtualization in Embedded
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
 
Implement Checkpointing for Android (ELCE2012)
Implement Checkpointing for Android (ELCE2012)Implement Checkpointing for Android (ELCE2012)
Implement Checkpointing for Android (ELCE2012)
 
Implement Checkpointing for Android
Implement Checkpointing for AndroidImplement Checkpointing for Android
Implement Checkpointing for Android
 
Windows server 2008 r2
Windows server 2008 r2Windows server 2008 r2
Windows server 2008 r2
 
Bottlenecks, Bottlenecks, and more Bottlenecks: Lessons Learned from 2 Years ...
Bottlenecks, Bottlenecks, and more Bottlenecks: Lessons Learned from 2 Years ...Bottlenecks, Bottlenecks, and more Bottlenecks: Lessons Learned from 2 Years ...
Bottlenecks, Bottlenecks, and more Bottlenecks: Lessons Learned from 2 Years ...
 
MSI N480GTX Lightning Infokit
MSI N480GTX Lightning InfokitMSI N480GTX Lightning Infokit
MSI N480GTX Lightning Infokit
 
Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11Nvidia Cuda Apps Jun27 11
Nvidia Cuda Apps Jun27 11
 
Flash memory
Flash memoryFlash memory
Flash memory
 
Poser pro reference manual
Poser pro reference manualPoser pro reference manual
Poser pro reference manual
 
Erlang factory slides
Erlang factory slidesErlang factory slides
Erlang factory slides
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
 

Similar a Linux on System z – performance update

Sun sparc enterprise t5120 and t5220 servers technical presentation
Sun sparc enterprise t5120 and t5220 servers technical presentationSun sparc enterprise t5120 and t5220 servers technical presentation
Sun sparc enterprise t5120 and t5220 servers technical presentationxKinAnx
 
MIPS Assembly Language I
MIPS Assembly Language IMIPS Assembly Language I
MIPS Assembly Language ILiEdo
 
Sun sparc enterprise t5140 and t5240 servers technical presentation
Sun sparc enterprise t5140 and t5240 servers technical presentationSun sparc enterprise t5140 and t5240 servers technical presentation
Sun sparc enterprise t5140 and t5240 servers technical presentationxKinAnx
 
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be Slow
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be SlowELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be Slow
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be SlowBenjamin Zores
 
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationSun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationxKinAnx
 
2013 02 08 annunci power 7 plus sito cta
2013 02 08 annunci power 7 plus sito cta2013 02 08 annunci power 7 plus sito cta
2013 02 08 annunci power 7 plus sito ctaLorenzo Corbetta
 
Trend - HPC-29mai2012
Trend - HPC-29mai2012Trend - HPC-29mai2012
Trend - HPC-29mai2012Agora Group
 
I3 multicore processor
I3 multicore processorI3 multicore processor
I3 multicore processorAmol Barewar
 
IBM System x3850 X5 Technical Presenation abbrv.
IBM System x3850 X5 Technical Presenation abbrv.IBM System x3850 X5 Technical Presenation abbrv.
IBM System x3850 X5 Technical Presenation abbrv.meye0611
 
Porting Android ABS 2011
Porting Android ABS 2011Porting Android ABS 2011
Porting Android ABS 2011Opersys inc.
 
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU ArchitecturesGPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architecturesinside-BigData.com
 
Application optimization with POWER 7
Application optimization with POWER 7Application optimization with POWER 7
Application optimization with POWER 7COMMON Europe
 
Core 2 processors
Core 2 processorsCore 2 processors
Core 2 processorsArun Kumar
 
Processors with Advanced Technologies
Processors with Advanced TechnologiesProcessors with Advanced Technologies
Processors with Advanced TechnologiesCherukuriGopikrishna
 
Processor powerpoint
Processor powerpointProcessor powerpoint
Processor powerpointbrennan_jame
 
Hadoop on a personal supercomputer
Hadoop on a personal supercomputerHadoop on a personal supercomputer
Hadoop on a personal supercomputerPaul Dingman
 

Similar a Linux on System z – performance update (20)

Sun sparc enterprise t5120 and t5220 servers technical presentation
Sun sparc enterprise t5120 and t5220 servers technical presentationSun sparc enterprise t5120 and t5220 servers technical presentation
Sun sparc enterprise t5120 and t5220 servers technical presentation
 
MIPS Assembly Language I
MIPS Assembly Language IMIPS Assembly Language I
MIPS Assembly Language I
 
Sun sparc enterprise t5140 and t5240 servers technical presentation
Sun sparc enterprise t5140 and t5240 servers technical presentationSun sparc enterprise t5140 and t5240 servers technical presentation
Sun sparc enterprise t5140 and t5240 servers technical presentation
 
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be Slow
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be SlowELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be Slow
ELCE 2011 - BZ - Embedded Linux Optimization Techniques - How Not To Be Slow
 
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationSun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentation
 
2013 02 08 annunci power 7 plus sito cta
2013 02 08 annunci power 7 plus sito cta2013 02 08 annunci power 7 plus sito cta
2013 02 08 annunci power 7 plus sito cta
 
Trend - HPC-29mai2012
Trend - HPC-29mai2012Trend - HPC-29mai2012
Trend - HPC-29mai2012
 
I3 multicore processor
I3 multicore processorI3 multicore processor
I3 multicore processor
 
I3
I3I3
I3
 
IBM System x3850 X5 Technical Presenation abbrv.
IBM System x3850 X5 Technical Presenation abbrv.IBM System x3850 X5 Technical Presenation abbrv.
IBM System x3850 X5 Technical Presenation abbrv.
 
Porting Android
Porting AndroidPorting Android
Porting Android
 
Porting Android
Porting AndroidPorting Android
Porting Android
 
Porting Android ABS 2011
Porting Android ABS 2011Porting Android ABS 2011
Porting Android ABS 2011
 
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU ArchitecturesGPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architectures
 
Application optimization with POWER 7
Application optimization with POWER 7Application optimization with POWER 7
Application optimization with POWER 7
 
Core 2 processors
Core 2 processorsCore 2 processors
Core 2 processors
 
Processors with Advanced Technologies
Processors with Advanced TechnologiesProcessors with Advanced Technologies
Processors with Advanced Technologies
 
Fpga technology
Fpga technologyFpga technology
Fpga technology
 
Processor powerpoint
Processor powerpointProcessor powerpoint
Processor powerpoint
 
Hadoop on a personal supercomputer
Hadoop on a personal supercomputerHadoop on a personal supercomputer
Hadoop on a personal supercomputer
 

Más de IBM India Smarter Computing

Using the IBM XIV Storage System in OpenStack Cloud Environments
Using the IBM XIV Storage System in OpenStack Cloud Environments Using the IBM XIV Storage System in OpenStack Cloud Environments
Using the IBM XIV Storage System in OpenStack Cloud Environments IBM India Smarter Computing
 
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...IBM India Smarter Computing
 
A Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceA Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceIBM India Smarter Computing
 
IBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM India Smarter Computing
 

Más de IBM India Smarter Computing (20)

Using the IBM XIV Storage System in OpenStack Cloud Environments
Using the IBM XIV Storage System in OpenStack Cloud Environments Using the IBM XIV Storage System in OpenStack Cloud Environments
Using the IBM XIV Storage System in OpenStack Cloud Environments
 
All-flash Needs End to End Storage Efficiency
All-flash Needs End to End Storage EfficiencyAll-flash Needs End to End Storage Efficiency
All-flash Needs End to End Storage Efficiency
 
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
TSL03104USEN Exploring VMware vSphere Storage API for Array Integration on th...
 
IBM FlashSystem 840 Product Guide
IBM FlashSystem 840 Product GuideIBM FlashSystem 840 Product Guide
IBM FlashSystem 840 Product Guide
 
IBM System x3250 M5
IBM System x3250 M5IBM System x3250 M5
IBM System x3250 M5
 
IBM NeXtScale nx360 M4
IBM NeXtScale nx360 M4IBM NeXtScale nx360 M4
IBM NeXtScale nx360 M4
 
IBM System x3650 M4 HD
IBM System x3650 M4 HDIBM System x3650 M4 HD
IBM System x3650 M4 HD
 
IBM System x3300 M4
IBM System x3300 M4IBM System x3300 M4
IBM System x3300 M4
 
IBM System x iDataPlex dx360 M4
IBM System x iDataPlex dx360 M4IBM System x iDataPlex dx360 M4
IBM System x iDataPlex dx360 M4
 
IBM System x3500 M4
IBM System x3500 M4IBM System x3500 M4
IBM System x3500 M4
 
IBM System x3550 M4
IBM System x3550 M4IBM System x3550 M4
IBM System x3550 M4
 
IBM System x3650 M4
IBM System x3650 M4IBM System x3650 M4
IBM System x3650 M4
 
IBM System x3500 M3
IBM System x3500 M3IBM System x3500 M3
IBM System x3500 M3
 
IBM System x3400 M3
IBM System x3400 M3IBM System x3400 M3
IBM System x3400 M3
 
IBM System x3250 M3
IBM System x3250 M3IBM System x3250 M3
IBM System x3250 M3
 
IBM System x3200 M3
IBM System x3200 M3IBM System x3200 M3
IBM System x3200 M3
 
IBM PowerVC Introduction and Configuration
IBM PowerVC Introduction and ConfigurationIBM PowerVC Introduction and Configuration
IBM PowerVC Introduction and Configuration
 
A Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization PerformanceA Comparison of PowerVM and Vmware Virtualization Performance
A Comparison of PowerVM and Vmware Virtualization Performance
 
IBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architectureIBM pureflex system and vmware vcloud enterprise suite reference architecture
IBM pureflex system and vmware vcloud enterprise suite reference architecture
 
X6: The sixth generation of EXA Technology
X6: The sixth generation of EXA TechnologyX6: The sixth generation of EXA Technology
X6: The sixth generation of EXA Technology
 

Linux on System z – performance update

  • 1. Linux on System z – performance update zLG12 Mario Held ©2012 IBM Corporation
  • 2. Trademarks IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Other product and service names might be trademarks of IBM or other companies. ©2012 IBM Corporation 2
  • 3. Agenda  zEnterprise EC12 design  Linux performance comparison zEC12 and z196  Performance improvements in other areas  Java JRE 1.7.0 ©2012 IBM Corporation 3
  • 4. zEC12 – Under the covers zEC12) ©2012 IBM Corporation 4
  • 5. zEC12 Continues the Mainframe Heritage 5.5 5.2 GHz 6000 GHz 4.4 5000 GHz 4000 3000 1.7 MHz GHz 1.2 2000 770 GHz 550 MHz 300 420 MHz MHz MHz 1000 0 1997 1998 1999 2000 2003 2005 2008 2010 2012 G4 G5 G6 z900 z990 Z9 EC Z10 EC z196 zEC12 ©2012 IBM Corporation 5
  • 6. The evolution of mainframe generations zEC12 ©2012 IBM Corporation 6
  • 7. Processor Design Basics CPU (core)  Cycle time  Pipeline, execution order Generic Hierarchy example  Branch prediction  Hardware versus millicode  Memory subsystem Memory  High speed buffers (caches)  On chip, on book Shared Cache  Private, shared  Coherency required Private Private Private  Buses Cache Cache Cache  Number  Bandwidth CPU CPU … CPU  Limits  Distance + speed of light  Space ©2012 IBM Corporation 7
  • 8. zEC12 vs. z196 Hardware Comparison  z196  CPU Memory  5.2 Ghz  Out-of-Order execution L4 Cache  Caches  L1 private 64k instr, 128k data L3 Cache … L3 Cache  L2 private 1.5 MiB L2 L2 L2 L2  L3 shared 24 MiB per chip  L4 shared 192 MiB per book L1 … L1 L1 … L1 CPU 1 CPU 4 CPU 1 CPU 4  zEC12  CPU  5.5 GHz Memory  Improved Out-of-Order execution  Caches L4 Cache  L1 private 64k instr, 96k data  L1+ 1 MiB (acts as second level data cache)  L2 private 1 MiB (acts as second instruction L3 Cache … L3 Cache cache) L2 L2 L2 L2  L3 shared 48 MiB per chip L1+ … L1+ L1 L1+ … L1+  L4 shared 2 x 192 MiB => 384 MiB per book L1 CPU 1 CPU 6 L1 CPU 1 L1 CPU 6 ©2012 IBM Corporation 8
  • 9. Agenda  zEnterprise zEC12 design  Linux performance comparison zEC12 and z196  Performance improvements in other areas  Java JRE 1.7.0 ©2012 IBM Corporation 9
  • 10. zEC12 vs z196 comparison Environment  Hardware  zEC12 2827-708 H66 with pre-GA microcode, pre-GA hardware  z196 2817-766 M66  (z10 2097-726 E26)  Linux distribution with recent kernel  SLES11 SP2: 3.0.13-0.27-default  Linux in LPAR  Shared processors  Other LPARs deactivated Source: If applicable, describe source origin ©2012 IBM Corporation 10
  • 11. File server benchmark description  Dbench 3  Emulation of Netbench benchmark  Generates file system load on the Linux VFS  Does the same I/O calls like the smbd server in Samba (without networking calls)  Mixed file operations workload for each process: create, write, read, append, delete  Measures throughput of transferred data  Configuration  2 GiB memory, mainly memory operations  Scaling processors 1, 2, 4, 8, 16  For each processor configuration scaling processes 1, 4, 8, 12, 16, 20, 26, 32, 40 ©2012 IBM Corporation 11
• 12. Dbench3 (IBM internal driver)  Throughput improves by 40 percent in this scaling experiment comparing z196 to z10  [Chart: Dbench throughput versus number of processes (4 to 40) for 4, 8 and 16 CPUs on z10 and z196] ©2012 IBM Corporation 12
• 13. Dbench3  Throughput improves by 38 to 68 percent in this scaling experiment comparing zEC12 to z196  [Chart: Dbench throughput versus number of processes (1 to 40) for 1, 2, 4, 8 and 16 CPUs on z196 and zEC12] ©2012 IBM Corporation 13
  • 14. Kernel benchmark description  Lmbench 3  Suite of operating system micro-benchmarks  Focuses on interactions between the operating system and the hardware architecture  Latency measurements for process handling and communication  Latency measurements for basic system calls  Bandwidth measurements for memory and file access, operations and movement  Configuration  2 GB memory  4 processors ©2012 IBM Corporation 14
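The flavour of these micro-benchmarks is easy to reproduce. The sketch below times a cheap system call in a loop, similar in spirit to lmbench's "simple syscall" latency; the iteration count is arbitrary and the numbers will not match lmbench's exact methodology.

/* Hypothetical build: gcc -O2 -std=gnu99 -o syscall_lat syscall_lat.c -lrt */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
        const long iterations = 5 * 1000 * 1000;
        struct timespec start, end;
        pid_t accum = 0;                /* keep the calls from being elided */

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (long i = 0; i < iterations; i++)
                accum += getppid();     /* minimal kernel round trip */
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                    (end.tv_nsec - start.tv_nsec);
        printf("simple syscall: %.1f ns per call (ignore %ld)\n",
               ns / iterations, (long)accum);
        return 0;
}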
• 15. Lmbench3  Most benefits in L3 and L4 cache, overall +40%  Deviation z196 to z10 in percent; where five values are given they refer to L1 / L2 / L3 / L4 cache / main memory:
  simple syscall: -30
  simple read/write: 0
  select of file descriptors: 35
  signal handler: -22
  process fork: 25
  libc bcopy aligned: 0 / 20 / 100 / 300 / n/a
  libc bcopy unaligned: 15 / 0 / 0 / 40 / n/a
  memory bzero: 35 / 90 / 300 / 800 / n/a
  memory partial read: 45 / 25 / 130 / 500 / n/a
  memory partial read/write: 15 / 15 / 10 / 120 / n/a
  memory partial write: 80 / 30 / 60 / 300 / n/a
  memory read: 10 / 30 / 40 / 300 / n/a
  memory write: 50 / 30 / 30 / 180 / n/a
  Mmap read: 50 / 35 / 85 / 300 / n/a
  Mmap read open2close: 40 / 35 / 50 / 200 / n/a
  Read: 20 / 40 / 90 / 300 / n/a
  Read open2close: 25 / 35 / 90 / 300 / n/a
  Unrolled bcopy unaligned: 100 / 75 / 75 / 200 / n/a
  Unrolled partial bcopy unaligned: 70 / 0 / 80 / 300 / n/a
  mappings: 40
  ©2012 IBM Corporation 15
• 16. Lmbench3  Benefits seen in most of the measured operations  Deviation zEC12 to z196 in percent; where five values are given they refer to L1 / L2 / L3 / L4 cache / main memory:
  simple syscall: 52
  simple read/write: 46 / 43
  select of file descriptors: 32
  signal handler: 55
  process fork: 25
  libc bcopy aligned: 0 / 12 / 25 / 10 / n/a
  libc bcopy unaligned: 0 / 26 / 25 / 35 / n/a
  memory bzero: 40 / 13 / 20 / 45 / n/a
  memory partial read: -10 / 25 / 45 / 105 / n/a
  memory partial read/write: 75 / 75 / 90 / 180 / n/a
  memory partial write: 45 / 50 / 62 / 165 / n/a
  memory read: 5 / 10 / 45 / 120 / n/a
  memory write: 80 / 92 / 120 / 250 / n/a
  Mmap read: 0 / 13 / 35 / 110 / n/a
  Mmap read open2close: 23 / 18 / 19 / 55 / n/a
  Read: 60 / 30 / 35 / 50 / n/a
  Read open2close: 27 / 30 / 35 / 60 / n/a
  Unrolled bcopy unaligned: 35 / 28 / 60 / 35 / n/a
  Unrolled partial bcopy unaligned: 35 / 13 / 45 / 20 / n/a
  mappings: 34-41
  ©2012 IBM Corporation 16
  • 17. Java benchmark description  Java server benchmark  Evaluates the performance of server side Java  Exercises  Java Virtual Machine (JVM)  Just-In-Time compiler (JIT)  Garbage collection  Multiple threads  Simulates real-world applications including XML processing or floating point operations  Can be used to measure performance of processors, memory hierarchy and scalability  Configurations  8 processors, 2 GiB memory, 1 JVM ©2012 IBM Corporation 17
• 18. Java benchmark  Business operation throughput improved by approximately 44%  IBM J9 JRE 1.6.0 SR9 64-bit  8 processors, 2 GiB memory, 1 JVM  SLES11-SP1 results  [Chart: throughput versus number of warehouses (1 to 20) for z10 and z196] ©2012 IBM Corporation 18
• 19. Java benchmark  Business operation throughput improved by approximately 65%  IBM J9 JRE 1.6.0 SR9 64-bit  8 processors, 2 GiB memory, 1 JVM  Results seen with a single LPAR active on the machine  On a fully utilized machine we expect approximately 30%  SLES11-SP2 results  [Chart: throughput versus number of warehouses (1 to 20) for z196 and zEC12] ©2012 IBM Corporation 19
  • 20. CPU-intense benchmark suite  Stressing a system's processor, memory subsystem and compiler  Workloads developed from real user applications  Exercising integer and floating point in C, C++, and Fortran programs  Can be used to evaluate compile options  Can be used to optimize the compiler's code generation for a given target system  Configuration  1 processor, 2 GiB memory, executing one test case at a time ©2012 IBM Corporation 20
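Since the suite is also used to evaluate compile options, the small floating-point kernel below illustrates the idea: compile the same C source with different machine options (for example the march/mtune settings that appear on the following result slides) and compare run times. The kernel, the matrix size and the command lines are assumptions for illustration; the real suite consists of much larger applications developed from real user code.

/*
 * Hypothetical comparison builds (require a gcc that knows the target level):
 *   gcc -O3 -std=gnu99 -march=z10  -o fpkern_z10  fpkern.c
 *   gcc -O3 -std=gnu99 -march=z196 -o fpkern_z196 fpkern.c
 * then time both binaries on the same machine.
 */
#include <stdio.h>

#define N 1024

static double a[N][N], b[N][N], c[N][N];

int main(void)
{
        for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                        a[i][j] = (i + j) % 7;
                        b[i][j] = (i * j) % 13;
                        c[i][j] = 0.0;
                }

        /* naive matrix multiply: plenty of floating point and memory traffic */
        for (int i = 0; i < N; i++)
                for (int k = 0; k < N; k++)
                        for (int j = 0; j < N; j++)
                                c[i][j] += a[i][k] * b[k][j];

        printf("checksum: %f\n", c[N / 2][N / 2]);
        return 0;
}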
• 21. Single-threaded, compute-intense workload  Linux: internal driver (kernel 2.6.29), gcc 4.5, glibc 2.9.3  Integer suite improves by 76% (geometric mean)  Floating point suite improves by 86% (geometric mean)  [Charts: per-test-case improvements in percent for 12 integer and 17 floating-point cases, z196 (march=z196) versus z10 (march=z10)] ©2012 IBM Corporation 21
• 22. Single-threaded, compute-intense workload  SLES11 SP2 GA, gcc-4.3-62.198, glibc-2.11.3-17.31.1, using the default machine optimization options of gcc-4.3 on s390x  Integer suite improves by 28% (geometric mean)  Floating point suite improves by 31% (geometric mean)  [Charts: per-test-case improvements in percent for 12 integer and 17 floating-point cases, zEC12 versus z196 (march=z9-109 mtune=z10)] ©2012 IBM Corporation 22
• 23. Benchmark description – Network  Network benchmark which simulates several workloads  Transactional workloads – 2 types  RR – a connection to the server is opened once for a 5 minute time frame  CRR – a connection is opened and closed for every request/response  4 sizes  RR 1x1 – simulating low-latency keepalives  RR 200x1000 – simulating online transactions (see the sketch below)  RR 200x32k – simulating database query  CRR 64x8k – simulating website access  Streaming workloads – 2 types  STRP/STRG – simulating incoming/outgoing large file transfers (20mx20)  All tests are done with 1, 10 and 50 simultaneous connections  All that across multiple connection types (different cards and MTU configurations) ©2012 IBM Corporation 23
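As an illustration of the transactional patterns, the sketch below implements an "RR 200x1000"-style client loop: one persistent TCP connection, a 200-byte request and a 1000-byte response per transaction. The server address, port and iteration count are placeholders, and a matching response server is assumed to exist; the measurements themselves were done with the AWM workload driver, not with this code.

/* Hypothetical build: gcc -O2 -std=gnu99 -o rr_client rr_client.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

int main(void)
{
        const char *server_ip = "10.0.0.1";     /* placeholder address */
        const int server_port = 5001;           /* placeholder port */

        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0) {
                perror("socket");
                return 1;
        }

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_port = htons(server_port);
        inet_pton(AF_INET, server_ip, &addr.sin_addr);

        if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("connect");
                return 1;
        }

        char request[200], response[1000];
        memset(request, 'r', sizeof(request));

        for (int i = 0; i < 100000; i++) {
                /* send the 200-byte request */
                if (write(sock, request, sizeof(request)) != (ssize_t)sizeof(request))
                        break;
                /* read the full 1000-byte response */
                size_t got = 0;
                while (got < sizeof(response)) {
                        ssize_t n = read(sock, response + got,
                                         sizeof(response) - got);
                        if (n <= 0)
                                goto out;
                        got += (size_t)n;
                }
        }
out:
        close(sock);
        return 0;
}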
• 24. AWM Hipersockets MTU-32k IPv4 LPAR-LPAR  More transactions / throughput with 1, 10 and 50 connections  More data transferred at 20 to 30 percent lower processor consumption  [Charts: deviation zEC12 to z196 in percent for RR/CRR transactions per second and streaming throughput, per workload (rr1x1, rr200x1000, rr200x32k, crr64x8k, strp, strg; each with 1, 10 and 50 connections)] ©2012 IBM Corporation 24
• 25. Benchmark description – Re-Aim 7  Scalability benchmark Re-Aim-7  Open source equivalent of the AIM Multiuser benchmark  Workload patterns describe system call ratios (patterns can be more IPC, disk or calculation intensive)  The benchmark scales concurrent jobs until the overall throughput drops  Starts with one job and continuously increases that number  Overall throughput usually increases until #threads ≈ #CPUs  Then threads are further increased until a drop in throughput occurs  Scales up to thousands of concurrent threads stressing the same components  Often a good check for non-scaling interfaces  Some interfaces don't scale at all (1-job throughput ≈ multiple-job throughput, despite >1 CPUs)  Some interfaces only scale in certain ranges (throughput suddenly drops earlier than expected)  Measures the number of jobs per minute that a single thread and all threads together can achieve (a minimal scaling sketch follows below)  Our setup  2, 8, 16 CPUs, 4 GiB memory, scaling until overall performance drops  Using a journaled file system on an xpram device (stresses the file system code, but is not I/O bound)  Using the fserver, new-db and compute workload patterns ©2012 IBM Corporation 25
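The scaling idea itself fits in a few lines. The sketch below is a stand-in, not Re-Aim: it runs the same small compute job with 1, 2, 4, ... worker threads and prints aggregate jobs per minute, so a non-scaling resource shows up as a flat or dropping curve. The job body, job counts and thread limits are arbitrary assumptions; real Re-Aim jobs mix system calls, IPC and file I/O.

/* Hypothetical build: gcc -O2 -std=gnu99 -pthread -o scaling scaling.c -lrt */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define JOBS_PER_THREAD 20000

/* one "job" here is a short compute loop standing in for a workload pattern */
static void *worker(void *arg)
{
        volatile double x = 1.0;
        for (int job = 0; job < JOBS_PER_THREAD; job++)
                for (int i = 0; i < 10000; i++)
                        x = x * 1.0000001 + 0.5;
        return arg;
}

int main(void)
{
        for (int nthreads = 1; nthreads <= 64; nthreads *= 2) {
                pthread_t tid[64];
                struct timespec start, end;

                clock_gettime(CLOCK_MONOTONIC, &start);
                for (int t = 0; t < nthreads; t++)
                        pthread_create(&tid[t], NULL, worker, NULL);
                for (int t = 0; t < nthreads; t++)
                        pthread_join(tid[t], NULL);
                clock_gettime(CLOCK_MONOTONIC, &end);

                double sec = (end.tv_sec - start.tv_sec) +
                             (end.tv_nsec - start.tv_nsec) / 1e9;
                double jobs = (double)nthreads * JOBS_PER_THREAD;
                printf("%2d threads: %.0f jobs per minute\n",
                       nthreads, jobs / sec * 60.0);
        }
        return 0;
}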
• 26. Re-Aim Fserver  Higher throughput with 4, 8, and 16 CPUs (25 to 50 percent) at 30 percent lower processor consumption  [Chart: Re-Aim fserver profile, 16 CPUs – throughput in jobs per minute versus number of processes (scaling to several thousand), z196 versus zEC12] ©2012 IBM Corporation 26
• 27. Re-Aim Newdb  Higher throughput with 4, 8, and 16 CPUs (42 to 66 percent) at 35 percent lower processor consumption  [Chart: Re-Aim new-db profile, 16 CPUs – throughput in jobs per minute versus number of processes (scaling to several thousand), z196 versus zEC12] ©2012 IBM Corporation 27
• 28. Re-Aim Compute  Higher throughput with 4, 8, and 16 CPUs (25 to 45 percent) at 20 to 30 percent lower processor consumption  [Chart: Re-Aim compute profile, 16 CPUs – throughput in jobs per minute versus number of processes (scaling to roughly two thousand), z196 versus zEC12] ©2012 IBM Corporation 28
• 29. DB2 database workload  Benchmark: complex database warehouse application running on DB2 V10.1  The upgrade from z196 to zEC12 provides  Throughput improvements of 30.4 percent  A reduction of processor load  A further 50.2 percent performance improvement is seen when comparing z196 to z10  [Chart: database warehouse performance normalized – transactional throughput per CPU, z196 versus zEC12] ©2012 IBM Corporation 29
• 30. Agenda  zEnterprise EC12 design  Linux performance comparison zEC12 and z196  Performance improvements in other areas  Java JRE 1.7.0 ©2012 IBM Corporation 30
• 31. Java – JRE 1.6.0 SR9 vs. JRE 1.7.0 SR1  Business operation throughput improved by 29%  2 GiB memory, 8 CPUs, 1 JVM, only the Java version substituted  JRE 1.6.0 IBM J9 2.4 SR9 20110624_85526 (JIT enabled, AOT enabled)  JRE 1.7.0 IBM J9 2.6 SR1 20120322_106209 (JIT enabled, AOT enabled)  Similar improvements were seen over the last years when upgrading to newer Java versions  Some software products are bundled with a particular Java version  In this case the software product needs an upgrade to benefit from the improved performance  [Chart: Java benchmark throughput versus number of warehouses (1 to 20), JRE 1.6.0 IBM J9 SR9 versus JRE 1.7.0 IBM J9 SR1, both on zEC12] ©2012 IBM Corporation 31
• 32. Summary  Tremendous performance gains  Performance improvements seen in close to all areas measured so far  Often combined with reduced processor consumption  More improvement than the higher clock rate alone would suggest  Clock rate is up from 5.2 GHz to 5.5 GHz, which is close to 6 percent higher  New cache setup with much bigger caches  Second-generation out-of-order execution  Better branch prediction  Some exemplary performance gains with Linux workloads  About 30 to 67 percent for Java  Up to 30 percent for a complex database workload  Up to 31 percent for single-threaded, CPU-intense workloads  About 38 to 68 percent when scaling processors and/or processes  The performance team will measure more scenarios with intense disk and network access once an exclusive zEC12 GA measurement environment with the required I/O options becomes available  No new zEC12 instructions are exploited yet because no machine-optimized GCC is available in a supported distribution yet ©2012 IBM Corporation 32
• 33. Questions  Further information is located at  Linux on System z – Tuning hints and tips http://www.ibm.com/developerworks/linux/linux390/perf/index.html  Live Virtual Classes for z/VM and Linux http://www.vm.ibm.com/education/lvc/  Mario Held, Linux on System z Performance Engineer, System Software, IBM Deutschland Research & Development, Schoenaicher Strasse 220, 71032 Boeblingen, Germany, Phone +49 (0)7031–16–4257, Email mario.held@de.ibm.com ©2012 IBM Corporation 33