A SEMINAR REPORT

                                  ON

                           “BLUE GENE/L”

Submitted in partial fulfillment of the requirements for the award of the
                                degree of

                   BACHELOR OF TECHNOLOGY

                                  IN

              COMPUTER SCIENCE AND ENGINEERING

                                  BY

                 M.S.RAMA KRISHNA ( 06M11A05A3)




          Department of Computer Science and Engineering
    BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY, CHEVELLA
        Affiliated to JNTU, HYDERABAD, Approved by AICTE

                               2009-2010
BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY
                              Chevella, R.R.Dist., A.P
      ( Approved by AICTE , New Delhi, Affiliated to J.N.T.U HYDERABAD )



                                                          Date: 30-03-2010




                              CERTIFICATE


This is to certify that this is a bona fide record of the dissertation work entitled
“BLUE GENE/L” done by M.S.RAMA KRISHNA bearing roll no. 06M11A05A3,
which has been submitted in partial fulfillment of the requirements for the award of
the degree of BACHELOR OF TECHNOLOGY in Computer Science and
Engineering, from Jawaharlal Nehru Technological University, Hyderabad, for the
academic year 2009 – 2010. This work has not been submitted to any other
University for the award of any degree / diploma.




 ( Mr. VIJAY KUMAR)                                  ( Mr. CH.RAJ KISHORE )
ACKNOWLEDGEMENT


        I wholeheartedly thank the BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY for
giving me an opportunity to present this seminar report on a technical paper in the college.


        I express my deep sense of gratitude to Mr. CH. RAJA KISHORE, Head of the Department
of Computer Science and Engineering at BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY,
for his valuable guidance, inspiration and encouragement in presenting my seminar report.


        I am also thankful to Mr. VIJAY KUMAR, my internal guide, for his valuable academic
guidance and support.


        I wholeheartedly thank all the staff members of the Department of Computer Science
and Engineering for their support and encouragement in doing my seminar work.


        Lastly, I thank all those who have helped me directly and indirectly in doing this
seminar work successfully.




                                                                  NAME: M.S.RAMAKRISHNA
                                                                  ROLL NO: 06M11A05A3
ABSTRACT
Blue Gene is a massively parallel computer being developed at the IBM Thomas J.
Watson Research Center. Blue Gene represents a hundred-fold improvement in
performance over the fastest supercomputers of today. It will achieve 1 PetaFLOP/sec
through unprecedented levels of parallelism, in excess of 4,000,000 threads of
execution. The Blue Gene project has two important goals: advancing the
understanding of biologically important processes, and advancing knowledge of
cellular architectures (massively parallel systems built of single-chip cells that
integrate processors, memory and communication) and of the software needed to
exploit them effectively. This massively parallel system of 65,536 nodes is based on
a new architecture that exploits system-on-a-chip technology to deliver a target peak
processing power of 360 teraFLOPS (trillion floating-point operations per second).
The machine is scheduled to be operational in the 2004-2005 time frame, at
performance and power consumption/performance targets unobtainable with
conventional architectures.
In November 2001, IBM announced a partnership with Lawrence Livermore
National Laboratory to build the Blue Gene/L (BG/L) supercomputer, a 65,536-node
machine designed around embedded PowerPC processors. Through the use of
system-on-a-chip integration coupled with a highly scalable cellular architecture,
Blue Gene/L will deliver 180 or 360 teraFLOPS of peak computing power,
depending on the utilization mode. Blue Gene/L represents a new level of scalability
for parallel systems. Whereas existing large-scale systems range in size from
hundreds to a few thousand compute nodes, Blue Gene/L makes a jump of almost
two orders of magnitude. It is reasonably clear that such machines, in the near future
at least, will require a departure from the architectures of current parallel
supercomputers, which use a few thousand commodity microprocessors. With
current technology, it would take around a million microprocessors to achieve
petaFLOPS performance; clearly, power requirements and cost considerations alone
preclude this option. Using such a design, petaFLOPS performance will be reached
within the next 2-3 years, especially since IBM has announced the Blue Gene project
aimed at building such a machine.
Index
Contents

Chapter 1    Introduction

Chapter 2    Detailed report

Chapter 3    Applications

Chapter 4    Advantages

Chapter 5    Disadvantages

Chapter 6    Conclusions

References
1.Introduction
Blue Gene is a computer architecture project designed to
produce several supercomputers, designed to reach
operating speeds in the PFLOPS (petaFLOPS) range, and
currently reaching sustained speeds of nearly 500 TFLOPS
(teraFLOPS). It is a cooperative project among IBM
(particularly IBM Rochester and the Thomas J. Watson
Research Center), the Lawrence Livermore National
Laboratory, the United States Department of Energy (which
is partially funding the project), and academia. There are
four Blue Gene projects in development: Blue Gene/L,
Blue Gene/C, Blue Gene/P, and Blue Gene/Q.
The project was awarded the National Medal of
Technology and Innovation by U.S. President Barack
Obama on September 18, 2009. The president bestowed the
award on October 7, 2009.[1]
2.DESCRIPTION
The first computer in the Blue Gene series, Blue Gene/L,
developed through a partnership with Lawrence Livermore
National Laboratory (LLNL), originally had a theoretical
peak performance of 360 TFLOPS, and scored over 280
TFLOPS sustained on the Linpack benchmark. After an
upgrade in 2007 the performance increased to 478 TFLOPS
sustained and 596 TFLOPS peak. The term Blue Gene/L
sometimes refers to the computer installed at LLNL; and
sometimes refers to the architecture of that computer. As of
November 2006, there are 27 computers on the Top500 list
using the Blue Gene/L architecture. All these computers are
listed as having an architecture of eServer Blue Gene
Solution.
In December 1999, IBM announced a $100 million
research initiative for a five-year effort to build a massively
parallel computer, to be applied to the study of
biomolecular phenomena such as protein folding. The
project has two main goals: to advance our understanding
of the mechanisms behind protein folding via large-scale
simulation, and to explore novel ideas in massively parallel
machine architecture and software. This project should
enable biomolecular simulations that are orders of
magnitude larger than current technology permits. Major
areas of investigation include: how to use this novel
platform to effectively meet its scientific goals, how to
make such massively parallel machines more usable, and
how to achieve performance targets at a reasonable cost,
through novel machine architectures. The design is built
largely around the previous QCDSP and QCDOC
supercomputers.
In November 2001, Lawrence Livermore National
Laboratory joined IBM as a research partner for Blue Gene.
On September 29, 2004, IBM announced that a Blue Gene/L
prototype at IBM Rochester (Minnesota) had overtaken
NEC's Earth Simulator as the fastest computer in the world,
with a speed of 36.01 TFLOPS on the Linpack benchmark,
beating Earth Simulator's 35.86 TFLOPS. This was
achieved with an 8-cabinet system, with each cabinet
holding 1,024 compute nodes. Upon doubling this
configuration to 16 cabinets, the machine reached a speed
of 70.72 TFLOPS by November 2004, taking first place in
the Top500 list.
On March 24, 2005, the US Department of Energy
announced that the Blue Gene/L installation at LLNL broke
its speed record, reaching 135.5 TFLOPS. This feat was
possible because of doubling the number of cabinets to 32.
On the Top500 list,[2] Blue Gene/L installations across
several sites worldwide took 3 out of the 10 top positions,
and 13 out of the top 64. Three racks of Blue Gene/L are
housed at the San Diego Supercomputer Center and are
available for academic research.
On October 27, 2005, LLNL and IBM announced that Blue
Gene/L had once again broken its speed record, reaching
280.6 TFLOPS on Linpack, upon reaching its final
configuration of 65,536 "compute nodes" (i.e., 2^16 nodes)
and an additional 1024 "I/O nodes" in 64 air-cooled
cabinets. The LLNL Blue Gene/L uses Lustre to access
multiple filesystems in the 600 TB to 1 PB range [3].
Blue Gene/L is also the first supercomputer ever to run
over 100 TFLOPS sustained on a real world application,
namely a three-dimensional molecular dynamics code
(ddcMD), simulating solidification (nucleation and growth
processes) of molten metal under high pressure and
temperature conditions. This achievement won the 2005
Gordon Bell Prize.
On June 22, 2006, NNSA and IBM announced that Blue
Gene/L has achieved 207.3 TFLOPS on a quantum
chemical application (Qbox).[4] On November 14, 2006, at
Supercomputing 2006,[5] Blue Gene/L was awarded the
winning prize in all HPC Challenge Classes of awards.[6]
A team from the IBM Almaden Research Center and the
University of Nevada on April 27, 2007 ran an artificial
neural network almost half as complex as the brain of a
mouse for the equivalent of a second (the network was run
at 1/10 of normal speed for 10 seconds).[7]
In November 2007, the LLNL Blue Gene/L remained at the
number one spot as the world's fastest supercomputer. It
had been upgraded since the previous measurement, and
was then almost three times as fast as the second fastest, a
Blue Gene/P system.
On June 18, 2008, the new Top500 List marked the first
time a Blue Gene system was not the leader in the Top500
since it had assumed that position, being topped by IBM's
Cell-based Roadrunner system, which was the only system
to surpass the petaflops mark. Top500 announced that the
Cray XT5 Jaguar, housed at the Oak Ridge Leadership
Computing Facility (OLCF), is currently the fastest
supercomputer in the world for open science.
Major features
The Blue Gene/L supercomputer is unique in the following
aspects:
1.Trading the speed of processors for lower power
consumption.
2.Dual processors per node with two working modes: co-
processor (1 user process/node: computation and
communication work is shared by two processors) and
virtual node (2 user processes/node)
3.System-on-a-chip design
4.A large number of nodes (scalable in increments of 1024
up to at least 65,536)
5.Three-dimensional torus interconnect with auxiliary
networks for global communications, I/O, and management
6.Lightweight OS per node for minimum system overhead
(computational noise)
Architecture
[Figure: One Blue Gene/L node board, and a schematic overview of a Blue Gene/L supercomputer]
Each Compute or I/O node is a single ASIC with associated
DRAM memory chips. The ASIC integrates two 700 MHz
PowerPC 440 embedded processors, each with a double-
pipeline-double-precision Floating Point Unit (FPU), a
cache sub-system with built-in DRAM controller and the
logic to support multiple communication sub-systems. The
dual FPUs give each Blue Gene/L node a theoretical peak
performance of 5.6 GFLOPS (gigaFLOPS). Node CPUs are
not cache coherent with one another.
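As a rough cross-check of these figures (assuming each double FPU can issue two
fused multiply-add operations, i.e. four floating-point operations, per cycle per core):

    700 MHz x 4 flops/cycle/core x 2 cores/node  = 5.6 GFLOPS per node
    5.6 GFLOPS/node x 65,536 nodes              ~= 367 TFLOPS,
    consistent with the 360 TFLOPS peak quoted above.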
Compute nodes are packaged two per compute card, with
16 compute cards plus up to 2 I/O nodes per node board.
There are 32 node boards per cabinet/rack.[9] By
integration of all essential sub-systems on a single chip,
each Compute or I/O node dissipates low power (about 17
watts, including DRAMs). This allows very aggressive
packaging of up to 1024 compute nodes plus additional I/O
nodes in the standard 19" cabinet, within reasonable limits
of electrical power supply and air cooling. The
performance metrics in terms of FLOPS per watt, FLOPS
per m2 of floorspace and FLOPS per unit cost allow scaling
up to very high performance.
Each Blue Gene/L node is attached to three parallel
communications networks: a 3D toroidal network for peer-
to-peer communication between compute nodes, a
collective network for collective communication, and a
global interrupt network for fast barriers. The I/O nodes,
which run the Linux operating system, provide
communication with the world via an Ethernet network.
The I/O nodes also handle the filesystem operations on
behalf of the compute nodes. Finally, a separate and private
Ethernet network provides access to any node for
configuration, booting and diagnostics.
Blue Gene/L compute nodes use a minimal operating
system supporting a single user program. Only a subset of
POSIX calls are supported, and only one process may be
run at a time. Programmers need to implement green
threads in order to simulate local concurrency.
Application development is usually performed in C, C++,
or Fortran using MPI for communication. However, some
scripting languages such as Ruby have been ported to the
compute nodes.[10]
To allow multiple programs to run concurrently, a Blue
Gene/L system can be partitioned into electronically
isolated sets of nodes. The number of nodes in a partition
must be a positive integer power of 2, and must contain at
least 2^5 = 32 nodes. The maximum partition is all nodes in
the computer. To run a program on Blue Gene/L, a partition
of the computer must first be reserved. The program is then
run on all the nodes within the partition, and no other
program may access nodes within the partition while it is in
use. Upon completion, the partition nodes are released for
future programs to use.
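As a small illustration of this sizing rule (a sketch only, not part of any Blue Gene
API), a request for an n-node partition is acceptable only if n is a power of two and
at least 32:

    #include <stdbool.h>

    /* Returns true if n is an acceptable Blue Gene/L partition size:
       a positive power of two, and at least 2^5 = 32 nodes. */
    static bool valid_partition_size(unsigned long n)
    {
        return n >= 32 && (n & (n - 1)) == 0;
    }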
With so many nodes, component failures are inevitable.
The system is able to electrically isolate faulty hardware to
allow    the    machine     to    continue     to    run.




OPERATING SYSTEMS
Front-end nodes are commodity PCs running Linux. I/O
nodes run a customized Linux kernel. Compute nodes use
an extremely lightweight custom kernel. The service node is
a single multiprocessor machine running a custom OS.
COMPUTE NODE KERNEL
The Compute Node Kernel is single-user and dual-threaded, with a flat address
space and no paging. Physical resources are memory-mapped, and the kernel
provides standard POSIX functionality (mostly). There are two execution modes:
1.Virtual node mode
2.Coprocessor mode
SERVICE NODE OS
The service node runs the Core Management and Control System (CMCS), BG/L's
"global" operating system. Its main components are:
1.MMCS - Midplane Monitoring and Control System
2.CIOMAN - Control and I/O Manager
3.DB2 relational database
Programming modes
This chapter provides information about the way in which
Message Passing Interface (MPI) is implemented and used
on Blue Gene/L. There are two main modes in which you
can use Blue Gene/L:
1.Communication Coprocessor Mode
2. Virtual Node Mode
1.Communication Coprocessor Mode
In the default mode of operation of Blue Gene/L, named Communication
Coprocessor Mode, each physical compute node executes a single compute process.
The Blue Gene/L system software treats the two processors in a compute node
asymmetrically. One of the processors (CPU 0) behaves as the main processor,
running the main thread of the compute process. The other processor (CPU 1)
behaves as an offload engine (coprocessor) that only executes specific operations.
The coprocessor is used primarily for offloading communication functions. It can
also be used for running application-level coroutines.
2.Virtual Node Mode
The Compute Node Kernel in the compute nodes also
supports a Virtual Node Mode of operation for the
machine. In that mode, the kernel runs two separate
processes in each compute node. Node resources (primarily
the memory and the torus network) are shared by both
processes.
In Virtual Node Mode, an application can use both processors in a node simply by
doubling its number of MPI tasks, without explicitly handling cache coherence
issues. The now distinct MPI tasks running in the two CPUs of a compute node have
to communicate with each other. This problem was solved by implementing a virtual
torus device, serviced by a virtual packet layer, in the scratchpad memory. In Virtual
Node Mode, the two cores of a compute node act as different processes. Each has its
own rank in the message layer. The message layer supports Virtual Node Mode by
providing a correct torus to rank mapping and first in, first out (FIFO) pinning in
this mode.
The hardware FIFOs are shared equally between the
processes. Torus coordinates are expressed by quadruplets
instead of triplets. In Virtual Node Mode, communication
between the two processors in a compute node cannot be
done over the network hardware. Instead, it is done via a
region of memory, called the scratchpad, to which both
processors have access. Virtual FIFOs make portions of the
scratchpad look like a send FIFO to one of the processors
and a receive FIFO to the other. Access to the virtual FIFOs
is mediated with help from the hardware lockboxes. From
an application perspective, virtual nodes behave like
physical nodes, but with less memory. Each virtual node
executes one compute process. Processes in different
virtual nodes, even those allocated in the same compute
node, only communicate through messages. Processes
running in virtual node mode cannot invoke coroutines.
The Blue Gene/L MPI implementation supports Virtual Node Mode operations by
sharing the system's communication resources of a physical compute node between
the two compute processes that execute on that physical node. The low-level
communications library of Blue Gene/L, that is, the message layer, virtualizes these
communication resources into logical units that each process can use independently.
 Deciding which mode to use
Whether you choose to use Communication Coprocessor
Mode or Virtual Node Mode depends largely on the type of
application you plan to execute.
I/O intensive tasks that require a relatively large amount of
data interchange between compute nodes benefit more by
using Communication Coprocessor Mode. Those applications that are primarily
CPU bound, and do not have large working memory requirements (the application
only gets half of the node memory), run more quickly in Virtual Node Mode.
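In either mode the application is an ordinary MPI program; the mode only
determines how many MPI tasks run per physical node (one in Communication
Coprocessor Mode, two in Virtual Node Mode). A minimal sketch in C, assuming a
standard MPI installation, that divides work by rank and therefore runs unchanged
in both modes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this task's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total tasks: one per node in
                                                  Coprocessor Mode, two per node
                                                  in Virtual Node Mode */

        /* Each task works on its own slice of a problem of N items. */
        const long N = 1000000;
        long chunk = (N + size - 1) / size;
        long begin = (long)rank * chunk;
        long end   = begin + chunk > N ? N : begin + chunk;

        double local = 0.0;
        for (long i = begin; i < end; i++)
            local += (double)i;

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f computed by %d tasks\n", total, size);

        MPI_Finalize();
        return 0;
    }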




System calls supported by the Compute Node Kernel
This chapter discusses the system calls (syscalls) that are supported by the Compute
Node Kernel. It is important for you to understand which functions can be called,
and perhaps more importantly, which ones cannot be called, by your application
running on Blue Gene/L.

Introduction to the Compute Node Kernel
The role of the kernel on the Compute Node is to create an
environment for the execution of a user process which is
“Linux-like.” It is not a full Linux kernel implementation,
but rather implements a subset of POSIX functionality.
The Compute Node Kernel is a single-process operating system. It is designed to
provide the services that are needed by applications which are expected to run on
Blue Gene/L, but not for all applications. The Compute Node Kernel is not intended
to run system administration functions from the compute node. To achieve the best
reliability, a small and simple kernel is a design goal. This enables a simpler
checkpoint function. The compute node application never runs as the root user. In
fact, it runs as the same user (uid) and group (gid) under which the job was
submitted.

System calls
The Compute Node Kernel system calls are subdivided into
the following categories (a short usage sketch follows the list):
1.File I/O
2.Directory operations
3.Time
4.Process information
5.Signals
6.Miscellaneous
7.Sockets
8.Compute Node Kernel
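As a small illustration (generic POSIX code, not a Blue Gene-specific API), the
sketch below touches three of these categories - time, process information, and file
I/O - with calls that the Compute Node Kernel either handles locally or ships to the
I/O node:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);                 /* Time category */

        pid_t pid = getpid();                    /* Process information */

        char buf[128];
        int len = snprintf(buf, sizeof(buf),
                           "pid %ld started at %ld\n",
                           (long)pid, (long)tv.tv_sec);

        /* File I/O: open/write/close are function-shipped to the I/O node. */
        int fd = open("run.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd >= 0) {
            write(fd, buf, (size_t)len);
            close(fd);
        }
        return 0;
    }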
Additional Compute Node Kernel application support
This section provides details about additional support
provided to application developers by the Compute Node
Kernel.
3.3.1 Allocating memory regions with specific L1 cache
attributes
Each PowerPC® 440 core on a compute node has a 32 KB
L1 data cache that is 64-way set associative and uses a 32-
byte cache line. A load or store operation to a virtual
address is translated by the hardware to a real address using
the Translation Lookaside Buffer (TLB). The real address
is used to select the “set.” If the address is available in the
data cache, it is returned to the processor without needing
to access lower levels of the memory subsystem. If the
address is not in the data cache (a cache miss), a cache line
is evicted from the data cache and is replaced with the
cache line containing the address. The “way” to evict from
the data cache is selected using a round-robin algorithm.
The L1 data cache can be divided into two regions: a
normal region and a transient region. The number of
“ways” to use for each region can be configured by the
application. The Blue Gene/L memory subsystem supports
the following L1 data cache attributes:

Cache-inhibited or cached
Memory with the cache-inhibited attribute causes all load and store operations to
access the data from lower levels of the memory subsystem. Memory with the
cached attribute might use the data cache for load and store operations. The default
attribute for application memory is cached.

Store without allocate (SWOA) or store with allocate (SWA)
Memory with the SWOA attribute bypasses the L1 data cache on a cache miss for a
store operation, and the data is stored directly to lower levels of the memory
subsystem. Memory with the SWA attribute allocates a line in the L1 data cache
when there is a cache miss for a store operation on the memory. The default attribute
for application memory is SWA.

Write-through or write-back
Memory with the write-through attribute is written through to the lower levels of the
memory subsystem for store operations. If the memory also exists in the L1 data
cache, it is written to the data cache and the cache line is marked as clean. Memory
with the write-back attribute is written to the L1 data cache, and the cache line is
marked as dirty.
Transient or normal
Memory with the transient attribute uses the transient
region of the L1 data cache. Memory with the normal
attribute uses the normal region of the L1 data cache. By
default, the L1 data cache is configured without a transient
region and all application memory uses the normal region.
Checkpoint and restart support
Why use checkpoint and restart
Given the scale of the Blue Gene/L system, faults are expected to be the norm rather
than the exception. This is unfortunately inevitable, given the vast number of
individual hardware processors and other components involved in running the
system. Checkpoint and restart are among the primary techniques for fault recovery.
A special user-level checkpoint library has been developed for Blue Gene/L
applications. Using this library, application programs can take a checkpoint of their
program state at appropriate stages and can be restarted later from their last
successful checkpoint.
Why should you be interested in this support? Among the
following examples, numerous
scenarios indicate that use of this support is warranted:
1.Your application is a long-running one. You do not want it to fail a long time into
a run, losing all the calculations made up until the failure. Checkpoint and restart
allow you to restart the application at the last checkpoint position, losing a much
smaller slice of processing time.
2.You are given access to a Blue Gene/L system for relatively small increments of
time, and you know that your application run will take longer than your allotted
amount of processing time. Checkpoint and restart allow you to execute your
application to completion in distinct "chunks," rather than in one continuous period
of time.
These are just two of many reasons to use checkpoint and restart support in your
Blue Gene/L applications.
 7.2.1 Input/output considerations
All the external I/O calls made from a program are shipped
to the corresponding I/O Node using a function shipping
procedure implemented in the Compute Node Kernel The
checkpoint library intercepts calls to the five main file I/O
functions: open, close, read, write, and lseek. The function
name open is a weak alias that maps to the _libc_open
function. The checkpoint library intercepts this call and
provides its own implementation of open that internally
uses the _libc_open function The library maintains a file
state table that stores the file name and current file position
and the mode of all the files that are currently open. The
table also maintains a translation that translates the file
descriptors used by the Compute Node Kernel to another
set of file descriptors to be used by the application. While
taking a checkpoint, the file state table is also stored in the
checkpoint file. Upon a restart, these tables are read. Also
the corresponding files are opened in the required mode,
and the file pointers are positioned at the desired locations
as given in the checkpoint file. The current design assumes
that the programs either always read the file or write the
files sequentially. A read followed by an overlapping write,
or a write followed by an overlapping read, is not
supported.
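A minimal sketch of this interception pattern in C (assuming a GNU toolchain where
__libc_open is available; the file-state table layout here is hypothetical and far
simpler than the real checkpoint library's):

    #include <fcntl.h>
    #include <stdarg.h>
    #include <string.h>
    #include <sys/types.h>

    /* The real open() provided by the C library. */
    extern int __libc_open(const char *path, int flags, ...);

    /* Hypothetical file-state table: name, flags, and current position. */
    #define MAX_FILES 256
    static struct { char name[256]; int flags; long pos; int used; }
        file_state[MAX_FILES];

    /* Our replacement open(): record the file, then call the real open. */
    int open(const char *path, int flags, ...)
    {
        mode_t mode = 0;
        if (flags & O_CREAT) {              /* mode is only present with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        int fd = __libc_open(path, flags, mode);
        if (fd >= 0 && fd < MAX_FILES) {    /* remember state for checkpointing */
            strncpy(file_state[fd].name, path, sizeof(file_state[fd].name) - 1);
            file_state[fd].flags = flags;
            file_state[fd].pos = 0;
            file_state[fd].used = 1;
        }
        return fd;
    }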
7.2.2 Signal considerations
Applications can register handlers for signals using the
signal() function call. The checkpoint library intercepts
calls to signal() and installs its own signal handler instead.
It also updates a signal-state table that stores the address of
the signal handler function (sighandler) registered for each
signal (signum). When a signal is raised, the checkpoint
signal handler calls the appropriate application handler
given in the signal-state table. While taking checkpoints,
the signal-state table is also stored in the checkpoint file in
its signal-state section. At the time of restart, the signal-
state table is read, and the checkpoint signal handler is
installed for all the signals listed in the signal state table.
The checkpoint handler calls the required application
handlers when needed.

Signals during checkpoint
The application can potentially receive signals while the checkpoint is in progress.
If the application signal handlers are called while a checkpoint is in progress, they
can change the state of the memory being checkpointed. This might make the
checkpoint inconsistent. Therefore, the signals arriving while a checkpoint is in
progress need to be handled carefully. For certain signals, such as
SIGKILL and SIGSTOP, the action is fixed and the
application terminates without much choice. The signals
without any registered handler are simply ignored. For
signals with installed handlers, there are two choices:
1. Deliver the signal immediately
2.Postpone the signal delivery until the checkpoint is
complete
All signals are classified into one of these two categories as
shown in Table 7-1. If the signal must be delivered
immediately, the memory state of the application might
change, making the current checkpoint file inconsistent.
Therefore, the current checkpoint must be aborted. The
checkpoint routine periodically checks if a signal has been
delivered since the currentcheckpoint began. In case a
signal has been delivered, it aborts the current checkpoint
and returns to the application.
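A minimal sketch of this interception idea in C (the table size, flag names, and use
of sigaction() internally are illustrative assumptions, not the actual library
internals):

    #include <signal.h>
    #include <stddef.h>
    #include <string.h>

    typedef void (*handler_fn)(int);

    #define NSIGS 64
    static handler_fn app_handler[NSIGS];            /* signal-state table */
    static volatile sig_atomic_t checkpoint_in_progress;
    static volatile sig_atomic_t signal_arrived;

    /* Handler installed by the checkpoint library for every intercepted signal. */
    static void checkpoint_sig_handler(int signum)
    {
        if (checkpoint_in_progress)
            signal_arrived = 1;     /* checkpoint routine will notice and abort */
        if (signum > 0 && signum < NSIGS && app_handler[signum] != NULL)
            app_handler[signum](signum);  /* forward to the application handler */
    }

    /* Replacement for signal(): remember the application's handler in the
       signal-state table and install the library's handler instead. */
    handler_fn signal(int signum, handler_fn handler)
    {
        if (signum <= 0 || signum >= NSIGS)
            return SIG_ERR;

        handler_fn old = app_handler[signum];
        app_handler[signum] = handler;

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = checkpoint_sig_handler;
        sigaction(signum, &sa, NULL);
        return old;
    }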
Checkpoint API
The checkpoint interface consists of the following items:
1.A set of library functions that are used by the application developer to
"checkpoint enable" the application
2.A set of conventions used to name and store the checkpoint files
3.A set of environment variables used to communicate with the application
The following sections describe each of these components in detail.

Restart
A transparent restart mechanism is provided through the use of the
BGLCheckpointInit() function and the BGL_CHKPT_RESTART_SEQNO
environment variable. Upon startup, an application is expected to make a call to
BGLCheckpointInit(). The BGLCheckpointInit() function initializes the checkpoint
library data structures. Moreover, the BGLCheckpointInit() function checks for the
environment variable BGL_CHKPT_RESTART_SEQNO. If the variable is not set,
a job launch is assumed and the function returns normally. In case the environment
variable is set to zero, the individual processes restart from their individual latest
consistent global checkpoint. If the variable is set to a positive integer, the
application is started from the specified checkpoint sequence number.
Checkpoint and restart functionality
It is often desirable to enable or disable the checkpoint functionality at the time of
job launch. Application developers are not required to provide two versions of their
programs: one with checkpoint enabled and another with checkpoint disabled. We
have used environment variables to transparently enable and disable the checkpoint
and restart functionality. The checkpoint library calls check for the environment
variable BGL_CHKPT_ENABLED. The checkpoint functionality is invoked only if
this environment variable is set to a value of "1."
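Putting the pieces above together, a checkpoint-enabled application might look
roughly like the sketch below. BGLCheckpointInit() and the environment variables
are described above; the exact signatures, and the BGLCheckpoint() call used here
to take a checkpoint, are assumptions for illustration only:

    /* Assumed declarations for the user-level checkpoint library. */
    extern void BGLCheckpointInit(char *ckpt_dir_path);
    extern void BGLCheckpoint(void);

    #define STEPS            100000
    #define CHECKPOINT_EVERY 1000

    int main(void)
    {
        /* On a fresh launch BGL_CHKPT_RESTART_SEQNO is unset; on a restart it
           selects the checkpoint to resume from (0 = latest consistent one). */
        BGLCheckpointInit(NULL);

        for (int step = 0; step < STEPS; step++) {
            /* ... one unit of the long-running computation ... */

            if (step > 0 && step % CHECKPOINT_EVERY == 0)
                BGLCheckpoint();  /* a no-op unless BGL_CHKPT_ENABLED=1 */
        }
        return 0;
    }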
High Throughput Computing on
Blue Gene/L
The original focus of the Blue Gene project was to create a
high performance computer with a small footprint that
consumed a relatively small amount of power. The model
of running parallel applications (typically using Message
Passing Interface (MPI)) on Blue Gene/L is known as High
Performance Computing (HPC). Recent research has
shown that Blue Gene/L also provides a good platform for
High Throughput Computing (HTC).
3.Applications: MPI or HTC
As previously stated, the Blue Gene architecture targeted
MPI applications for optimal execution. These applications
are characterized as Single Program Multiple Data (SPMD), with synchronized
communication and execution. The tasks in an MPI program are cooperating to solve a
single problem. Because of this close cooperation, a failure
in a single node, software or hardware, requires the
termination of all nodes. HTC applications have different
characteristics. The code that executes on each node is
independent of work that is being done on another node;
communication between nodes is not required. At any
specific time, each node is solving its own problem. As a
result, a failure on a single node, software or hardware,
does not necessitate the termination of all nodes. Initially,
in order to run these applications on a Blue Gene
supercomputer, a port to the MPI programming model was
required. This was done successfully for several
applications, but was not an optimal solution in some cases.
Some MPI applications may benefit by being ported to the
HTC model. In particular, some embarrassingly parallel
MPI applications may be good candidates for HTC mode
because they do not require communication between the
nodes. Also the failure of one node does not invalidate the
work being done on other nodes. A key advantage of the
MPI model is a reduction of extraneous bookkeeping by the
application. An MPI program is coded to handle data
distribution, minimize I/O by having all I/O done by one
node (typically rank 0), and distribute the work to the other
MPI ranks. When running in HTC mode, the application
data needs to be manually split up and distributed to the
nodes. This porting effort may be justified to achieve
better application reliability and throughput than could be
achieved with an MPI model.
 HTC mode
In HTC mode, the compute nodes in a partition are running
independent programs that do not communicate with each
other. A launcher program (running on a compute node)
requests work from a dispatcher that is running on a remote
system that is connected to the functional network. Based
on information provided by the dispatcher, the launcher
program spawns a worker program that performs some
task. When running in HTC mode, there are two basic
operational differences over the default HPC mode. First,
after the worker program completes, the launcher program
is reloaded so it can handle additional work requests.
Second, if a compute node encounters a soft error, such as a
parity error, the entire partition is not terminated. Rather,
the control system attempts to reset the single compute
node while other compute nodes continue to operate. The
control system polls hardware on a regular interval looking
for compute nodes in a reset state. If a failed node is
discovered and the node failure is not due to a network
hardware error, a software reboot is attempted to recover
the compute node.
On a remote system that is connected to the functional
network, which could be a Service Node or a Front End
Node, there is a dispatcher that manages a queue of jobs to
be run.
There is a client/server relationship between the launcher
program and the dispatcher program. After the launcher
program is started on a compute node, it connects back to
the dispatcher and indicates that it is ready to receive work
requests. When the dispatcher has a job for the launcher, it
responds to the work request by sending the launcher
program the job related information, such as the name of
the worker program executable, arguments, and
environment variables. The launcher program then spawns
off the worker program. When the worker program
completes, the launcher is reloaded and repeats the process
of requesting work from the dispatcher.
Launcher program
In the default HPC mode, when a program ends on the
compute node, the Compute Node Kernel sends a message
to the I/O node that reports how the node ended. The
message indicates if the program ended normally or by a
signal and the exit value or signal number, respectively.
The I/O node then forwards the message to the control
system. When the control system has received messages for
all of the compute nodes, it then ends the job. In HTC
mode, the Compute Node Kernel handles a program ending
differently depending on the program that ended. The
Compute Node Kernel records the path of the program that
is first submitted with the job, which is the launcher
program. When a program other than the launcher program
(or the worker program) ends, the Compute Node Kernel
records the exit status of the worker program, and then
reloads and restarts the launcher program. If the worker
program ended by a signal, the Compute Node Kernel
generates an event to record the signal number that ended
the program. If the worker program ended normally, no
information is logged. The launcher program can retrieve
the exit status of the worker program using a Compute
Node Kernel system call.
Launcher run sequence
Since no message is sent to the control system indicating
that a program ended, the job continues running. The effect
is to have a continually running program on the compute
nodes. To reduce the load on the file system, the launcher
program is cached in memory on the I/O node. When the
Compute Node Kernel requests to reload the launcher
program, it does not need to be read from the file system
but can be sent directly to the compute node from memory.
Since the launcher is typically a small executable, it does
not require much additional memory to cache it. When the
launcher program ends, the Compute Node Kernel reports
that a program ended to the control system as it does for
HPC mode. This allows the launcher program to cleanly
end on the compute node and for the control system to end
the job.

Template for an asynchronous task dispatch subsystem
This section gives a high-level overview of one possible method to implement an
asynchronous task dispatch subsystem. The components in our example are the
client, the dispatcher program, and the launcher program.
Asynchronous task dispatch subsystem model
Clients
The client implements a task submission thread that
publishes task submission messages onto a work queue. It
also implements a task verification thread that listens for
task completion messages. When the dispatch system
informs the client that the task has terminated, and
optionally supplies an exit status, the client is then
responsible for resolving task completion status and for
taking actions, including relaunching tasks. Keep in mind
that the client is ultimately responsible for ensuring
successful completion of the job.
Message queue
The design is based on publishing and subscribing to
message queues. Clients publish task submission messages
onto a single work queue. Dispatcher programs subscribe to
the work queue and process task submission messages.
Clients subscribe to task completion messages.Messages
consist of text data that is comprised of the work to be
performed, a job identifier, a task identifier, and a message
type. Job identifiers are generated by the client process and
are required to be globally unique. Task identifiers are
unique within the job session. Themessage type field for
task submission messages is used by the dispatcher to
distinguish work of high priority versus work of normal
priority. Responsibility for reliable messagedelivery
belongs to the message queueing system.
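For illustration only (the source does not specify the wire format), a task submission
message carrying these fields might be represented and serialized as follows; every
name here is hypothetical:

    #include <stdio.h>

    /* Hypothetical task submission message: message type, globally unique
       job identifier, session-unique task identifier, and the work itself. */
    struct task_msg {
        char type[8];        /* e.g. "NORMAL" or "HIGH" priority */
        char job_id[64];     /* generated by the client, globally unique */
        unsigned task_id;    /* unique within the job session */
        char work[256];      /* command line of the worker program */
    };

    /* Serialize the message as one line of text for the work queue. */
    static int format_task_msg(const struct task_msg *m, char *buf, size_t len)
    {
        return snprintf(buf, len, "%s|%s|%u|%s",
                        m->type, m->job_id, m->task_id, m->work);
    }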
Dispatcher program
The dispatcher program first pulls a task submission
message off the work queue. Then it waits on a socket for a
launcher connection and reads the launcher ID from the
socket. It writes the task into the socket, and the association
between task and launcher is stored in a table. The table
stores the last task dispatched to the launcher program. This
connection is an indication that the last task has completed
and the task completion message can be published back to
the client. Figure 12-3 shows the entire cycle of a job
submitted in HTC mode.
[Figure 12-3: HTC job cycle]
The intention of this design is to optimize the launcher
program. The dispatcher program spends little time
between connect and dispatch, so latency volatility is
mainly due to the waiting time for dispatcher program
connections. After rebooting, the launcher program
connects to the dispatcher program and passes the
completion information back to the dispatcher program. To
assist task status resolution, the Compute Node Kernel
stores the exit status of the last running process in a buffer.
After the launcher program restarts, the contents of this
buffer can be written to the dispatcher and stored in the task
completion message.


Launcher program
The launcher program is intentionally kept simple.
Arguments to the launcher program describe a socket
connection to the dispatcher. When the launcher program
starts, it connects to this socket, writes its identity into the
socket, and waits for a task message. Upon receipt of the
task message, the launcher parses the message and calls the
execve system call to execute the task. When the task exits
(for any reason), the Compute Node Kernel restarts the
launcher program again. The launcher program is not a
container for the application. Therefore, regardless of what
happens to the application, the launcher program will not
fail to restart.
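A minimal sketch of such a launcher in C (the socket details, identity handshake,
and message format are illustrative assumptions; on the real system the Compute
Node Kernel restarts this program after each task):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <dispatcher-ip> <port>\n", argv[0]);
            return 1;
        }

        /* Connect to the dispatcher socket described by our arguments. */
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons((uint16_t)atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &addr.sin_addr);
        if (sock < 0 || connect(sock, (struct sockaddr *)&addr, sizeof addr) < 0)
            return 1;

        /* Write our identity into the socket, then wait for a task message. */
        char id[64];
        snprintf(id, sizeof id, "launcher-%ld\n", (long)getpid());
        write(sock, id, strlen(id));

        char task[1024];
        ssize_t n = read(sock, task, sizeof task - 1);
        close(sock);
        if (n <= 0)
            return 1;
        task[n] = '\0';

        /* Parse the task (here just a worker program path) and exec the worker.
           When the worker exits, the Compute Node Kernel reloads the launcher
           so the cycle can repeat. */
        char *newline = strchr(task, '\n');
        if (newline)
            *newline = '\0';
        char *args[] = { task, NULL };
        execve(task, args, NULL);   /* returns only on error */
        return 1;
    }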
I/O considerations
Running in HTC mode changes the I/O patterns. When a
program reads and writes to the file system, it is typically
done with small buffer sizes. While the administrator can
configure the buffer size, the most common sizes are 256K
or 512K. When running in HTC mode, loading the worker
program requires reading the complete executable into
memory and sending it to a compute node. An executable is
at least several megabytes and can be many megabytes. To
achieve the fastest I/O throughput, a low compute node to
I/O node ratio is best.
4.ADVANTAGES
1.Scalable
2.Less space (about half of a tennis court)
3.Avoids the heat problems most supercomputers face
4.Speed
5.DISADVANTAGES
Important safety notices
Here are some important general comments about the Blue
Gene/L system regarding safety.
1.This equipment must be installed by trained service personnel in a restricted
access location as defined by the NEC (U.S. National Electric Code) and IEC 60950,
The Standard for Safety of Information Technology Equipment. (C033)
2.The doors and covers to the product are to be closed at all times except for service
by trained service personnel. All covers must be replaced and doors locked at the
conclusion of the service operation. (C013)
3.Servicing of this product or unit is to be performed by trained service personnel
only. (C032)
4.This product is equipped with a 4-wire (three-phase and ground) power cable. Use
this power cable with a properly grounded electrical outlet to avoid electrical shock.
(C019)
5.To prevent a possible electric shock from touching two surfaces with different
protective ground (earth), use one hand, when possible, to connect or disconnect
signal cables.
6.An electrical outlet that is not correctly wired could place hazardous voltage on
the metal parts of the system or the devices that attach to the system. It is the
responsibility of the customer to ensure that the outlet is correctly wired and
grounded to prevent an electrical shock. (D004)
6.CONCLUSIONS
1.BG/L shows that a cell architecture for supercomputers is
feasible.
2.Higher performance with a much smaller footprint and lower power
requirements.
3.In theory, no limits to scalability of a BlueGene system.
7.REFERENCES
1.IBM Redbooks publications
2.General Parallel File System (GPFS) for Clusters: Concepts, Planning, and
Installation, GA22-7968
3.IBM General Information Manual, Installation Manual - Physical Planning,
GC22-7072
4.LoadLeveler for AIX 5L and Linux V3.2: Using and Administering, SA22-7881
5.PPC440x5 CPU Core User's Manual,
http://www-306.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_440_Embedded_Core

Más contenido relacionado

La actualidad más candente

OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroNumenta
 
Why study computer science
Why study computer scienceWhy study computer science
Why study computer scienceMassoud Bromand
 
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1DianaGray10
 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxColleen Farrelly
 
Computer Network MAC Layer Notes as per RGPV syllabus
Computer Network MAC Layer Notes as per RGPV syllabusComputer Network MAC Layer Notes as per RGPV syllabus
Computer Network MAC Layer Notes as per RGPV syllabusNANDINI SHARMA
 
Generative AI: Past, Present, and Future – A Practitioner's Perspective
Generative AI: Past, Present, and Future – A Practitioner's PerspectiveGenerative AI: Past, Present, and Future – A Practitioner's Perspective
Generative AI: Past, Present, and Future – A Practitioner's PerspectiveHuahai Yang
 
A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3Ishan Jain
 
Big data analysis using map/reduce
Big data analysis using map/reduceBig data analysis using map/reduce
Big data analysis using map/reduceRenuSuren
 
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY
GENERATIVE AI, THE FUTURE OF PRODUCTIVITYGENERATIVE AI, THE FUTURE OF PRODUCTIVITY
GENERATIVE AI, THE FUTURE OF PRODUCTIVITYAndre Muscat
 
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdfManaging-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdfQuantUniversity
 
Landscape of AI/ML in 2023
Landscape of AI/ML in 2023Landscape of AI/ML in 2023
Landscape of AI/ML in 2023HyunJoon Jung
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAnant Corporation
 
Differences Between Machine Learning Ml Artificial Intelligence AI And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence AI And Deep L...Differences Between Machine Learning Ml Artificial Intelligence AI And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence AI And Deep L...SlideTeam
 

La actualidad más candente (20)

Blue Brain ppt
Blue Brain pptBlue Brain ppt
Blue Brain ppt
 
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
 
Why study computer science
Why study computer scienceWhy study computer science
Why study computer science
 
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptx
 
Computer Network MAC Layer Notes as per RGPV syllabus
Computer Network MAC Layer Notes as per RGPV syllabusComputer Network MAC Layer Notes as per RGPV syllabus
Computer Network MAC Layer Notes as per RGPV syllabus
 
Generative AI: Past, Present, and Future – A Practitioner's Perspective
Generative AI: Past, Present, and Future – A Practitioner's PerspectiveGenerative AI: Past, Present, and Future – A Practitioner's Perspective
Generative AI: Past, Present, and Future – A Practitioner's Perspective
 
A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3
 
Big data analysis using map/reduce
Big data analysis using map/reduceBig data analysis using map/reduce
Big data analysis using map/reduce
 
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY
GENERATIVE AI, THE FUTURE OF PRODUCTIVITYGENERATIVE AI, THE FUTURE OF PRODUCTIVITY
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY
 
Big Data ppt
Big Data pptBig Data ppt
Big Data ppt
 
Intro to LLMs
Intro to LLMsIntro to LLMs
Intro to LLMs
 
Blue brain
Blue brainBlue brain
Blue brain
 
What is langchain
What is langchainWhat is langchain
What is langchain
 
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdfManaging-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
Managing-the-Risks-of-LLMs-in-FS-Industry-Roundtable-TruEra-QuantU.pdf
 
Best Ever PPT Of Bluebrain
Best Ever PPT Of BluebrainBest Ever PPT Of Bluebrain
Best Ever PPT Of Bluebrain
 
ChatGPT in Education
ChatGPT in EducationChatGPT in Education
ChatGPT in Education
 
Landscape of AI/ML in 2023
Landscape of AI/ML in 2023Landscape of AI/ML in 2023
Landscape of AI/ML in 2023
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
 
Differences Between Machine Learning Ml Artificial Intelligence AI And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence AI And Deep L...Differences Between Machine Learning Ml Artificial Intelligence AI And Deep L...
Differences Between Machine Learning Ml Artificial Intelligence AI And Deep L...
 

Similar a Blue Gene/L supercomputer seminar report

OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
OpenACC and Open Hackathons Monthly Highlights May  2023.pdfOpenACC and Open Hackathons Monthly Highlights May  2023.pdf
OpenACC and Open Hackathons Monthly Highlights May 2023.pdfOpenACC
 
High Performance Computing Infrastructure as a Key Enabler to Engineering Des...
High Performance Computing Infrastructure as a Key Enabler to Engineering Des...High Performance Computing Infrastructure as a Key Enabler to Engineering Des...
High Performance Computing Infrastructure as a Key Enabler to Engineering Des...NSEAkure
 
Adaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarAdaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarSuyog Potdar
 
Introducation of CPLDS and Design of Combinational circuit using CPLD
Introducation of CPLDS and Design of Combinational circuit using CPLDIntroducation of CPLDS and Design of Combinational circuit using CPLD
Introducation of CPLDS and Design of Combinational circuit using CPLDHemantChaurasia8
 
Implementing Saas as Cloud controllers using Mobile Agent based technology wi...
Implementing Saas as Cloud controllers using Mobile Agent based technology wi...Implementing Saas as Cloud controllers using Mobile Agent based technology wi...
Implementing Saas as Cloud controllers using Mobile Agent based technology wi...Sunil Rajput
 
Lambda data grid: communications architecture in support of grid computing
Lambda data grid: communications architecture in support of grid computingLambda data grid: communications architecture in support of grid computing
Lambda data grid: communications architecture in support of grid computingTal Lavian Ph.D.
 
Design of an IT Capstone Subject - Cloud Robotics
Design of an IT Capstone Subject - Cloud RoboticsDesign of an IT Capstone Subject - Cloud Robotics
Design of an IT Capstone Subject - Cloud RoboticsITIIIndustries
 
Design of an IT Capstone Subject - Cloud Robotics
Design of an IT Capstone Subject - Cloud RoboticsDesign of an IT Capstone Subject - Cloud Robotics
Design of an IT Capstone Subject - Cloud RoboticsITIIIndustries
 
Application of machine learning in chemical engineering: outlook and perspect...
Application of machine learning in chemical engineering: outlook and perspect...Application of machine learning in chemical engineering: outlook and perspect...
Application of machine learning in chemical engineering: outlook and perspect...IAESIJAI
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC
 
21st Century e-Knowledge Requires a High Performance e-Infrastructure
21st Century e-Knowledge Requires a High Performance e-Infrastructure21st Century e-Knowledge Requires a High Performance e-Infrastructure
21st Century e-Knowledge Requires a High Performance e-InfrastructureLarry Smarr
 
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...Munisekhar Gunapati
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Larry Smarr
 
Ramesh_11-05-2016
Ramesh_11-05-2016Ramesh_11-05-2016
Ramesh_11-05-2016studies
 
Arduino Based Collision Prevention Warning System
Arduino Based Collision Prevention Warning SystemArduino Based Collision Prevention Warning System
Arduino Based Collision Prevention Warning SystemMadhav Reddy Chintapalli
 
Ramprakash Resume
Ramprakash ResumeRamprakash Resume
Ramprakash ResumeRam Prakash
 

Similar a Blue Gene/L supercomputer seminar report (20)

blue gene report
blue gene  reportblue gene  report
blue gene report
 
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
OpenACC and Open Hackathons Monthly Highlights May  2023.pdfOpenACC and Open Hackathons Monthly Highlights May  2023.pdf
OpenACC and Open Hackathons Monthly Highlights May 2023.pdf
 
High Performance Computing Infrastructure as a Key Enabler to Engineering Des...
High Performance Computing Infrastructure as a Key Enabler to Engineering Des...High Performance Computing Infrastructure as a Key Enabler to Engineering Des...
High Performance Computing Infrastructure as a Key Enabler to Engineering Des...
 
RESUME_
RESUME_RESUME_
RESUME_
 
Adaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog PotdarAdaptive Computing Seminar Report - Suyog Potdar
Adaptive Computing Seminar Report - Suyog Potdar
 
Introducation of CPLDS and Design of Combinational circuit using CPLD
Introducation of CPLDS and Design of Combinational circuit using CPLDIntroducation of CPLDS and Design of Combinational circuit using CPLD
Introducation of CPLDS and Design of Combinational circuit using CPLD
 
Implementing Saas as Cloud controllers using Mobile Agent based technology wi...
Implementing Saas as Cloud controllers using Mobile Agent based technology wi...Implementing Saas as Cloud controllers using Mobile Agent based technology wi...
Implementing Saas as Cloud controllers using Mobile Agent based technology wi...
 
Lambda data grid: communications architecture in support of grid computing
Lambda data grid: communications architecture in support of grid computingLambda data grid: communications architecture in support of grid computing
Lambda data grid: communications architecture in support of grid computing
 
kiran edited
kiran editedkiran edited
kiran edited
 
Design of an IT Capstone Subject - Cloud Robotics
Design of an IT Capstone Subject - Cloud RoboticsDesign of an IT Capstone Subject - Cloud Robotics
Design of an IT Capstone Subject - Cloud Robotics
 
Design of an IT Capstone Subject - Cloud Robotics
Design of an IT Capstone Subject - Cloud RoboticsDesign of an IT Capstone Subject - Cloud Robotics
Design of an IT Capstone Subject - Cloud Robotics
 
Application of machine learning in chemical engineering: outlook and perspect...
Application of machine learning in chemical engineering: outlook and perspect...Application of machine learning in chemical engineering: outlook and perspect...
Application of machine learning in chemical engineering: outlook and perspect...
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019
 
21st Century e-Knowledge Requires a High Performance e-Infrastructure
21st Century e-Knowledge Requires a High Performance e-Infrastructure21st Century e-Knowledge Requires a High Performance e-Infrastructure
21st Century e-Knowledge Requires a High Performance e-Infrastructure
 
Bluegene
BluegeneBluegene
Bluegene
 
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...
LOAD BALANCED CLUSTERING WITH MIMO UPLOADING TECHNIQUE FOR MOBILE DATA GATHER...
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
Ramesh_11-05-2016
Ramesh_11-05-2016Ramesh_11-05-2016
Ramesh_11-05-2016
 
Arduino Based Collision Prevention Warning System
Arduino Based Collision Prevention Warning SystemArduino Based Collision Prevention Warning System
Arduino Based Collision Prevention Warning System
 
Ramprakash Resume
Ramprakash ResumeRamprakash Resume
Ramprakash Resume
 

Blue Gene/L supercomputer seminar report

In November 2001, IBM announced a partnership with Lawrence Livermore National Laboratory to build the Blue Gene/L (BG/L) supercomputer, a 65,536-node machine designed around embedded PowerPC processors. Through the use of system-on-a-chip integration coupled with a highly scalable cellular architecture, Blue Gene/L will deliver 180 or 360 teraFLOPS of peak computing power, depending on the utilization mode. Blue Gene/L represents a new level of scalability for parallel systems. Whereas existing large-scale systems range in size from hundreds to a few thousand compute nodes, Blue Gene/L makes a jump of almost two orders of magnitude. It is reasonably clear that such machines, in the near future at least, will require a departure from the architectures of current parallel supercomputers, which use a few thousand commodity microprocessors. With current technology, it would take around a million microprocessors to achieve petaFLOPS performance; clearly, power requirements and cost considerations alone preclude this option. With a cellular design of this kind, petaFLOPS performance should be reached within the next two to three years, especially since IBM has announced the Blue Gene project aimed at building such a machine.
Index

Chapter 1  Introduction
Chapter 2  Detailed report
Chapter 3  Applications
Chapter 4  Advantages
Chapter 5  Disadvantages
Chapter 6  Conclusions
References
1. Introduction

Blue Gene is a computer architecture project intended to produce several supercomputers designed to reach operating speeds in the PFLOPS (petaFLOPS) range, currently reaching sustained speeds of nearly 500 TFLOPS (teraFLOPS). It is a cooperative project among IBM (particularly IBM Rochester and the Thomas J. Watson Research Center), the Lawrence Livermore National Laboratory, the United States Department of Energy (which is partially funding the project), and academia. There are four Blue Gene projects in development: Blue Gene/L, Blue Gene/C, Blue Gene/P, and Blue Gene/Q. The project was awarded the National Medal of Technology and Innovation on September 18, 2009; U.S. President Barack Obama bestowed the award on October 7, 2009.[1]

2. DESCRIPTION

The first computer in the Blue Gene series, Blue Gene/L, developed through a partnership with Lawrence Livermore National Laboratory (LLNL), originally had a theoretical peak performance of 360 TFLOPS and scored over 280 TFLOPS sustained on the Linpack benchmark. After an upgrade in 2007, the performance increased to 478 TFLOPS sustained and 596 TFLOPS peak. The term Blue Gene/L sometimes refers to the computer installed at LLNL and sometimes to the architecture of that computer. As of November 2006, there were 27 computers on the Top500 list using the Blue Gene/L architecture, all of them listed as having an architecture of "eServer Blue Gene Solution".

In December 1999, IBM announced a $100 million research initiative for a five-year effort to build a massively parallel computer, to be applied to the study of biomolecular phenomena such as protein folding. The project has two main goals: to advance our understanding of the mechanisms behind protein folding via large-scale simulation, and to explore novel ideas in massively parallel machine architecture and software. This project should enable biomolecular simulations that are orders of magnitude larger than current technology permits. Major areas of investigation include how to use this novel platform to effectively meet its scientific goals, how to make such massively parallel machines more usable, and how to achieve performance targets at a reasonable cost through novel machine architectures. The design builds largely on the earlier QCDSP and QCDOC supercomputers. In November 2001, Lawrence Livermore National Laboratory joined IBM as a research partner for Blue Gene.

On September 29, 2004, IBM announced that a Blue Gene/L prototype at IBM Rochester (Minnesota) had overtaken NEC's Earth Simulator as the fastest computer in the world, with a speed of 36.01 TFLOPS on the Linpack benchmark, beating Earth Simulator's 35.86 TFLOPS. This was achieved with an 8-cabinet system, each cabinet holding 1,024 compute nodes. Upon doubling this configuration to 16 cabinets, the machine reached a speed of 70.72 TFLOPS by November 2004, taking first place on the Top500 list.

On March 24, 2005, the US Department of Energy announced that the Blue Gene/L installation at LLNL had broken its speed record, reaching 135.5 TFLOPS. This was made possible by doubling the number of cabinets to 32. On the Top500 list,[2] Blue Gene/L installations across several sites worldwide took 3 of the top 10 positions and 13 of the top 64. Three racks of Blue Gene/L are housed at the San Diego Supercomputer Center and are available for academic research.

On October 27, 2005, LLNL and IBM announced that Blue Gene/L had once again broken its speed record, reaching 280.6 TFLOPS on Linpack upon reaching its final configuration of 65,536 compute nodes (that is, 2^16 nodes) and an additional 1,024 I/O nodes in 64 air-cooled cabinets. The LLNL Blue Gene/L uses Lustre to access multiple file systems in the 600 TB-1 PB range.[3] Blue Gene/L is also the first supercomputer ever to sustain over 100 TFLOPS on a real-world application, namely a three-dimensional molecular dynamics code (ddcMD) simulating the solidification (nucleation and growth processes) of molten metal under high pressure and temperature conditions. This achievement won the 2005 Gordon Bell Prize.

On June 22, 2006, NNSA and IBM announced that Blue Gene/L had achieved 207.3 TFLOPS on a quantum chemical application (Qbox).[4] On November 14, 2006, at Supercomputing 2006,[5] Blue Gene/L was awarded the winning prize in all HPC Challenge award classes.[6] On April 27, 2007, a team from the IBM Almaden Research Center and the University of Nevada ran an artificial neural network almost half as complex as the brain of a mouse for the equivalent of a second (the network was run at 1/10 of normal speed for 10 seconds).[7]

In November 2007, the LLNL Blue Gene/L remained at the number one spot as the world's fastest supercomputer. It had been upgraded since the previous measurement and was then almost three times as fast as the second fastest, a Blue Gene/P system. On June 18, 2008, the new Top500 list marked the first time a Blue Gene system was not the leader in the Top500 since it had assumed that position, being topped by IBM's Cell-based Roadrunner system, which was at the time the only system to surpass the petaflops mark. Top500 announced that the Cray XT5 Jaguar, housed at the Oak Ridge Leadership Computing Facility (OLCF), is currently the fastest supercomputer in the world for open science.

Major features

The Blue Gene/L supercomputer is unique in the following aspects:
1. Trading processor speed for lower power consumption.
2. Dual processors per node with two working modes: coprocessor mode (one user process per node; computation and communication work is shared by the two processors) and virtual node mode (two user processes per node).
3. System-on-a-chip design.
4. A large number of nodes (scalable in increments of 1,024 up to at least 65,536).
5. Three-dimensional torus interconnect with auxiliary networks for global communications, I/O, and management.
6. A lightweight OS per node for minimum system overhead (computational noise).
Architecture

[Figures: one Blue Gene/L node board, and a schematic overview of a Blue Gene/L supercomputer]

Each compute or I/O node is a single ASIC with associated DRAM memory chips. The ASIC integrates two 700 MHz PowerPC 440 embedded processors, each with a double-pipeline, double-precision floating-point unit (FPU), a cache subsystem with a built-in DRAM controller, and the logic to support multiple communication subsystems. The dual FPUs give each Blue Gene/L node a theoretical peak performance of 5.6 GFLOPS (gigaFLOPS). Node CPUs are not cache coherent with one another. Compute nodes are packaged two per compute card, with 16 compute cards plus up to 2 I/O nodes per node board, and 32 node boards per cabinet/rack.[9]

By integrating all essential subsystems on a single chip, each compute or I/O node dissipates little power (about 17 watts, including DRAMs). This allows very aggressive packaging of up to 1,024 compute nodes plus additional I/O nodes in a standard 19-inch cabinet, within reasonable limits of electrical power supply and air cooling. The resulting FLOPS per watt, FLOPS per square metre of floor space, and FLOPS per unit cost allow scaling up to very high performance.

Each Blue Gene/L node is attached to three parallel communications networks: a 3D toroidal network for peer-to-peer communication between compute nodes, a collective network for collective communication, and a global interrupt network for fast barriers. The I/O nodes, which run the Linux operating system, provide communication with the outside world via an Ethernet network. The I/O nodes also handle file system operations on behalf of the compute nodes.
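These peak figures follow directly from the clock rate and the FPU width. The following back-of-envelope check is a sketch, assuming (as is conventional for this kind of FPU) that each of the two pipes per core retires one fused multiply-add, that is two floating-point operations, per cycle:

    #include <stdio.h>

    int main(void)
    {
        const double clock_hz        = 700e6;  /* PowerPC 440 clock rate                  */
        const int    cores_per_node  = 2;      /* two embedded PPC440 cores per ASIC      */
        const int    flops_per_cycle = 4;      /* per core: 2 FPU pipes x 1 FMA = 4 flops */

        double node_gflops = clock_hz * cores_per_node * flops_per_cycle / 1e9;
        double rack_tflops = node_gflops * 1024.0  / 1e3;   /* 1,024 compute nodes per rack */
        double sys_tflops  = node_gflops * 65536.0 / 1e3;   /* full 65,536-node system      */

        printf("per node   : %.1f GFLOPS\n", node_gflops);  /* 5.6 GFLOPS       */
        printf("per rack   : %.1f TFLOPS\n", rack_tflops);  /* about 5.7 TFLOPS */
        printf("full system: %.0f TFLOPS\n", sys_tflops);   /* about 367 TFLOPS */
        return 0;
    }

The result for the full 65,536-node system, roughly 367 TFLOPS, matches the 360 TFLOPS peak quoted for the machine.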
Finally, a separate, private Ethernet network provides access to any node for configuration, booting, and diagnostics.

Blue Gene/L compute nodes use a minimal operating system supporting a single user program. Only a subset of POSIX calls is supported, and only one process may run at a time; programmers need to implement green threads to simulate local concurrency. Application development is usually performed in C, C++, or Fortran using MPI for communication, although some scripting languages such as Ruby have been ported to the compute nodes.[10]

To allow multiple programs to run concurrently, a Blue Gene/L system can be partitioned into electronically isolated sets of nodes. The number of nodes in a partition must be a positive integer power of 2 and must be at least 2^5 = 32 nodes; the maximum partition is all nodes in the computer. To run a program on Blue Gene/L, a partition of the computer must first be reserved. The program is then run on all the nodes within the partition, and no other program may access nodes within the partition while it is in use. Upon completion, the partition's nodes are released for future programs to use.

With so many nodes, component failures are inevitable. The system is able to electrically isolate faulty hardware so that the machine can continue to run.
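As a quick illustration of the partition-size rule described above (a power of two, and at least 2^5 = 32 nodes), here is a small check of the kind a job-submission script might perform; the function is ours for illustration and is not part of any Blue Gene API:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* A partition must contain a power-of-two number of nodes, and at least 32. */
    static bool valid_partition_size(unsigned long nodes)
    {
        return nodes >= 32 && (nodes & (nodes - 1)) == 0;
    }

    int main(void)
    {
        unsigned long sizes[] = { 16, 32, 512, 1000, 1024, 65536 };
        for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; ++i)
            printf("%6lu nodes: %s\n", sizes[i],
                   valid_partition_size(sizes[i]) ? "valid" : "invalid");
        return 0;
    }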
OPERATING SYSTEMS

Front-end nodes are commodity PCs running Linux; I/O nodes run a customized Linux kernel; compute nodes use an extremely lightweight custom kernel; and the service node is a single multiprocessor machine running a custom OS.

COMPUTE NODE KERNEL
Single-user and dual-threaded, with a flat address space and no paging; physical resources are memory-mapped; provides standard POSIX functionality (mostly). Two execution modes:
1. Virtual node mode
2. Coprocessor mode

SERVICE NODE OS
Core Management and Control System (CMCS), BG/L's "global" operating system, consisting of:
MMCS - Midplane Monitoring and Control System
CIOMAN - Control and I/O Manager
DB2 relational database

Programming modes

This section provides information about the way in which the Message Passing Interface (MPI) is implemented and used on Blue Gene/L. There are two main modes in which you can use Blue Gene/L:
1. Communication Coprocessor Mode
2. Virtual Node Mode

1. Communication Coprocessor Mode
In the default mode of operation of Blue Gene/L, named Communication Coprocessor Mode, each physical compute node executes a single compute process. The Blue Gene/L system software treats the two processors in a compute node asymmetrically. One of the processors (CPU 0) behaves as the main processor, running the main thread of the compute process. The other processor (CPU 1) behaves as an offload engine (coprocessor) that only executes specific operations. The coprocessor is used primarily for offloading communication functions; it can also be used for running application-level coroutines.

2. Virtual Node Mode
The Compute Node Kernel also supports a Virtual Node Mode of operation. In that mode, the kernel runs two separate processes in each compute node, and node resources (primarily the memory and the torus network) are shared by both processes. In Virtual Node Mode, an application can use both processors in a node simply by doubling its number of MPI tasks, without explicitly handling cache coherence issues.

The now distinct MPI tasks running on the two CPUs of a compute node have to communicate with each other. This problem was solved by implementing a virtual torus device, serviced by a virtual packet layer, in the scratchpad memory. In Virtual Node Mode, the two cores of a compute node act as different processes, each with its own rank in the message layer. The message layer supports Virtual Node Mode by providing a correct torus-to-rank mapping and first-in, first-out (FIFO) pinning in this mode. The hardware FIFOs are shared equally between the processes, and torus coordinates are expressed by quadruplets instead of triplets. In Virtual Node Mode, communication between the two processors in a compute node cannot be done over the network hardware; instead, it is done via a region of memory, called the scratchpad, to which both processors have access. Virtual FIFOs make portions of the scratchpad look like a send FIFO to one of the processors and a receive FIFO to the other, and access to the virtual FIFOs is mediated with help from the hardware lockboxes.

From an application perspective, virtual nodes behave like physical nodes, but with less memory. Each virtual node executes one compute process. Processes in different virtual nodes, even those allocated in the same compute node, communicate only through messages, and processes running in Virtual Node Mode cannot invoke coroutines. The Blue Gene/L MPI implementation supports Virtual Node Mode by sharing the communications resources of a physical compute node between the two compute processes that execute on that node. The low-level communications library of Blue Gene/L, the message layer, virtualizes these communications resources into logical units that each process can use independently.
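The same MPI source runs in either mode; only the number of tasks per node changes (one task per node in Communication Coprocessor Mode, two in Virtual Node Mode). A minimal sketch in C, assuming a standard MPI installation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* On a 512-node partition this reports 512 tasks in coprocessor mode
           and 1,024 tasks in virtual node mode; the source is unchanged. */
        if (rank == 0)
            printf("running with %d MPI tasks\n", size);

        printf("hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

The mode is selected when the job is launched rather than in the application source, so the same program can be run both ways.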
Deciding which mode to use

Whether you choose Communication Coprocessor Mode or Virtual Node Mode depends largely on the type of application you plan to execute. I/O-intensive tasks that require a relatively large amount of data interchange between compute nodes benefit more from Communication Coprocessor Mode. Applications that are primarily CPU-bound and do not have large working-memory requirements (each process gets only half of the node memory) run more quickly in Virtual Node Mode.

System calls supported by the Compute Node Kernel

This section discusses the system calls (syscalls) supported by the Compute Node Kernel. It is important to understand which functions can be called, and perhaps more importantly, which ones cannot be called, by an application running on Blue Gene/L.

Introduction to the Compute Node Kernel

The role of the kernel on the compute node is to create a "Linux-like" environment for the execution of a user process. It is not a full Linux kernel implementation, but rather implements a subset of POSIX functionality. The Compute Node Kernel is a single-process operating system. It is designed to provide the services needed by the applications expected to run on Blue Gene/L, but not for all applications; it is not intended to run system administration functions from the compute node. To achieve the best reliability, a small and simple kernel is a design goal, which also enables a simpler checkpoint function. The compute node application never runs as the root user; in fact, it runs as the same user (uid) and group (gid) under which the job was submitted.

System calls

The Compute Node Kernel system calls are subdivided into the following categories:
1. File I/O
2. Directory operations
3. Time
4. Process information
5. Signals
6. Miscellaneous
7. Sockets
8. Compute Node Kernel
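Because only a subset of POSIX is available, compute-node code typically limits itself to plain calls such as those in the sketch below, which fall into the file I/O, time, and process-information categories above and are function-shipped to the I/O node. The output path is invented for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* File I/O: shipped by the Compute Node Kernel to the I/O node. */
        const char msg[] = "result from one compute task\n";
        int fd = open("/gpfs/scratch/example.out",          /* hypothetical path */
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, msg, strlen(msg)) < 0)
            perror("write");
        close(fd);

        /* Process information is also among the supported categories. */
        printf("pid %d wrote %zu bytes\n", (int)getpid(), strlen(msg));
        return 0;
    }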
Additional Compute Node Kernel application support

This section provides details about additional support provided to application developers by the Compute Node Kernel.

Allocating memory regions with specific L1 cache attributes

Each PowerPC 440 core on a compute node has a 32 KB L1 data cache that is 64-way set associative and uses a 32-byte cache line. A load or store operation to a virtual address is translated by the hardware to a real address using the Translation Lookaside Buffer (TLB), and the real address is used to select the "set." If the address is available in the data cache, it is returned to the processor without needing to access lower levels of the memory subsystem. If the address is not in the data cache (a cache miss), a cache line is evicted from the data cache and replaced with the cache line containing the address; the "way" to evict is selected using a round-robin algorithm. The L1 data cache can be divided into two regions, a normal region and a transient region, and the number of "ways" to use for each region can be configured by the application.

The Blue Gene/L memory subsystem supports the following L1 data cache attributes:
1. Cache-inhibited or cached. Memory with the cache-inhibited attribute causes all load and store operations to access the data from lower levels of the memory subsystem. Memory with the cached attribute may use the data cache for load and store operations. The default attribute for application memory is cached.
2. Store without allocate (SWOA) or store with allocate (SWA). Memory with the SWOA attribute bypasses the L1 data cache on a cache miss for a store operation, and the data is stored directly to lower levels of the memory subsystem. Memory with the SWA attribute allocates a line in the L1 data cache when there is a cache miss for a store operation on the memory. The default attribute for application memory is SWA.
3. Write-through or write-back. Memory with the write-through attribute is written through to the lower levels of the memory subsystem for store operations; if the memory also exists in the L1 data cache, it is written to the data cache and the cache line is marked as clean. Memory with the write-back attribute is written to the L1 data cache, and the cache line is marked as dirty.
4. Transient or normal. Memory with the transient attribute uses the transient region of the L1 data cache; memory with the normal attribute uses the normal region. By default, the L1 data cache is configured without a transient region and all application memory uses the normal region.

Checkpoint and restart support

Why use checkpoint and restart?
Given the scale of the Blue Gene/L system, faults are expected to be the norm rather than the exception; this is unfortunately inevitable given the vast number of individual hardware processors and other components involved in running the system. Checkpoint and restart are among the primary techniques for fault recovery. A special user-level checkpoint library has been developed for Blue Gene/L applications. Using this library, application programs can take a checkpoint of their program state at appropriate stages and can be restarted later from their last successful checkpoint.

Why should you be interested in this support? Numerous scenarios indicate that its use is warranted; here are two examples:
1. Your application is a long-running one. You do not want it to fail a long time into a run, losing all the calculations made up until the failure. Checkpoint and restart allow you to restart the application from the last checkpoint position, losing a much smaller slice of processing time.
2. You are given access to a Blue Gene/L system for relatively small increments of time, and you know that your application run will take longer than your allotted amount of processing time. Checkpoint and restart allow you to execute your application to completion in distinct "chunks," rather than in one continuous period of time.
These are just two of many reasons to use checkpoint and restart support in your Blue Gene/L applications.

Input/output considerations

All the external I/O calls made from a program are shipped to the corresponding I/O node using a function-shipping procedure implemented in the Compute Node Kernel. The checkpoint library intercepts calls to the five main file I/O functions: open, close, read, write, and lseek. The function name open is a weak alias that maps to the _libc_open function; the checkpoint library intercepts this call and provides its own implementation of open that internally uses _libc_open. The library maintains a file state table that stores the file name, current file position, and mode of all the files that are currently open. The table also maintains a translation from the file descriptors used by the Compute Node Kernel to another set of file descriptors used by the application.
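The interception just described (a replacement open that forwards to the underlying implementation and records the file in a state table) is a standard interposition technique. The sketch below illustrates the same idea in portable form using dlsym(RTLD_NEXT, ...) on Linux; it is a generic illustration, not the IBM checkpoint library, and the table layout and limits are invented for the example:

    /* chkpt_open.c - build as a shared object and preload it:
     *   gcc -shared -fPIC chkpt_open.c -o chkpt_open.so -ldl
     *   LD_PRELOAD=./chkpt_open.so ./app
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    /* A tiny stand-in for the library's file state table. */
    static struct { char name[256]; int fd; } file_state[64];
    static int n_open;

    int open(const char *path, int flags, ...)
    {
        /* Look up the real open() the way an interposing library would. */
        int (*real_open)(const char *, int, ...) =
            (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {            /* mode argument only present with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }

        int fd = real_open(path, flags, mode);

        /* Record the file so a later checkpoint could re-open and reposition it. */
        if (fd >= 0 && n_open < 64) {
            snprintf(file_state[n_open].name, sizeof file_state[n_open].name, "%s", path);
            file_state[n_open].fd = fd;
            n_open++;
        }
        return fd;
    }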
While taking a checkpoint, the file state table is also stored in the checkpoint file. Upon a restart, these tables are read, the corresponding files are opened in the required mode, and the file pointers are positioned at the locations given in the checkpoint file. The current design assumes that programs either always read files or write them sequentially; a read followed by an overlapping write, or a write followed by an overlapping read, is not supported.

Signal considerations

Applications can register handlers for signals using the signal() function call. The checkpoint library intercepts calls to signal() and installs its own signal handler instead. It also updates a signal-state table that stores the address of the signal handler function (sighandler) registered for each signal (signum). When a signal is raised, the checkpoint signal handler calls the appropriate application handler given in the signal-state table. While taking checkpoints, the signal-state table is also stored in the checkpoint file, in its signal-state section. At the time of restart, the signal-state table is read and the checkpoint signal handler is installed for all the signals listed in it; the checkpoint handler then calls the required application handlers when needed.

Signals during checkpoint

The application can receive signals while a checkpoint is in progress. If the application signal handlers are called while a checkpoint is in progress, they can change the state of the memory being checkpointed, which might make the checkpoint inconsistent; therefore, signals arriving while a checkpoint is in progress need to be handled carefully. For certain signals, such as SIGKILL and SIGSTOP, the action is fixed and the application terminates without much choice. Signals without any registered handler are simply ignored. For signals with installed handlers, there are two choices:
1. Deliver the signal immediately.
2. Postpone the signal delivery until the checkpoint is complete.
All signals are classified into one of these two categories (shown as Table 7-1 in the IBM documentation). If a signal must be delivered immediately, the memory state of the application might change, making the current checkpoint file inconsistent; in that case the current checkpoint must be aborted. The checkpoint routine periodically checks whether a signal has been delivered since the current checkpoint began; if one has, it aborts the current checkpoint and returns to the application.

Checkpoint API

The checkpoint interface consists of the following items:
1. A set of library functions that are used by the application developer to "checkpoint enable" the application
2. A set of conventions used to name and store the checkpoint files
3. A set of environment variables used to communicate with the application
The following sections describe each of these components in detail.

Restart

A transparent restart mechanism is provided through the use of the BGLCheckpointInit() function and the BGL_CHKPT_RESTART_SEQNO environment variable. Upon startup, an application is expected to call BGLCheckpointInit(). The BGLCheckpointInit() function initializes the checkpoint library data structures and checks for the environment variable BGL_CHKPT_RESTART_SEQNO. If the variable is not set, a fresh job launch is assumed and the function returns normally. If the variable is set to zero, the individual processes restart from their latest consistent global checkpoint. If the variable is set to a positive integer, the application is started from the specified checkpoint sequence number.

Checkpoint and restart functionality

It is often desirable to enable or disable the checkpoint functionality at the time of job launch. Application developers are not required to provide two versions of their programs, one with checkpoint enabled and another with it disabled; environment variables are used to transparently enable and disable the checkpoint and restart functionality. The checkpoint library calls check for the environment variable BGL_CHKPT_ENABLED; the checkpoint functionality is invoked only if this environment variable is set to a value of "1".
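Putting the environment variables and BGLCheckpointInit() together, application start-up under this scheme might look like the sketch below. Only BGLCheckpointInit() and the two variable names appear in this report; the BGLCheckpoint() call and both prototypes are our assumptions for illustration, so the sketch supplies stub definitions to stay self-contained:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stand-in stubs so the sketch compiles on its own; on Blue Gene/L these
       would come from the checkpoint library, and the real signatures may differ. */
    static void BGLCheckpointInit(void) { /* restores state if a restart is requested */ }
    static void BGLCheckpoint(void)     { /* writes program state to a checkpoint file */ }

    int main(void)
    {
        /* Handles both fresh launches and restarts: the library inspects
           BGL_CHKPT_RESTART_SEQNO and restores state when it is set. */
        BGLCheckpointInit();

        const char *enabled = getenv("BGL_CHKPT_ENABLED");
        int checkpointing = (enabled && strcmp(enabled, "1") == 0);

        for (int step = 0; step < 1000; ++step) {
            /* ... one unit of the long-running computation ... */

            if (checkpointing && step % 100 == 0)
                BGLCheckpoint();   /* take a checkpoint at a consistent point */
        }
        return 0;
    }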
High Throughput Computing on Blue Gene/L

The original focus of the Blue Gene project was to create a high-performance computer with a small footprint that consumed a relatively small amount of power. The model of running parallel applications (typically using the Message Passing Interface (MPI)) on Blue Gene/L is known as High Performance Computing (HPC). Recent research has shown that Blue Gene/L also provides a good platform for High Throughput Computing (HTC).

3. Applications: MPI or HTC

As previously stated, the Blue Gene architecture targeted MPI applications for optimal execution. These applications are characterized as Single Instruction Multiple Data (SIMD) with synchronized communication and execution; the tasks in an MPI program cooperate to solve a single problem. Because of this close cooperation, a failure in a single node, whether software or hardware, requires the termination of all nodes.

HTC applications have different characteristics. The code that executes on each node is independent of work being done on other nodes, and communication between nodes is not required; at any given time, each node is solving its own problem. As a result, a failure on a single node, software or hardware, does not necessitate the termination of all nodes.

Initially, in order to run these applications on a Blue Gene supercomputer, a port to the MPI programming model was required. This was done successfully for several applications, but was not an optimal solution in some cases. Some MPI applications may benefit from being ported to the HTC model. In particular, some embarrassingly parallel MPI applications may be good candidates for HTC mode because they do not require communication between the nodes, and the failure of one node does not invalidate the work being done on other nodes.

A key advantage of the MPI model is a reduction of extraneous bookkeeping by the application: an MPI program is coded to handle data distribution, to minimize I/O by having all I/O done by one node (typically rank 0), and to distribute the work to the other MPI ranks. When running in HTC mode, the application data needs to be split up and distributed to the nodes manually. This porting effort may be justified to achieve better application reliability and throughput than could be achieved with an MPI model.

HTC mode

In HTC mode, the compute nodes in a partition run independent programs that do not communicate with each other. A launcher program (running on a compute node) requests work from a dispatcher that runs on a remote system connected to the functional network. Based on information provided by the dispatcher, the launcher program spawns a worker program that performs some task.

When running in HTC mode, there are two basic operational differences from the default HPC mode. First, after the worker program completes, the launcher program is reloaded so it can handle additional work requests. Second, if a compute node encounters a soft error, such as a parity error, the entire partition is not terminated; rather, the control system attempts to reset the single compute node while the other compute nodes continue to operate. The control system polls the hardware at a regular interval looking for compute nodes in a reset state. If a failed node is discovered and the node failure is not due to a network hardware error, a software reboot is attempted to recover the compute node.

On a remote system connected to the functional network, which could be a Service Node or a Front End Node, a dispatcher manages a queue of jobs to be run. There is a client/server relationship between the launcher program and the dispatcher program. After the launcher program is started on a compute node, it connects back to the dispatcher and indicates that it is ready to receive work requests. When the dispatcher has a job for the launcher, it responds to the work request by sending the launcher program the job-related information, such as the name of the worker program executable, arguments, and environment variables. The launcher program then spawns the worker program. When the worker program completes, the launcher is reloaded and repeats the process of requesting work from the dispatcher.

Launcher program

In the default HPC mode, when a program ends on the compute node, the Compute Node Kernel sends a message to the I/O node that reports how the node ended; the message indicates whether the program ended normally or by a signal, and the exit value or signal number, respectively. The I/O node then forwards the message to the control system. When the control system has received messages for all of the compute nodes, it ends the job.

In HTC mode, the Compute Node Kernel handles a program ending differently depending on which program ended. The Compute Node Kernel records the path of the program that is first submitted with the job, which is the launcher program. When a program other than the launcher program (that is, the worker program) ends, the Compute Node Kernel records the exit status of the worker program and then reloads and restarts the launcher program. If the worker program ended by a signal, the Compute Node Kernel generates an event to record the signal number that ended the program; if the worker program ended normally, no information is logged. The launcher program can retrieve the exit status of the worker program using a Compute Node Kernel system call.

Launcher run sequence

Since no message is sent to the control system indicating that a program ended, the job continues running; the effect is to have a continually running program on the compute nodes. To reduce the load on the file system, the launcher program is cached in memory on the I/O node. When the Compute Node Kernel requests to reload the launcher program, it does not need to be read from the file system but can be sent directly to the compute node from memory; since the launcher is typically a small executable, caching it does not require much additional memory. When the launcher program itself ends, the Compute Node Kernel reports that a program ended to the control system, as it does in HPC mode; this allows the launcher program to cleanly end on the compute node and the control system to end the job.

Template for an asynchronous task dispatch subsystem

This section gives a high-level overview of one possible way to implement an asynchronous task dispatch subsystem, covering the client, the dispatcher program, and the launcher program in our example.

[Figure: asynchronous task dispatch subsystem model]

Clients
The client implements a task submission thread that publishes task submission messages onto a work queue. It also implements a task verification thread that listens for task completion messages. When the dispatch system informs the client that a task has terminated, and optionally supplies an exit status, the client is responsible for resolving the task's completion status and for taking actions, including relaunching tasks. Keep in mind that the client is ultimately responsible for ensuring successful completion of the job.

Message queue
The design is based on publishing and subscribing to message queues. Clients publish task submission messages onto a single work queue; dispatcher programs subscribe to the work queue and process task submission messages; clients subscribe to task completion messages. Messages consist of text data comprising the work to be performed, a job identifier, a task identifier, and a message type. Job identifiers are generated by the client process and are required to be globally unique; task identifiers are unique within the job session. The message type field of task submission messages is used by the dispatcher to distinguish work of high priority from work of normal priority. Responsibility for reliable message delivery belongs to the message queueing system.

Dispatcher program
The dispatcher program first pulls a task submission message off the work queue. It then waits on a socket for a launcher connection and reads the launcher ID from the socket. It writes the task into the socket, and the association between task and launcher is stored in a table; the table stores the last task dispatched to each launcher program. A subsequent connection from that launcher is an indication that the last task has completed and that the task completion message can be published back to the client. Figure 12-3 shows the entire cycle of a job submitted in HTC mode.

[Figure 12-3: HTC job cycle]

The intention of this design is to optimize the launcher program. The dispatcher program spends little time between connect and dispatch, so latency volatility is mainly due to the time spent waiting for dispatcher connections. After rebooting, the launcher program connects to the dispatcher program and passes the completion information back to it. To assist task status resolution, the Compute Node Kernel stores the exit status of the last running process in a buffer; after the launcher program restarts, the contents of this buffer can be written to the dispatcher and stored in the task completion message.

Launcher program
The launcher program is intentionally kept simple. Arguments to the launcher program describe a socket connection to the dispatcher. When the launcher program starts, it connects to this socket, writes its identity into the socket, and waits for a task message. Upon receipt of the task message, the launcher parses the message and calls the execve system call to execute the task. When the task exits (for any reason), the Compute Node Kernel restarts the launcher program again. The launcher program is not a container for the application; therefore, regardless of what happens to the application, the launcher program will not fail to restart.
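A toy version of the launcher flow just described: connect to the dispatcher, announce an identity, read a single task message, and exec the worker; when the worker exits, the Compute Node Kernel reloads the launcher. The command-line convention and the newline-terminated, space-separated wire format are assumptions made for this sketch; the report does not specify them:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    extern char **environ;

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <dispatcher-ip> <port> <launcher-id>\n", argv[0]);
            return 1;
        }

        /* Connect to the dispatcher described by the launcher's arguments. */
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port   = htons(atoi(argv[2])) };
        if (sock < 0 || inet_pton(AF_INET, argv[1], &addr.sin_addr) != 1 ||
            connect(sock, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("connect");
            return 1;
        }

        /* Announce our identity, then wait for one task message. */
        char hello[128];
        int len = snprintf(hello, sizeof hello, "%s\n", argv[3]);
        if (len < 0 || write(sock, hello, (size_t)len) < 0)
            return 1;

        char task[4096];
        ssize_t n = read(sock, task, sizeof task - 1);
        if (n <= 0)
            return 1;
        task[n] = '\0';
        task[strcspn(task, "\n")] = '\0';

        /* Split "prog arg1 arg2 ..." into an argv array and exec the worker.
           When the worker exits, the Compute Node Kernel reloads this launcher. */
        char *task_argv[64];
        int   i = 0;
        for (char *tok = strtok(task, " "); tok && i < 63; tok = strtok(NULL, " "))
            task_argv[i++] = tok;
        task_argv[i] = NULL;
        if (i == 0)
            return 1;

        execve(task_argv[0], task_argv, environ);
        perror("execve");   /* only reached if the exec itself failed */
        return 1;
    }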
I/O considerations

Running in HTC mode changes the I/O patterns. When a program reads and writes to the file system, it typically does so with small buffer sizes; while the administrator can configure the buffer size, the most common sizes are 256 KB or 512 KB. When running in HTC mode, loading the worker program requires reading the complete executable into memory and sending it to a compute node. An executable is at least several megabytes and can be many megabytes, so to achieve the fastest I/O throughput, a low compute-node-to-I/O-node ratio is best.

4. ADVANTAGES

1. Scalable.
2. Requires less space (about half of a tennis court).
3. Avoids the heat problems most supercomputers face.
4. Speed.

5. DISADVANTAGES

Important safety notices. Here are some important general comments about the Blue Gene/L system regarding safety:
1. This equipment must be installed by trained service personnel in a restricted-access location as defined by the NEC (U.S. National Electric Code) and IEC 60950, the Standard for Safety of Information Technology Equipment. (C033)
2. The doors and covers to the product are to be closed at all times except for service by trained service personnel. All covers must be replaced and doors locked at the conclusion of the service operation. (C013)
3. Servicing of this product or unit is to be performed by trained service personnel only. (C032)
4. This product is equipped with a four-wire (three-phase and ground) power cable. Use this power cable with a properly grounded electrical outlet to avoid electrical shock. (C019)
5. To prevent a possible electric shock from touching two surfaces with different protective ground (earth), use one hand, when possible, to connect or disconnect signal cables.
6. An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the system or the devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock. (D004)

6. CONCLUSIONS

1. BG/L shows that a cellular architecture for supercomputers is feasible.
2. It delivers higher performance with a much smaller size and power requirement.
3. In theory, there are no limits to the scalability of a Blue Gene system.

7. REFERENCES

1. IBM Redbooks publications
2. General Parallel File System (GPFS) for Clusters: Concepts, Planning, and Installation, GA22-7968
3. IBM General Information Manual, Installation Manual - Physical Planning, GC22-7072
4. LoadLeveler for AIX 5L and Linux V3.2: Using and Administering, SA22-7881
5. PPC440x5 CPU Core User's Manual, http://www-306.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_440_Embedded_Core