Fault Tolerant Platforms for Manufacturing Applications
By Greg Gorbach, September
ARC Insights
KEYWORDS
Fault Tolerance, High Availability, Cluster, Collaborative Manufacturing
SUMMARY
New, low-cost technology for fault-tolerant platforms is now available for Microsoft
Windows 2000 environments. Manufacturers should revisit some old assumptions
about where they might benefit from deploying these platforms.
Collaboration puts a premium on real-time manufacturing information, and these systems
can help ensure that the information is always available. Next-generation automation
systems, production management systems, business systems, and collaborative systems can
all benefit from this technology.
The cost of the new fault tolerant systems has fallen so far that manufacturers must
rethink ensuring the availability of their critical information.
ANALYSIS
The first is the fully replicated, fault-tolerant hardware solution from Stratus Computer
Systems, with duplicate components operating in lockstep. In the event of a component
failure, there is no interruption in processing, no lost data, and no slowdown in
performance. The second approach, offered by Marathon Technologies, isolates all I/O from
both the user operating system and the application by placing these tasks on different
computers connected through proprietary interface cards, software, and high-speed
interconnect.

Description                 Stratus               Marathon                  Cluster
Availability                [illegible]           [illegible]               [illegible]
Recovery Time               Zero                  Milliseconds              Minutes
Copies of OS                [illegible]           Multiple                  Multiple
Symmetric Multi-Processing  Available             No                        Available
System Operation            Single System Image   Split Architecture        Multi-System Cluster
Implementation              No work required      Integrate FT Components   Script Development and Testing
Disaster Tolerance          3rd Party             Available                 3rd Party
Single Support Contact      Yes                   3rd Party                 3rd Party

Comparison of Fault Tolerant Solutions

Beyond Clusters
While the traditional clustering approach to fault tolerance does provide for enhanced
availability, there are significant limitations. Cluster solutions do not provide fault
tolerance (failure and repair/recovery is transparent to the user), only failover (a
backup system automatically restarts the
applications and logs on the users). Implementation requires the development, testing,
and support of custom failover scripts, licensing and installation of multiple copies of
software, and possibly application modifications for a cluster environment. In the event
of a hardware failure, a cluster failover always loses all memory contents, and several
minutes will be required to recover. Cluster solutions offer 99.9 percent availability
(about 8.8 hours of downtime per year), but fault-tolerant solutions offer 99.999 percent
availability (about 5 minutes of downtime per year).
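The downtime figures quoted above follow directly from the availability percentages. The short sketch below shows the arithmetic; the function name and the 365-day year are illustrative assumptions, not part of the original analysis.

```python
# Assumption: a non-leap year of 365 days, i.e. 8,760 hours.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year at the given availability level."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


# Cluster-class availability ("three nines"), in hours per year.
cluster_hours = annual_downtime_hours(99.9)

# Fault-tolerant-class availability ("five nines"), in minutes per year.
ft_minutes = annual_downtime_hours(99.999) * 60

print(f"99.9%   -> {cluster_hours:.2f} hours/year")
print(f"99.999% -> {ft_minutes:.2f} minutes/year")
```

Under these assumptions, 99.9 percent works out to just under 9 hours per year and 99.999 percent to a little over 5 minutes per year.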
Hardware Fault Tolerance
The first requirement for high availability systems is hardware fault tolerance. Stratus
and Marathon each take a different approach.
Stratus ftServer
ftServer uses standard Intel server components and designs, but Stratus designs its own
motherboard (using standard Intel server design guidelines), removes the PCI I/O, and
adds fault-detection logic that is key to fault isolation in a DMR configuration. The
system contains two motherboards for Dual Modular Redundancy (DMR) or three motherboards
for Triple Modular Redundancy (TMR). All motherboards run in lockstep, using a single
system clock and redundant clock cards. Fault-detection and isolation logic (a custom
ASIC) compares I/O output from all motherboards. DMR systems rely on fault-detection
logic on each motherboard to see which is in error. If no motherboard error is signaled,
a software algorithm decides which board to remove. In a TMR system, 3-way voting is used
to isolate the failed board. ftServer runs a single copy of all software, resulting in
lower licensing costs and simple administration.

[Figure: Stratus’ ftServer Architecture Ensures Zero Switchover Time, No Single Point of
Failure, and a Single Software Image. Panels: DMR and TMR. Each motherboard shows 1-N way
SMP lockstep CPUs, memory, chipset, and fault detection/isolation logic in front of PCI
and disk.]
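The DMR and TMR isolation rules described above can be sketched in a few lines. This is only an illustration of the decision logic, not Stratus' implementation (which the text says lives in a custom ASIC). The function names are hypothetical, and because the source does not describe the software algorithm used when no board signals an error, `tie_breaker` is a placeholder supplied by the caller.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def isolate_failed_board_tmr(outputs: Sequence[bytes]) -> Optional[int]:
    """3-way voting: return the index of the board that disagrees, or None."""
    if len(outputs) != 3:
        raise ValueError("TMR voting needs exactly three board outputs")
    majority_value, votes = Counter(outputs).most_common(1)[0]
    if votes == 3:
        return None  # all boards agree, nothing to isolate
    if votes == 1:
        raise RuntimeError("all three outputs differ; no majority exists")
    # Exactly one board disagrees with the two-board majority.
    return next(i for i, out in enumerate(outputs) if out != majority_value)


def isolate_failed_board_dmr(
    outputs: Sequence[bytes],
    error_flags: Sequence[bool],
    tie_breaker: Callable[[Sequence[bytes]], int],
) -> Optional[int]:
    """Two boards: trust on-board fault detection, else defer to software."""
    if outputs[0] == outputs[1]:
        return None  # outputs match, nothing to isolate
    flagged = [i for i, flag in enumerate(error_flags) if flag]
    if len(flagged) == 1:
        return flagged[0]  # exactly one board reported its own error
    # No board (or both boards) signaled an error: hypothetical software rule.
    return tie_breaker(outputs)
```

With three boards a mismatch identifies the faulty board by majority alone, whereas with two boards a mismatch only shows that something is wrong, so the per-board fault-detection signals are needed to decide which one to remove.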
Marathon Endurance System
Marathon physically and logically separates the two basic operations of computers:
manipulating and transforming data (computing) and moving data to and from mass storage,
networks, and other I/O devices (I/O processing). The computing function is put on one
server (the compute element, or CE), and the I/O processing function is put on another
server (the I/O processor, or IOP). These CE/IOP pairs (tuples) connect through
proprietary high-speed PCI interfaces and fiber optics. The Marathon Interface Card
(MIC) sends and receives data from both systems simultaneously. The MIC also provides
the comparison and test logic to ensure that both systems are identical. Each tuple is a
complete system, wherein the operating system running on both the CE and the IOP is a
Windows server OS. All CE I/O task requests go to the IOP for handling. Marathon software
runs as an application on the IOP and controls all of the fault management, disk
mirroring, system management, and resynchronization. Because the fault management is
done in software, it can affect performance. Depending on the applications running,
system performance may degrade by 10-20 percent or more.
It takes two tuples to configure an assured availability system. The IOPs run in
parallel, but not in lockstep. If an IOP fails, the other IOP continues to run the
system. The failed IOP can then be physically removed. After the Marathon software starts
running, the repaired IOP automatically rejoins the configuration. The mirrored disks are
re-mirrored in background mode over the private Ethernet linking the IOPs. The same
process handles the failure of a mirrored disk.

[Figure: Marathon Tuple: Building Block for an Assured Availability. A Compute Element
(CPU, memory, MIC; runs applications and operating system) linked to an I/O Processor
(CPU, memory, MIC, I/O adapters; handles all I/O and the network).]

Software Availability
The second requirement is for maximizing software availability. Clusters rely on standard
hardware, software, and service models that do not help prevent failures, isolate
failures, or resolve failures. They simply recover from failures. Once again, Marathon
and Stratus have different approaches.
Stratus
Software availability features seek to prevent outages, minimize those that cannot be
prevented, and resolve problems so that they do not happen again. Stratus does not
change any of the core Windows code. This guarantees 100 percent binary compatibility
of all Windows applications. Stratus does change the Windows 2000 environment, but
only in areas designed to be customized by hardware and software partners and separated
from the main body of Windows code by documented, well-defined interfaces.
Drivers cause a significant percentage of NT failures. Stratus driver hardening goes
beyond Windows 2000 improvements to further reduce driver-induced OS failures. The
driver defines its memory boundaries and works with Stratus hardware to automatically
prevent memory transfers beyond the defined memory boundaries. This prevents a bad
PCI card from crashing the system. The new Microsoft driver model for Windows 2000
uses WMI (Windows Management Instrumentation) for management, control, and reporting
functions. Stratus hardened drivers are completely compatible with WMI.
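The memory-boundary idea behind driver hardening can be illustrated with a simple range check. The sketch below is a conceptual model only. In the product the enforcement is done by the driver together with Stratus hardware, and the class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryWindow:
    """Address range a driver has declared for its device's transfers."""
    base: int
    length: int

    def contains(self, address: int, size: int) -> bool:
        """True only if the whole transfer lies inside the declared window."""
        end = address + size
        return size > 0 and self.base <= address and end <= self.base + self.length


def transfer_allowed(window: MemoryWindow, address: int, size: int) -> bool:
    """Reject any transfer that would touch memory outside the window."""
    return window.contains(address, size)


# Hypothetical window: 4 KiB starting at address 0x1000.
window = MemoryWindow(base=0x1000, length=0x1000)
in_bounds = transfer_allowed(window, 0x1800, 0x100)   # inside the window
overrun = transfer_allowed(window, 0x1F80, 0x100)     # runs past the end
```

A transfer that starts inside the window but runs past its end is rejected as a whole, which is the behavior that would keep a misbehaving card from overwriting unrelated memory.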
Stratus recommends that all drivers be hardened. Hardened drivers for all installed
adapters are required in order to receive Stratus’ 100 percent availability guarantee.
Incompatibilities between versions of hardware and software from different suppliers are
a well-known problem. The Resource Inventory Manager (RIM) identifies all system hardware
and software configuration elements, along with their revision levels, at initial install
and at every configuration change. This information is stored and is also sent to the
Stratus Customer Assistance Center (CAC), which can check for known conflicts and help
diagnose any problems.
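The kind of check that an inventory of revision levels makes possible can be sketched as follows. This is not the RIM data format or the CAC's procedure, neither of which is described in the source. The component names, revision strings, and conflict list are invented purely for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# An inventory element is a (component name, revision level) pair.
Element = Tuple[str, str]


def find_known_conflicts(
    inventory: Dict[str, str],
    known_conflicts: Set[FrozenSet[Element]],
) -> List[FrozenSet[Element]]:
    """Return each known-bad combination that is fully present in the inventory."""
    installed = set(inventory.items())
    return [combo for combo in known_conflicts if combo <= installed]


# Hypothetical inventory captured at install time.
inventory = {"scsi_driver": "2.1", "raid_firmware": "1.4", "nic_driver": "3.0"}

# Hypothetical list of revision combinations known not to work together.
known_conflicts = {
    frozenset({("scsi_driver", "2.1"), ("raid_firmware", "1.4")}),
    frozenset({("nic_driver", "2.9"), ("raid_firmware", "1.4")}),
}

conflicts = find_known_conflicts(inventory, known_conflicts)
```

Here only the first combination is reported, because the installed `nic_driver` revision differs from the one named in the second conflict.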
Marathon
Marathon’s architecture provides hardware fault tolerance and protection against
transient OS bugs, detects OS failures, and automatically restarts the system. Because
the IOPs run Marathon’s I/O management and fault-handling software, they are isolated
from the loads placed on the CEs by the user’s applications and operating system. The
IOPs run in parallel, but not in lockstep. Since the IOPs handle all interruptions, the
CEs are free to run the OS and user applications without the usual stream of asynchrony.
Interruptions are managed through a structured process that eliminates a major source of
asynchrony-induced software failures. The IOPs are subjected to these asynchronies, but
since there are two autonomous IOPs in a full fault-tolerant system, an interrupt-induced
software asynchrony will only affect one of the IOPs. If an IOP goes down, the surviving
IOP carries on until an automatic reboot of the failed IOP is completed.
Service
The third requirement for high availability systems is designed-in serviceability. Again,
Stratus and Marathon have different approaches.
Stratus
Serviceability is built into the ftServer hardware design in the form of customer
replaceable modules, automatic fault isolation, and remote management and reporting
through the Stratus remote management card. The Stratus Service Network (SSN) enables
remote access to every customer system. The Stratus Customer Assistance Center (CAC)
provides 24/7 critical support.
ftServer automatically isolates failures to the component level while continuing
operation on a second component. Failures are automatically reported to the CAC via a
dial connection. A replacement component is shipped from Stratus for next-day arrival.
The customer replaces the component while the system continues to operate. The new
component is automatically integrated into the running system. The system and application
continue to run normally through this entire process.
Each ftServer comes with two ftServer Management PCI adapters. These adapters are
themselves board-level computers. They run independently of the host system and are
powered even if the rest of the system is powered off. Either redundant ftServer
Management adapter provides full control over the ftServer. Access is controlled through
a TCP/IP interface via dial modem or local Ethernet.
If a customer calls, Stratus will troubleshoot the problem. If the problem is in Microsoft
Windows 2000 code, Stratus calls in Microsoft, based on its service contract with
Microsoft. Stratus also has licensed Windows 2000 source code and a staff of
kernel-trained engineers. Microsoft has also given Stratus access to its OS debugging
tools.
Marathon
The Marathon Assured Availability system has three states: operational, vulnerable, and
down. The vulnerable state, invisible to users, notifies the system manager that a
repair/resynchronization cycle can be initiated. Marathon provides two notification
methods: the system console and the event log. The console presents a graphical model
on the system monitor, on remote systems over the network, or through a serial line to
the system manager. Color-coded components indicate their state, and a point-and-click
interface is used to examine and manage system components. The second method uses
the Windows server event log to log all events, including Marathon system events.
Several third-party tools are available that use the event log to communicate specified
events via beepers, fax, e-mail, etc., to the system manager.
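The three states named above can be modeled as a small classifier. The source does not define the exact conditions for each state, so the rule used here is an assumption made for illustration: the system is operational when both CEs and both IOPs are healthy, down when no CE or no IOP is left, and vulnerable when it is still running but has lost redundancy. All names are hypothetical.

```python
from enum import Enum


class SystemState(Enum):
    OPERATIONAL = "operational"
    VULNERABLE = "vulnerable"
    DOWN = "down"


def system_state(healthy_ces: int, healthy_iops: int) -> SystemState:
    """Classify a two-tuple configuration from counts of healthy elements."""
    if healthy_ces == 0 or healthy_iops == 0:
        return SystemState.DOWN  # nothing left to compute or to do I/O
    if healthy_ces >= 2 and healthy_iops >= 2:
        return SystemState.OPERATIONAL  # full redundancy in place
    return SystemState.VULNERABLE  # still running, but one more fault stops it


def needs_repair_notice(state: SystemState) -> bool:
    """The vulnerable state is the cue to start a repair/resynchronization cycle."""
    return state is SystemState.VULNERABLE
```

In this model, losing one IOP leaves users unaffected but moves the system to the vulnerable state, which is when a console or event-log notification to the system manager would matter most.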
RECOMMENDATIONS
• All systems supporting real-time collaboration throughout the enterprise and value
chain should be deployed on fault-tolerant platforms.
• For control-level, real-time, batch, and process control applications, Stratus
ftServer has the advantage because its architecture has no single point of failure
and zero switchover time.
• When selecting fault-tolerant solutions, consider the whole solution, including
hardware fault tolerance, software availability, performance, implementation costs,
and serviceability.
For further information, contact your account manager or the author at ggorbach@arcweb.com.
Recommended circulation: All EAS and MAS clients.
© ARC Advisory Group • Allied Drive • Dedham, MA USA • ARCweb.com
USA • UK • Germany • Japan • India