The MQ Appliance includes two key facilities for maintaining the availability of your messaging infrastructure across expected and unexpected outages. This session looks in depth at the HA and DR capabilities of the MQ Appliance and application considerations when using them.
Building a Highly Available Messaging Hub using the IBM MQ Appliance
1. Deploying a Highly Available Messaging Hub using the IBM MQ Appliance
Session #3465
Anthony Beardsmore (abeards@uk.ibm.com)
IBM MQ Appliance Architect
3. Very high level overview and terminology
• High availability: providing continuity of service after a ‘component’
failure (e.g. disk, software bug, network switch, power supply)
• Disaster recovery: providing a means to re-establish service after a
catastrophic failure (e.g. entire data center power loss, flood, fire)
• Primary and secondary – which instance is currently ‘active’
– Has direct impact on commands you execute and behaviour
– N.B. in Disaster Recovery features ‘active’ doesn’t necessarily mean ‘running’
• Main and recovery – which instance is usually ‘active’
– Doesn’t affect behaviour, but a useful concept in discussions and documentation
4. HA and DR – comparison of aims
High Availability
• 100% data retention
• Short distances (meters or miles)
• Automatic failover
Disaster recovery
• Some (minimal) data loss acceptable
• Long distances (out of region)
• Manual failover
In both cases we want:
• Minimal possible effect on performance
• As little impact on applications as possible
5. Timeline / Versions
• This presentation is based on the capabilities of the 8.0.0.4 firmware for
the IBM MQ Appliance
– Released December 2015
• Support for High Availability (HA) was a key feature of the IBM MQ
Appliance from the first release and has been improved in every Fix
Pack since
– Strongly recommend moving to latest release to get all enhancements
• Support for Disaster Recovery (DR) was added in 8.0.0.4 and reuses some of
the technology used for HA
7. Setting up HA
Implementing HA is simple with the MQ Appliance!
1. Connect two appliances together
2. On Appliance #1 issue the following command:
prepareha -s <some random text> -a <address of appliance2>
3. On Appliance #2 issue the following command:
crthagrp -s <the same random text> -a <address of appliance1>
4. Then create an HA queue manager:
crtmqm -sx HAQM1
That’s it!
Note that there is no need to run strmqm. Queue managers will
start and keep running unless explicitly ended with endmqm.
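As an illustration, a complete setup might look like the following sketch. The addresses and the shared-secret text are hypothetical examples, and the queue manager can be created on whichever appliance should initially run it:
On Appliance #1 (10.0.0.1):
prepareha -s example-shared-secret -a 10.0.0.2
On Appliance #2 (10.0.0.2):
crthagrp -s example-shared-secret -a 10.0.0.1
On the appliance that should initially host the queue manager:
crtmqm -sx HAQM1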
13. HA Queue managers in MQ Console
(Screenshots: the MQ Console view of the HA Group, as seen on Appliance #1 and on Appliance #2)
14. HA Queue managers in MQ Console: After Failover
• Appliance #1 is now in Standby
• All HA queue managers are now running on Appliance #2
• The console shows the High Availability alert, and a menu to allow you to see the
status and to suspend or resume the appliance in the HA group.
15. HA failover (CLI view)
Before failover, on Appliance #2:
• HAQM2 is running there, in its primary and preferred location
• HAQM1 is running on its primary, Appliance #1, so shows as secondary on Appliance #2
After failover, on Appliance #2:
• HAQM1 is now running on Appliance #2
16. Appliance HA commands
Command        Description
crthagrp       Create a high availability group of appliances
dsphagrp       Display the status of the appliances in the HA group
makehaprimary  Specify that an appliance is the 'winner' when resolving a partitioned situation in the HA group
prepareha      Prepare an appliance to be part of an HA group
sethagrp       Pause and resume an appliance in an HA group
status         Per queue manager detailed information, including replication status (e.g. percentage complete when re-synchronizing after a lost connection)
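For example, to check the overall group state and then an individual queue manager after a connection has been lost, you might run the following from the appliance MQ CLI (a sketch only – HAQM1 is a hypothetical name and the output format depends on the firmware level):
dsphagrp
status HAQM1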
17. Notes: Designing a group
• This image is straight from the Infocenter and gives a good overview of the
possible combinations of Queue managers in an HA Group.
• As long as IP Addresses etc. for the three HA interfaces are correctly pre-
configured, defining a group is as simple as executing the ‘crthagrp/prepareha’
commands.
– Appliances can only be in exactly one group of exactly two appliances
• Queue managers may be added to the group at crtmqm time, or after creation
using the ‘sethagrp’ command
– Queue managers can be active on either appliance (both appliances can simultaneously be
running different active queue managers). Up to 16 active/passive instances per appliance
are permitted.
• Unlimited (other than by storage capacity etc.) non-HA queue managers may
also be present on either appliance – see the sketch after these notes
– This might be desirable, for example, if you have applications/queue managers with different
QoS agreements, or test and production environments on the same system.
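To make that distinction concrete, the only difference at creation time is the -sx flag (both queue manager names here are hypothetical):
crtmqm -sx HAQM1
creates a replicated queue manager that can fail over within the HA group, while
crtmqm LOCALQM1
creates an ordinary queue manager that is local to one appliance and not replicated.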
18. HA – requirements / restrictions
• MUST ensure redundancy between heartbeat (1 Gb) links
– Shared nothing – power, switches, routers, cables
– Minimises risk of partitioning (‘split brain’)
• Less than 10 ms latency between appliances (replication interface)
– For good application performance you may find far lower latency is required
– 1 or 2 ms is a good target – practical at ‘metro area’ distances
• Sufficient bandwidth for all message data being transferred through HA
queue managers (replication interface)
• No native VLAN tagging or link aggregation on these connections
• The ideal is physical cabling between systems
– In a single data centre, co-located rack scenario, this avoids all infrastructure concerns
20. Setting up Disaster Recovery
DR has different goals (asynchronous, manual) so slightly different
externals to HA, but a similar process:
1. Connect two appliances together (only one, 10 Gb, connection needed)
2. On the ‘live’ appliance, convert the queue manager to a Disaster Recovery ‘source’:
crtdrprimary -m <name> -r <standby> -i <ip address> -p <port>
3. On the ‘standby’ appliance, simply paste the text provided by the above:
crtdrsecondary <some provided parameters>
Synchronization begins immediately (the ‘status’ command shows progress)
In the event that you need to start the standby instance, ‘makedrprimary’
takes ownership of the queue manager and then it is business as usual
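As a worked sketch (the queue manager name, appliance name, address and port are all hypothetical, and the -m flag on makedrprimary is an assumption based on the crtdrprimary syntax above):
On the main (‘live’) appliance:
crtdrprimary -m DRQM1 -r applianceB -i 10.1.1.2 -p 2015
On the recovery (‘standby’) appliance, paste the crtdrsecondary command printed by crtdrprimary:
crtdrsecondary <text provided by crtdrprimary>
Check replication progress on either appliance:
status DRQM1
After a disaster, bring the recovery copy into service on the standby appliance:
makedrprimary -m DRQM1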
23. Replication, synchronization and snapshots (1)
• There are two modes in which data can be sent from the Primary
instance to the Secondary instance
1. Replication – when the two instances are connected, each individual write is
replicated from the Primary to the Secondary in the order in which they are made on
the Primary
2. Synchronization – when the connection is lost and then restored.
Synchronization is used to get the Secondary back in step as quickly as
possible
– but this means that the Secondary is inconsistent until the synchronization completes,
and the queue manager would not be able to start there
24. Replication, synchronization and snapshots (2)
• To resolve this issue, a ‘snapshot’ is taken of the queue manager
whenever synchronization is started
• If the connection is lost again (or the Primary appliance fails completely)
while synchronizing, you can still issue the makedrprimary command on the
standby to recover. However:
– This will revert the queue manager to the state at the beginning of the synchronization
– This can take a long time (hours for a large queue manager)
– Updates made to the Primary since the original outage will be lost
• Space is reserved for this process whenever DR queue managers are
configured
– So you may be surprised to see less disk space available than you expected!
25. DR – requirements / restrictions
• The maximum latency for the replication link is 100 ms.
• The IP addresses used for the two eth20 ports should belong to the
same subnet, and that subnet should only be used for disaster
recovery configuration.
• Native VLAN (trunked) and link aggregation are not supported on the
replication interface
27. Channel reconnection
• The same approach is used whether in HA or DR mode
– While the underlying technology is different, it uses the same ‘externals’ as multi-instance queue
manager support (included since MQ 7.0.1)
• Client applications (and other queue managers) reconnect to the
standby instance after failure by configuring multiple IP addresses for
the channels
– Either explicitly in CONNAME (a comma-separated list) – see the example below
– Or by defining a CCDT with multiple endpoints
– Or using a preconnect exit
• Don’t forget that cluster receivers define their own ‘multiple homes’
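For example, a client connection channel definition could name both appliances in its CONNAME so that a reconnecting client tries each address in turn (the channel name, host names and port below are hypothetical):
DEFINE CHANNEL(APP.CLIENTS) CHLTYPE(CLNTCONN) +
       CONNAME('appliance1.example.com(1414),appliance2.example.com(1414)') +
       QMNAME(HAQM1)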
28. Client reconnect implications
• When a queue manager ‘fails over’ using HA or DR, effectively from the
point of view of an application or remote queue manager, this queue
manager has been restarted.
– Ordinarily, application would receive MQRC_CONNECTION_BROKEN
• This can typically be hidden from applications using 7.0.1 or higher
client libraries by using ‘client auto-reconnect’ feature, but there are
some limitations to be aware of
– Failure during initial connect will still result in ‘MQCONN’ receiving a bad rc and having
to retry
– Browse cursors are reset
– An in-doubt ‘commit’ cannot be transparently handled.
• This can allow existing applications to exploit HA with no change or
minimal change, but consult the documentation – see the configuration sketch below
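One way of requesting auto-reconnect without changing application code is in the client configuration file (mqclient.ini); this is a minimal sketch assuming the CHANNELS stanza attributes from the MQ client documentation:
CHANNELS:
   DefRecon=YES
With DefRecon=YES the client libraries attempt to reconnect automatically when the connection is broken, subject to the limitations listed above.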
29. Security
• Most security data - for example certificate stores and authority records
- is replicated alongside queue manager in HA or DR configuration
• However: users and groups are NOT replicated between appliances
– because not all configuration has to be identical – you may have queue managers on
either device with associated users that you do not wish to replicate
– So… strongly consider LDAP for messaging users on replicated queue managers
• Group configuration/heartbeat/management is ‘secure by default’
– Prevents, for example, another device configuring itself as a replication target
• But there is currently no encryption on the replication link – possibly acceptable
for single data centre HA, but it needs careful consideration for DR
– If necessary, using AMS will ensure all message data is encrypted both at rest and
across replication (see the sketch below)
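As a sketch of that last point (the queue name and recipient DN are hypothetical, and the exact SET POLICY parameters should be checked against the MQ 8.0 AMS documentation), an encryption policy applied through runmqsc might look like:
SET POLICY('APP.REQUEST.QUEUE') +
    ENCALG(AES256) +
    RECIP('CN=receiving-app,O=Example') +
    ACTION(ADD)
Once such a policy is in place, message data on that queue is protected end to end, so the content crossing the HA or DR replication link is already encrypted.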
33. Notices and Disclaimers (cont’d)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not
tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the
ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT
NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual
property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®,
FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG,
Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®,
PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®,
StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business
Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
34. Thank You
Your Feedback is Important!
Access the InterConnect 2016 Conference Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.