Best practices for MySQL High Availability
Colin Charles, Chief Evangelist, Percona Inc.

colin.charles@percona.com / byte@bytebot.net

http://www.bytebot.net/blog/ | @bytebot on Twitter

Percona Live Europe Amsterdam, Netherlands

3 October 2016
whoami
• Chief Evangelist (in the CTO office), Percona Inc
• Founding team of MariaDB Server (2009-2016), previously at
Monty Program Ab, merged with SkySQL Ab, now MariaDB
Corporation
• Formerly MySQL AB (exit: Sun Microsystems)
• Past lives include Fedora Project (FESCO), OpenOffice.org
• MySQL Community Contributor of the Year Award winner 2014
2
Agenda
• Choosing the right High Availability (HA) solution
• Discuss replication
• Handling failure
• Discuss proxies
• HA in the cloud, geographical redundancy
• Sharding solutions
• MySQL 5.6/5.7 features + utilities + Fabric + Router
• What’s next?
3
4
5
6
7
8
9
Uptime
Percentile target Max downtime per year
90% 36 days
99% 3.65 days
99.5% 1.83 days
99.9% 8.76 hours
99.99% 52.56 minutes
99.999% 5.25 minutes
99.9999% 31.5 seconds
10
Estimates of levels of availability
11
Method: Level of Availability
Simple replication: 98-99.9%
Master-Master / MMM: 99%
SAN: 99.5-99.9%
DRBD, MHA, Tungsten Replicator: 99.9%
NDBCLUSTER, Galera Cluster: 99.999%
HA is Redundancy
• RAID: disk crashes? Another works
• Clustering: server crashes? Another works
• Power: fuse blows? Redundant power supplies
• Network: Switch/NIC crashes? 2nd network route
• Geographical: Datacenter offline/destroyed? Computation to
another DC
12
Durability
• Data stored on disks
• Is it really written to the disk?
• being durable means calling fsync() on each commit
• Is it written in a transactional way to guarantee atomicity,
crash safety, integrity?
13
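To make the above concrete, a minimal sketch of the fully durable settings in MySQL/InnoDB terms; this is the strict combination rather than a blanket recommendation, and both settings can also live in my.cnf:
mysql> SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- fsync the InnoDB redo log at every commit
mysql> SET GLOBAL sync_binlog = 1;                      -- fsync the binary log at every commit
mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
mysql> SHOW GLOBAL VARIABLES LIKE 'sync_binlog';
Relaxing either one trades durability for throughput, which is exactly the trade-off group commit (later slides) tries to soften.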
High Availability for databases
• HA is harder for databases
• Hardware resources and data need to be redundant
• Remember, this isn’t just data - constantly changing data
• HA means the operation can continue uninterrupted, not by
restoring a new/backup server
• uninterrupted: measured in percentiles
14
Redundancy through client-side
XA transactions
• Client writes to 2 independent but identical databases
• HA-JDBC (http://ha-jdbc.github.io/)
• No replication anywhere
15
InnoDB “recovery” time
•innodb_log_file_size
• larger = longer recovery times
• Percona Server 5.5 (XtraDB) - innodb_recovery_stats
16
Redundancy through
shared storage
• Requires specialist hardware, like a SAN
• Complex to operate
• One set of data is your single point of failure
• Cold standby
• failover 1-30 minutes
• this isn’t scale-out
• Active/Active solutions: Oracle RAC, ScaleDB
17
Redundancy through disk
replication
• DRBD
• Linux administration vs. DBA skills
• Synchronous
• Second set of data inaccessible for use
• Passive server acting as hot standby
• Failover: 1-30 minutes
• Performance hit: DRBD worst case is ~60% single node performance, with
higher average latencies
18
19
MySQL Sandbox
• Great for testing various versions of MySQL/Percona Server/
MariaDB
• Great for creating replication environments
• make_sandbox mysql.tar.gz
• make_replication_sandbox mysql.tar.gz
• http://mysqlsandbox.net/
20
Redundancy through MySQL
replication
• MySQL replication
• Tungsten Replicator
• Galera Cluster
• MySQL Cluster (NDBCLUSTER)
• Storage requirements are multiplied
• Huge potential for scaling out
21
MySQL Replication
• Statement based generally
• Row based became available in 5.1, and the default in 5.7
• mixed-mode, resulting in STATEMENT except if calling
• UUID function, UDF, CURRENT_USER/USER function, LOAD_FILE function
• 2 or more AUTO_INCREMENT columns updated with same statement
• server variable used in statement
• storage engine doesn’t allow statement based replication, like
NDBCLUSTER
22
MySQL Replication II
• Asynchronous by default
• Semi-synchronous plugin in 5.5+
• However the holy grail of fully synchronous replication is not
part of standard MySQL replication (yet?)
• MariaDB Galera Cluster is built-in to MariaDB Server 10.1
23
The logs
• Binary log (binlog) - events that describe database changes
• Relay log - events read from binlog on master, written by slave
i/o thread
• master_info_log - status/config info for slave’s connection to
master
• relay_log_info_log - status info about execution point in slave’s
relay log
24
Semi-synchronous replication
• semi-sync capable slave acknowledges transaction event only
after written to relay log & flushed to disk
• timeout occurs? master reverts to async replication; resumes
when slaves catch up
• at scale, Facebook runs semi-sync: http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html
25
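A sketch of how the stock semi-sync plugins are enabled (MySQL 5.5+); the .so file names assume a Linux build, and 10000 ms is simply the default timeout written out explicitly:
-- on the master
mysql> INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
mysql> SET GLOBAL rpl_semi_sync_master_enabled = 1;
mysql> SET GLOBAL rpl_semi_sync_master_timeout = 10000;   -- ms before reverting to async
-- on each semi-sync capable slave
mysql> INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
mysql> SET GLOBAL rpl_semi_sync_slave_enabled = 1;
mysql> STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;       -- reconnect so the slave registers as semi-sync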
Semi-sync II
• nowadays, it's enhanced (COMMIT method):
1. prepare transaction in storage engine
2. write transaction to binlog, flush to disk
3. wait for at least one slave to ack binlog event
4. commit transaction to storage engine
26
MySQL Replication in 5.6
• Global Transaction ID (GTID)
• Server UUID
• Ignore (master) server IDs
(filtering)
• Per-schema multi-threaded
slave
• Group commit in the binary
log
• Binary log (binlog) checksums
• Crash safe binlog and relay
logs
• Time delayed replication
• Parallel replication (per
database)
27
Group commit in MariaDB 5.3
onwards
• Do slow part of prepare() in parallel in InnoDB (first
fsync(), InnoDB group commit)
• Put transaction in queue, decide commit order
28
• First in queue runs serial part for all, rest wait
• Wait for access to the binlog
• Write transactions into binlog, in order, then sync (second
fsync())
• Run the fast part of commit() for all transactions in order
29
• Finally, run the slow part of commit() in parallel (third
fsync(), InnoDB group commit)
• Only 2 context switches per thread (one sleep, one wakeup)
• Note: MySQL 5.6, MariaDB 10 only does 2 fsyncs/group
commit
30
Group commit in MariaDB 10
• Remove commit in slow part of InnoDB commit (stage 4)
• Reduce cost of crash-safe binlog
• A binlog checkpoint is a point in the binlog before which no crash
recovery is needed. In InnoDB, commit means waiting for a flush +
fsync of its redo log
31
crash-safe binlog
• MariaDB 5.5 checkpoints after every commit -> expensive!
• 5.5/5.6 stalls commits around binlog rotate, waiting for all
prepared transactions to commit (since crash recovery can
only scan latest binlog file)
32
crash-safe binlog 10.0
• 10.0 makes binlog checkpoints asynchronous
• A binlog can have no checkpoints at all
• Ability to scan multiple binlogs during crash recovery
• Remove stalls around binlog rotates
33
group commit in 10.1
• Tricky locking issues hard to change without getting deadlocks sometimes
• mysql#68251, mysql#68569
• New code? Binlog rotate in background thread (further reducing stalls). Split
transactions across binlogs, so big transactions do not lead to big binlog files
• Works with enhanced semi-sync replication (wait for slave before commit on the
master rather than after commit)
34
Replication: START TRANSACTION
WITH CONSISTENT SNAPSHOT
• Works with the binlog, possible to obtain the binlog position corresponding to a
transactional snapshot of the database without blocking any other queries.
• by-product of group commit in the binlog to view commit ordering
• Used by the command mysqldump --single-transaction --master-data to do a fully non-blocking backup
• Works consistently between transactions involving more than one storage
engine
• https://kb.askmonty.org/en/enhancements-for-start-transaction-with-consistent/
• Percona Server made it better, by session ID, and also introducing backup locks
35
Multi-source replication
• Multi-source replication - (real-time) analytics, shard provisioning,
backups, etc.
• @@default_master_connection contains current connection
name (used if connection name is not given)
• All master/slave commands take a connection name now (like CHANGE MASTER 'connection_name', SHOW SLAVE 'connection_name' STATUS, etc.)
36
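A minimal MariaDB 10 multi-source sketch; host1/host2 and the repl user are placeholders, and MASTER_USE_GTID is optional but convenient:
mysql> CHANGE MASTER 'master1' TO MASTER_HOST='host1', MASTER_USER='repl',
       MASTER_PASSWORD='...', MASTER_USE_GTID=slave_pos;
mysql> CHANGE MASTER 'master2' TO MASTER_HOST='host2', MASTER_USER='repl',
       MASTER_PASSWORD='...', MASTER_USE_GTID=slave_pos;
mysql> START ALL SLAVES;
mysql> SHOW SLAVE 'master1' STATUS\G
mysql> SET @@default_master_connection = 'master2';
mysql> SHOW SLAVE STATUS\G        -- now refers to the 'master2' connection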
Global Transaction ID (GTID)
• Supports multi-source replication
• GTID can be enabled or disabled independently and online for masters or
slaves
• Slaves using GTID do not have to have binary logging enabled.
• (MariaDB) Supports multiple replication domains (independent binlog
streams)
• Queries in different domains can be run in parallel on the slave.
37
Why is MariaDB GTID different compared to 5.6?
• MySQL 5.6 GTID does not support multi-source replication
• Supports --log-slave-updates=0 for efficiency
• Enabled by default
• Turn it on without having to restart the topology
38
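For contrast, a sketch of what MySQL 5.6 GTID needs; all of these are required on every server in 5.6 (hence "restart the topology"), and master1/repl are placeholders:
# my.cnf on every server (MySQL 5.6)
log_bin                  = mysql-bin
gtid_mode                = ON
enforce_gtid_consistency = 1
log_slave_updates        = 1

mysql> CHANGE MASTER TO MASTER_HOST='master1', MASTER_USER='repl',
       MASTER_PASSWORD='...', MASTER_AUTO_POSITION=1;
mysql> START SLAVE;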
Crash-safe slave (w/InnoDB
DML)
• Replace non-transactional file relay_log.info with transactional
mysql.rpl_slave_state
• Changes to rpl_slave_state are transactionally
recovered after crash along with user data.
39
Crash-safe slaves in 5.6?
• Not using GTID
• you can put relay-log.info into InnoDB table, that gets updated along w/trxn
• Using GTID
• relay-log.info not used. Slave position stored in the binlog on the slave (--log-slave-updates required)
• Using parallel replication
• Uses a different InnoDB table for this use case
40
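A sketch of the non-GTID, non-parallel crash-safe slave settings mentioned above (MySQL 5.6+); with the repositories in TABLE mode the position lives in the transactional mysql.slave_master_info / mysql.slave_relay_log_info tables:
# my.cnf on the slave
master_info_repository    = TABLE
relay_log_info_repository = TABLE   # position updated in the same transaction as the data
relay_log_recovery        = ON      # discard relay logs after a crash and refetch from the master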
Replication domains
• Keep central concept that replication is just applying events in-order from a serial
binlog stream.
• Allow multi-source replication with multiple active masters
• Lets the DBA configure multiple independent binlog streams (one per active master: mysqld --gtid-domain-id=#)
• Events within one stream are ordered the same across entire replication
topology
• Events between different streams can be in different order on different servers
• Binlog position is one ID per replication domain
41
Parallel replication
• Multi-source replication from different masters executed in parallel
• Queries from different domains are executed in parallel
• Queries that are run in parallel on the master are run in parallel
on the slave (based on group commit).
• Transactions modifying the same table can be updated in parallel
on the slave!
• Supports both statement based and row based replication.
42
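A hedged sketch of switching parallel apply on; slave_parallel_threads is the MariaDB 10 knob, slave_parallel_workers is the MySQL 5.6 per-database equivalent, and 8 threads is an arbitrary example value:
-- MariaDB 10 slave
mysql> STOP SLAVE;
mysql> SET GLOBAL slave_parallel_threads = 8;
mysql> START SLAVE;
-- MySQL 5.6 slave (per-database parallelism)
mysql> STOP SLAVE;
mysql> SET GLOBAL slave_parallel_workers = 8;
mysql> START SLAVE;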
All in… sometimes it
can get out of sync
• Changed information on slave directly
• Statement based replication
• non-deterministic SQL (UPDATE/
DELETE with LIMIT and without
ORDER BY)
• triggers & stored procedures
• Master in MyISAM, slave in InnoDB
(deadlocks)
• --replicate-ignore-db with fully
qualified queries
• Binlog corruption on master
• PURGE BINARY LOGS issued and
not enough files to update slave
• read_buffer_size larger than
max_allowed_packet
• Bugs?
43
Replication Monitoring
• Percona Toolkit is important
• pt-slave-find: find slave information from master
• pt-table-checksum: online replication consistency check
• executes checksum queries on master
• pt-table-sync: synchronise table data efficiently
• changes data, so backups important
44
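A sketch of typical invocations; the master host, user and the percona.checksums table are placeholders, and running pt-table-sync with --print before --execute is the cautious order:
$ pt-table-checksum --replicate=percona.checksums --databases=app h=master1,u=checksum,p=...
$ pt-table-sync --replicate=percona.checksums --print h=master1,u=checksum,p=...
$ pt-table-sync --replicate=percona.checksums --execute h=master1,u=checksum,p=...   # after a backup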
Replication Monitoring with
PMM
45
•http://pmmdemo.percona.com/
Statement Based Replication Binlog
$ mysqlbinlog mysql-bin.000001
# at 3134
#140721 13:59:57 server id 1 end_log_pos 3217 CRC32 0x974e3831 Query thread_id=9 exec_time=0 error_code=0
SET TIMESTAMP=1405943997/*!*/;
BEGIN
/*!*/;
# at 3217
#140721 13:59:57 server id 1 end_log_pos 3249 CRC32 0x8de28161 Intvar
SET INSERT_ID=2/*!*/;
# at 3249
#140721 13:59:57 server id 1 end_log_pos 3370 CRC32 0x121ef29f Query thread_id=9 exec_time=0 error_code=0
SET TIMESTAMP=1405943997/*!*/;
insert into auto (data) values ('a test 2')
/*!*/;
# at 3370
#140721 13:59:57 server id 1 end_log_pos 3401 CRC32 0x34354945 Xid = 414
COMMIT/*!*/;
46
Dynamic replication variable
control
• SET GLOBAL binlog_format = 'STATEMENT' | 'ROW' | 'MIXED'
• Can also be set at the session level
• Dynamic replication filtering variables on MariaDB 5.3+
47
Row based replication event
> mysqlbinlog mysql-bin.*
# at 3401
#140721 14:03:59 server id 1 end_log_pos 3477 CRC32 0xa37f424a Query thread_id=9 exec_time=0 error_code=0
SET TIMESTAMP=1405944239.559237/*!*/;
BEGIN
/*!*/;
# at 3477
#140721 14:03:59 server id 1 end_log_pos 3529 CRC32 0xf4587de5 Table_map: `demo`.`auto` mapped to number 80
# at 3529
#140721 14:03:59 server id 1 end_log_pos 3585 CRC32 0xbfd73d98 Write_rows: table id 80 flags: STMT_END_F
BINLOG '
rwHNUxMBAAAANAAAAMkNAAAAAFAAAAAAAAEABGRlbW8ABGF1dG8AAwMRDwMGZAAE5X1Y9A==
rwHNUx4BAAAAOAAAAAEOAAAAAFAAAAAAAAEAAgAD//gDAAAAU80BrwiIhQhhIHRlc3QgM5g9178=
'/*!*/;
# at 3585
#140721 14:03:59 server id 1 end_log_pos 3616 CRC32 0x5f422fed Xid = 416
COMMIT/*!*/;
48
mysqlbinlog versions
• ERROR: Error in Log_event::read_log_event(): 'Found invalid
event in binary log', data_len: 56, event_type: 30
• 5.6 ships with a "streaming binlog backup server" - v3.4; MariaDB 10 doesn't - v3.3 (fixed in 10.2: MDEV-8713)
• GTID variances!
49
GTID
50
# at 471
#140721 14:20:01 server id 1 end_log_pos 519 CRC32 0x209d8843 GTID [commit=yes]
SET @@SESSION.GTID_NEXT= 'ff89bf58-105e-11e4-b2f1-448a5b5dd481:2'/*!*/;
# at 519
#140721 14:20:01 server id 1 end_log_pos 602 CRC32 0x5c798741 Query thread_id=3 exec_time=0 error_code=0
SET TIMESTAMP=1405945201.329607/*!*/;
BEGIN
/*!*/;
# at 602
# at 634
#140721 14:20:01 server id 1 end_log_pos 634 CRC32 0xa5005598 Intvar
SET INSERT_ID=5/*!*/;
#140721 14:20:01 server id 1 end_log_pos 760 CRC32 0x0b701850 Query thread_id=3 exec_time=0 error_code=0
SET TIMESTAMP=1405945201.329607/*!*/;
insert into auto (data) values ('a test 5 gtid')
/*!*/;
# at 760
#140721 14:20:01 server id 1 end_log_pos 791 CRC32 0x497a23e0 Xid = 31
COMMIT/*!*/;
SHOW SLAVE STATUS
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: server1
Master_User: repluser
Master_Port: 3306
...
Master_Log_File: server1-binlog.000008 <- io_thread (read)
Read_Master_Log_Pos: 436614719 <- io_thread (read)
Relay_Log_File: server2-relaylog.000007 <- io_thread (write)
Relay_Log_Pos: 236 <- io_thread (write)
Relay_Master_Log_File: server1-binlog.000008 <- sql_thread
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Exec_Master_Log_Pos: 436614719 <- sql_thread
...
Seconds_Behind_Master: 0
51
Slave prefetching
• Replication Booster
• https://github.com/yoshinorim/replication-booster-for-mysql
• Prefetch MySQL relay logs to make the SQL thread faster
• Tungsten has slave prefetch
• Percona Server till 5.6 + MariaDB till 10.1 have InnoDB fake
changes
52
What replaces slave prefetching?
• In Percona Server 5.7, slave prefetching has been replaced by
doing intra-schema parallel replication
• Feature removed from XtraDB
• MariaDB Server 10.2 will also have this feature removed
53
Tungsten Replicator
• Replaces MySQL Replication layer
• MySQL writes binlog, Tungsten reads it and uses its own replication protocol
• Global Transaction ID
• Per-schema multi-threaded slave
• Heterogeneous replication: MySQL <-> MongoDB <-> PostgreSQL <-> Oracle
• Multi-master replication
• Multiple masters to single slave (multi-source replication)
• Many complex topologies
• Continuent Tungsten (Enterprise) vs Tungsten Replicator (Open Source)
54
In today’s world, what does it
offer?
• opensource MySQL <-> Oracle replication to aid in your
migration
• automatic failover without MHA
• multi-master with cloud topologies too
• Oracle <-> Oracle replication (this is Golden Gate for FREE)
• Replication from MySQL to MongoDB
• Data loading into Hadoop
55
Galera Cluster
• Inside MySQL, a replication plugin (wsrep)
• Replaces MySQL replication (but can work alongside it too)
• True multi-master, active-active solution
• Synchronous
• WAN performance: 100-300ms/commit, works in parallel
• No slave lag or integrity issues
• Automatic node provisioning
56
57
Percona XtraDB Cluster 5.7
• Engineering within Percona
• Load balancing with ProxySQL (bundled)
• PMM integration
• Benefits of all the MySQL 5.7 feature-set
58
Group replication
• Fully synchronous replication (update everywhere), self-healing,
with elasticity, redundancy
• Single primary mode supported
• MySQL InnoDB Cluster - a combination of group replication,
Router, to make magic!
59
MySQL NDBCLUSTER
• 3 types of nodes: SQL, data and management
• MySQL node provides the interface to data. Alternate APIs available: LDAP, memcached,
native NDB API, node.js
• Data nodes (NDB storage)
• different to InnoDB
• transactions synchronously written to 2 (or more) nodes - replicas
• transparent sharding: partitions = data nodes/replicas
• automatic node provisioning, online re-partitioning
• High performance: 1 billion updates / minute
60
Summary of Replication
Performance
• SAN has "some" latency overhead compared to local disk. Can be great
for throughput.
• DRBD = 50% performance penalty
• Replication, when implemented correctly, has no performance penalty
• But MySQL replication with disk bound data set has single-threaded
issues!
• Semi-sync is poorer on WAN compared to async
• Galera & NDB provide read/write scale-out, thus more performance
61
Handling failure
• How do we find out about failure?
• Polling, monitoring, alerts...
• Error returned to and handled in client side
• What should we do about it?
• Direct requests to the spare nodes (or DCs)
• How to protect data integrity?
• Master-slave is unidirectional: Must ensure there is only one master at all times.
• DRBD and SAN have cold-standby: Must mount disks and start mysqld.
• In all cases must ensure that 2 disconnected replicas cannot both commit independently. (split
brain)
62
Frameworks to handle failure
• MySQL-MMM
• Severalnines
ClusterControl
• Orchestrator
• MySQL MHA
• Percona Replication
Manager
• Tungsten Replicator
• 5.6: mysqlfailover,
mysqlrpladmin
• (MariaDB) Replication
Manager
63
MySQL-MMM
• You have to setup all nodes and replication manually
• MMM gives Monitoring + Automated and manual failover on top
• Architecture consists of Monitor and Agents
• Typical topology:
• 2 master nodes
• Read slaves replicate from each master
• If a master dies, all slaves connected to it are stale
• http://mysql-mmm.org/
64
Severalnines ClusterControl
• Started as automated deployment of MySQL NDB Cluster
• now: 4 node cluster up and running in 5 min!
• Now supports
• MySQL replication and Galera
• Semi-sync replication
• Automated failover
• Manual failovers, status check, start & stop of node, replication, full cluster... from single command line.
• Monitoring
• Topology: Pair of semi-sync masters, additional read-only slaves
• Can move slaves to new master
• http://severalnines.com/
65
ClusterControl II
• Handles deployment: on-premise, EC2, or hybrid (Rackspace,
etc.)
• Adding HAProxy as a Galera load balancer
• Hot backups, online software upgrades
• Workload simulation
• Monitoring (real-time), health reports
66
Orchestrator
• Reads replication topologies, keeps state,
continuous polling
• Modify your topology — move slaves around
• Nice GUI, JSON API, CLI
67
MySQL MHA
• Like MMM, specialized solution for MySQL replication
• Developed by Yoshinori Matsunobu at DeNA
• Automated and manual failover options
• Topology: 1 master, many slaves
• Choose new master by comparing slave binlog positions
• Can be used in conjunction with other solutions
• http://code.google.com/p/mysql-master-ha/
68
Cluster suites
• Heartbeat, Pacemaker, Red Hat Cluster Suite
• Generic, can be used to cluster any server daemon
• Usually used in conjunction with Shared Disk or Replicated
Disk solutions (preferred)
• Can be used with replication.
• Robust, Node Fencing / STONITH
69
Pacemaker
• Heartbeat, Corosync, Pacemaker
• Resource Agents, Percona-PRM
• Percona Replication Manager - cluster, geographical disaster recovery
options
• Pacemaker agent specialised on MySQL replication
• https://github.com/percona/percona-pacemaker-agents/
• Pacemaker Resource Agents 3.9.3+ include Percona Replication
Manager (PRM)
70
VM based failover
• VMware, Oracle VM, etc. can migrate / fail over the entire VM
guest
• This isn’t the focus of the talk
71
Load Balancers for multi-master
clusters
• Synchronous multi-master clusters like Galera require load
balancers
• HAProxy
• Galera Load Balancer (GLB)
• MaxScale
• ProxySQL
72
MySQL Fabric
• Framework to manage farms of MySQL servers
• High availability + built-in sharding
• Requires GTID + Fabric-aware connector (PHP, Java, Python, C
in beta)
73
MySQL Router
• Routing between applications and any backend MySQL servers
• Failover
• Load Balancing
• Pluggable architecture (connection routing, Fabric cache)
74
MaxScale
• “Pluggable router” that offers connection & statement based load
balancing
• MaxScale as binlog server @ Booking - to replace intermediate masters
(downloads binlog from master, saves to disk, serves to slave as if
served from master)
• Possibilities are endless - use it for logging, writing to other databases
(besides MySQL), preventing SQL injections via regex filtering, route via
hints, query rewriting, have a binlog relay, etc.
• Load balance your Galera clusters today!
75
ProxySQL
• High Performance MySQL proxy with a GPL license
• Performance is a priority - the numbers prove it
• Can query rewrite
• Sharding by host/schema or both, with rule engine +
modification to SQL + application logic
76
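A minimal read/write-split sketch against the ProxySQL admin interface (default port 6032); hostgroup 0 as writer, hostgroup 1 as reader and the backend IPs are all assumptions for illustration:
$ mysql -u admin -padmin -h 127.0.0.1 -P 6032
mysql> INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (0, '10.0.0.1', 3306);
mysql> INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (1, '10.0.0.2', 3306);
mysql> INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
       VALUES (1, 1, '^SELECT.*FOR UPDATE', 0, 1);
mysql> INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
       VALUES (2, 1, '^SELECT', 1, 1);
mysql> LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
mysql> LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
Query rewriting and sharding rules are configured the same way: insert rows, then LOAD ... TO RUNTIME.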
JDBC/PHP drivers
• JDBC - multi-host failover feature (just specify master/slave
hosts in the properties)
• true for MariaDB Java Connector too
• PHP handles this too - mysqlnd_ms
• Can handle read-write splitting, round robin or random host
selection, and more
77
Clustering: solution or part of
problem?
• "Causes of Downtime in Production MySQL Servers" whitepaper,
Baron Schwartz, VividCortex
• Human error
• SAN
• Clustering framework + SAN = more problems
• Galera is replication based, has no false positives as there’s no
“failover” moment, you don’t need a clustering framework (JDBC or
PHP can load balance), and is relatively elegant overall
78
InnoDB based?
• Use InnoDB, continue using InnoDB, know workarounds to
InnoDB
• All solutions but NDB are InnoDB. NDB is great for telco/
session management for high bandwidth sites, but setup,
maintenance, etc. is complex
79
Replication type
• Competence choices
• Replication: MySQL DBA manages
• DRBD: Linux admin manages
• SAN: requires domain controller
• Operations
• DRBD (disk level) = cold standby = longer
failover
• Replication = hot standby = shorter failover
• GTID helps tremendously
• Performance
• SAN has higher latency than local disk
• DRBD has higher latency than local disk
• Replication has little overhead
• Redundancy
• Shared disk = SPoF
• Shared nothing = redundant
80
SBR vs RBR? Async vs sync?
• row based: deterministic
• statement based: dangerous
• GTID: easier setup & failover of complex topologies
• async: data loss in failover
• sync: best
• multi-threaded slaves: scalability (hello 5.6+, Tungsten)
81
Conclusions for choice
• Simpler is better
• MySQL replication > DRBD > SAN
• Sync replication = no data loss
• Async replication = no latency (WAN)
• Sync multi-master = no failover required
• Multi-threaded slaves help in disk-bound workloads
• GTID increases operational usability
• Galera provides all this with good performance & stability
82
Deep-dive: MHA
83
Why MHA needs coverage
• High Performance MySQL, 3rd Edition
• Published: March 16 2012
84
Where did MHA come from?
• DeNA won 2011 MySQL
Community Contributor of the
Year (April 2011)
• MHA came in about 3Q/2011
• Written by Yoshinori Matsunobu,
Oracle ACE Director
85
What is MHA?
• MHA for MySQL: Master High Availability Manager tools for
MySQL
• Goal: automating master failover & slave promotion with
minimal downtime
• Set of Perl scripts
• http://code.google.com/p/mysql-master-ha/
86
Why MHA?
• Automating monitoring of your replication topology for master failover
• Scheduled online master switching to a different host for online maintenance
• Switch back after OPTIMIZE/ALTER table, software or hardware upgrade
• Schema changes without stopping services
• pt-online-schema-change, oak-online-alter-table, Facebook OSC, Github
gh-ost
• Interactive/non-interactive master failover (just for failover, with detection of
master failure + VIP takeover to Pacemaker)
87
Why is master failover hard?
• When master fails, no more writes
till failover complete
• MySQL replication is
asynchronous (MHA works with
async + semi-sync replication)
• slave2 is latest, slave1+3 have
missing events, MHA does:
• copy id=10 from master if possible
• apply all missing events
88
MHA:Typical scenario
• Monitor replication topology
• If failure detected on master, immediately switch to a candidate
master or the most current slave to become new master
• MHA must fail to connect to master server three times
• CHANGE MASTER for all slaves to new master
• Print (stderr)/email report, stop monitoring
89
So really, what does MHA do?
90
Typical timeline
• Usually no more than 10-30 seconds
• 0-10s: Master failover detected in around 10 seconds
• (optional) check connectivity via secondary network
• (optional) 10-20s: 10 seconds to power off master
• 10-20s: apply differential relay logs to new master
• Practice: 4s @ DeNA, usually less than 10s
91
How does MHA work?
• Save binlog events from crashed master
• Identify latest slave
• Apply differential relay log to other slaves
• Apply saved binlog events from master
• Promote a slave to new master
• Make other slaves replicate from new master
92
Getting Started
• MHA requires no changes to
your application
• You should, of course, write to
a virtual IP (VIP) for your
master
• MHA does not build
replication environments for
you - that’s DIY
93
MHA Node
• Download mha4mysql-node & install this on all machines:
master, slaves, monitor
• Packages (DEB, RPM) available
• Manually, make sure you have DBD::mysql & ensure it knows
the path of your MySQL
94
MHA Manager server
• Monitor server doesn’t have to be powerful at all, just remain
up
• This is a single-point-of-failure so monitor the manager server
where MHA Manager gets installed
• If MHA Manager isn’t running, your app still runs, but
automated failover is now disabled
95
MHA Manager
• You must install mha4mysql-node then mha4mysql-
manager
• Manager server has many Perl dependencies: DBD::mysql,
Config::Tiny, Log::Dispatch,
Parallel::ForkManager, Time::HiRes
• Package management fixes dependencies, else use CPAN
96
Configuring MHA
• Application configuration file: see samples/conf/
app1.cnf
• Place this in /etc/MHA/app1.cnf
• Global configuration file: see /etc/MHA/
masterha_default.cnf (see samples/conf/
masterha_default.cnf)
97
app1.cnf
[server default]
manager_workdir=/var/log/masterha/app1
manager_log=/var/log/masterha/app1/manager.log
[server1]
hostname=host1
[server2]
hostname=host2
candidate_master=1
[server3]
hostname=host3
[server4]
hostname=host4
no_master=1
98
• No need to specify which host is the master; MHA auto-detects this.
• candidate_master=1 sets priority, but doesn't necessarily mean that host gets promoted by default (say it's too far behind in replication). But maybe this is a more powerful box, or has a better setup.
• no_master=1: will never be the master. RAID0 instead of RAID1+0? Slave in another data centre?
masterha_default.cnf
[server default]
user=root
password=rootpass
ssh_user=root
master_binlog_dir=/var/lib/mysql,/var/log/mysql
remote_workdir=/data/log/masterha
ping_interval=3
# secondary_check_script=masterha_secondary_check -s remote_host1 -s remote_host2
# master_ip_failover_script=/script/masterha/master_ip_failover
# shutdown_script=/script/masterha/power_manager
# report_script=/script/masterha/send_report
# master_ip_online_change_script=/script/masterha/master_ip_online_change
99
• secondary_check_script: check master activity from manager -> remote_hostN -> master (multiple hosts to ensure it's not a network issue)
• shutdown_script: STONITH (power off the dead master)
MHA uses SSH
• MHA uses SSH actively; passphraseless login
• In theory, only require Manager SSH to all nodes
• However, remember masterha_secondary_check
•masterha_check_ssh --conf=/etc/MHA/app1.cnf
100
Check replication
•masterha_check_repl --conf=/etc/MHA/app1.cnf
• If you don’t see MySQL Replication Health is OK, MHA will
fail
• Common errors? Master binlog in different position, read
privileges on binary/relay log not granted, using multi-master
replication w/o read-only=1 set (only 1 writable master
allowed)
101
MHA Manager
•masterha_manager --conf=/etc/MHA/app1.cnf
• Logs are printed to stderr by default, set manager_log
• Recommended running with nohup, or daemontools
(preferred in production)
• http://code.google.com/p/mysql-master-ha/wiki/Runnning_Background
102
So, the MHA Playbook
• Install MHA node, MHA manager
•masterha_check_ssh --conf=/etc/app1.cnf
•masterha_check_repl --conf=/etc/app1.cnf
•masterha_manager --conf=/etc/app1.cnf
• That’s it!
103
master_ip_failover_script
• Pacemaker can monitor & take over the VIP if required
• Can use a catalog database
• map between application name + writer/reader IPs
• A shared VIP is easy to implement with minimal changes to
master_ip_failover itself (however, use shutdown_script
to power off the machine)
104
master_ip_online_change
• Similar to master_ip_failover script, but used for online
maintenance
•masterha_master_switch --master_state=alive
• MHA executes FLUSH TABLES WITH READ LOCK after
the writing freeze
105
Test the failover
•masterha_check_status --conf=/etc/MHA/app1.cnf
• Kill MySQL (kill -9, shutdown server, kernel panic)
• MHA should go thru failover (stderr)
• parse the log as well
• Upon completion, it stops running
106
masterha_master_switch
• Manual failover
•--master_state=dead
• Scheduled online master switchover
• Great for upgrades to server, etc.
•masterha_master_switch --master_state=alive --conf=/etc/MHA/app1.cnf --new_master_host=host2
107
Handling VIPs
my $vip = '192.168.0.1/24';
my $key = '0';
my $ssh_start_vip = "sudo /sbin/ifconfig eth0:$key $vip";
my $ssh_stop_vip  = "sudo /sbin/ifconfig eth0:$key down";
...
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
108
Integration with other HA
solutions
• Pacemaker
• on RHEL6, you need some HA add-on, just use the CentOS packages
• /etc/ha.d/haresources to configure the VIP
•`masterha_master_switch --master_state=dead --interactive=0 --wait_on_failover_error=0 --dead_master_host=host1 --new_master_host=host2`
• Corosync + Pacemaker works well
109
What about replication delay?
• By default, MHA checks to see if slave is behind master. By
more than 100MB, it is never a candidate master
• If you have candidate_master=1 set, consider setting
check_repl_delay=0
• You can integrate it with pt-heartbeat from Percona
Toolkit
110
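A sketch of the pt-heartbeat pairing: a daemon keeps updating a heartbeat row on the master, and the monitor against the slave reports true replication delay instead of Seconds_Behind_Master; the percona database name and the credentials are placeholders:
# on the master
$ pt-heartbeat --daemonize --update --create-table -D percona h=localhost,u=...,p=...
# against the slave (or from the MHA manager host)
$ pt-heartbeat --monitor -D percona h=slave1,u=...,p=...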
MHA deployment tips
• You really should install this as root
• SSH needs to work across all hosts
• If you don’t want plaintext passwords in config files, use init_conf_load_script
• Each monitor can monitor multiple MHA pairs (hence app1, app2, etc.)
• You can have a standby master; make sure it's read-only
• By default, master1->master2->slave3 doesn’t work
• MHA manages master1->master2 without issue
• Use multi_tier_slave=1 option
• Make sure replication user exists on candidate master too!
111
Consul
• Service discovery & configuration. Distributed, highly available,
data centre aware
• Comes with its own built-in DNS server, KV storage with
HTTP API
• Raft Consensus Algorithm
112
MHA + Consul
113
VIPs vs Consul
• Previously, you handled VIPs and had to write master_ip_online_change/master_ip_failover scripts
• system("curl -X PUT -d '{\"Node\": \"master\"}' localhost:8500/v1/catalog/deregister");
• system("curl -X PUT -d '{\"Node\": \"master\", \"Address\": \"$new_master_host\"}' localhost:8500/v1/catalog/register");
114
mysqlfailover
• mysqlfailover from mysql-utilities using GTIDs in 5.6
• target topology: 1 master, n-slaves
• enable: log-slave-updates, report-host, report-port, master-info-table=TABLE
• modes: elect (choose candidate from list), auto (default), fail
• --discover-slaves-login for topology discovery
• monitoring node: SPoF
• Errant transactions prevent failover!
• Restart node? Rejoins replication topology, as a slave.
115
MariaDB 10
• New slave: SET GLOBAL GTID_SLAVE_POS = BINLOG_GTID_POS("master-bin.00024", 1600);
  CHANGE MASTER TO master_host="10.2.3.4", master_use_gtid=slave_pos; START SLAVE;
• Use GTID: STOP SLAVE;
  CHANGE MASTER TO master_use_gtid=current_pos; START SLAVE;
• Change master: STOP SLAVE;
  CHANGE MASTER TO master_host="10.2.3.5"; START SLAVE;
116
Where is MHA used
• DeNA
• Premaccess (Swiss HA hosting company)
• Ireland’s national TV & radio service
• Jetair Belgium (MHA + MariaDB!)
• Samsung
• SK Group
• DAPA
117
MHA 0.56 is current
• Major release: MHA 0.56 April 1 2014 (0.55:
December 12 2012)
• http://code.google.com/p/mysql-master-ha/wiki/ReleaseNotes
118
MHA 0.56
• 5.6 GTID: GTID + auto position enabled? Failover with GTID SQL syntax
not relay log failover
• MariaDB 10+ still needs work
• MySQL 5.6 support for checksum in binlog events + multi-threaded slaves
• mysqlbinlog and mysql in custom locations (configurable clients)
• binlog streaming server supported
119
MHA 0.56
• ping_type = INSERT (for master connectivity checks - assuming master
isn’t accepting writes)
120
(MariaDB) Replication Manager
• Support for MariaDB Server GTIDs
• Single, portable 12MB binary
• Interactive GTID monitoring
• Supports failover or switchover based on requests
• Topology detection
• Health checks
121
Is fully automated
failover a good idea?
• False alarms
• Can cause short downtime, restarting all write connections
• Repeated failover
• Problem not fixed? Master overloaded?
• MHA ensures a failover doesn’t happen within 8h, unless --last_failover_minute=n is set
• Data loss
• id=103 is latest, relay logs are at id=101, loss
• group commit means sync_binlog=1, innodb_flush_log_at_trx_commit=1 can be enabled! (just wait for
master to recover)
• Split brain
• sometimes poweroff takes a long time
122
Video resources
• Yoshinori Matsunobu talking about High Availability & MHA at Oracle
MySQL day
• http://www.youtube.com/watch?v=CNCALAw3VpU
• Alex Alexander (AccelerationDB) talks about MHA, with an example of
failover, and how it compares to Tungsten
• http://www.youtube.com/watch?v=M9vVZ7jWTgw
• Consul & MHA failover in action
• https://www.youtube.com/watch?v=rA4hyJ-pccU
123
References
• Design document
• http://www.slideshare.net/matsunobu/automated-master-failover
• Configuration parameters
• http://code.google.com/p/mysql-master-ha/wiki/Parameters
• JetAir MHA use case
• http://www.percona.com/live/mysql-conference-2012/sessions/case-study-jetair-dramatically-increasing-uptime-mha
• MySQL binary log
• http://dev.mysql.com/doc/internals/en/binary-log.html
124
125
Service Level Agreements (SLA)
• AWS - 99.95% in a calendar month
• Rackspace - 99.9% in a calendar month
• Google - 99.95% in a calendar month
• SLAs exclude “scheduled maintenance”
• AWS is 30 minutes/week, so really 99.65%
126
RDS: Multi-AZ
• Provides enhanced durability (synchronous data replication)
• Increased availability (automatic failover)
• Warning: can be slow (1-10 mins+)
• Easy GUI administration
• Doesn’t give you another usable “read-replica” though
127
External replication
• MySQL 5.6 you can do RDS -> Non-RDS
• enable backup retention, you now have binlog access
• You still can’t replicate INTO RDS
• use Tungsten Replicator
• also supports going from RDS to Rackspace/etc.
128
High Availability
• Plan for node failures
• Don’t assume node provisioning is quick
• Backup, backup, backup!
• “Bad” nodes exist
• HA is not equal across options - RDS wins so far
129
Unsupported features
• AWS: GTIDs, InnoDB Cache Warming, InnoDB transportable
tablespaces, authentication plugins, semi-sync replication
• Google: UDFs, replication, LOAD DATA INFILE, INSTALL
PLUGIN, SELECT ... INTO OUTFILE
130
Warning: automatic
upgrades
• Regressions happen even with a minor version upgrade in the
MySQL world
• An InnoDB update that modifies a row's PK triggers recursive
behaviour until all disk space is exhausted? Introduced going from 5.5.24 -> 5.5.25 (fixed:
5.5.25a)
• Using the query cache for partitioned tables? Disabled going from 5.5.22 -> 5.5.23!
131
Can you configure MySQL?
• You can't access my.cnf directly
• In AWS you have parameter groups, which allow configuration of MySQL
132
source: http://www.mysqlperformanceblog.com/2013/08/21/amazon-rds-with-mysql-5-6-configuration-variables/
133
Sharding solutions
• Not all data lives in one place
• hash records to partitions
• partition alphabetically? put n-users/shard? organise by postal
codes?
134
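As a toy illustration of "hash records to partitions", the shard number can be derived in plain SQL; 4 shards and the example key are arbitrary assumptions, and the application then routes to the connection for that shard:
mysql> SELECT 42 % 4 AS shard_for_user_id_42;
mysql> SELECT CRC32('alice@example.com') % 4 AS shard_for_email;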
Horizontal vs vertical
135
Horizontal: the same User table (id int(10), username char(15), password char(15), email char(50)) is split across 192.168.0.1, 192.168.0.2 and 192.168.0.3, each server holding a subset of the rows.
Vertical: 192.168.0.1 holds User (id, username, password, email) while 192.168.0.2 holds UserInfo (login datetime, md5 varchar(32), guid varchar(32)).
Better if INSERT heavy and there's less frequently changed data.
How do you shard?
• Use your own sharding framework
• write it in the language of your choice
• simple hashing algorithm that you can devise yourself
• SPIDER
• Tungsten Replicator
• Tumblr JetPants
• Google Vitess
• ShardQuery
136
SPIDER
• storage engine to vertically partition tables
137
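A hedged sketch of what a SPIDER table definition looks like: the local table is just a pointer, with the remote connection parameters carried in the COMMENT string; every host, credential and name below is a placeholder:
mysql> CREATE TABLE user_remote (
         id INT PRIMARY KEY,
         username CHAR(15)
       ) ENGINE=SPIDER
         COMMENT='host "192.168.0.2", port "3306", user "spider", password "...",
                  database "app", table "user"';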
Tungsten Replicator (OSS)
• Each transaction tagged with a Shard ID
• controlled in a file: shard.list, exposed via JMX MBean API
• primary use? geographic replication
• in application, requires changes to use the API to specify shards
used
138
Tumblr JetPants
• clone replicas, rebalance shards, master promotions (can also
use MHA for master promotions)
• Ruby library, range-based sharding scheme
• https://github.com/tumblr/jetpants
• Uses MariaDB as an aggregator node (multi-source replication)
139
Google (YouTube) Vitess
• Servers & tools to scale MySQL for web written in Go
• Has MariaDB support too (*)
• Python client interface
• DML annotation, connection pooling, shard management, workflow
management, zero downtime restarts
• Has become much easier to use: http://vitess.io/ (with the help of Kubernetes)
140
141
Conclusion
• MySQL replication is amazing if you know it (and monitor it)
well enough
• Large sites run just fine with semi-sync + tooling for automated
failover
• Galera Cluster is great for fully synchronous replication
• Don’t forget the need for a load balancer: ProxySQL is nifty
142
Resources
143
Resources II
144
Q&A / Thanks
colin.charles@percona.com / byte@bytebot.net
@bytebot on Twitter | http://bytebot.net/blog/
slides: slideshare.net/bytebot
145

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Último (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Best practices for high availability MySQL

  • 22. MySQL Replication • Statement based generally • Row based became available in 5.1, and the default in 5.7 • mixed-mode, resulting in STATEMENT except if calling • UUID function, UDF, CURRENT_USER/USER function, LOAD_FILE function • 2 or more AUTO_INCREMENT columns updated with same statement • server variable used in statement • storage engine doesn’t allow statement based replication, like NDBCLUSTER 22
  • 23. MySQL Replication II • Asynchronous by default • Semi-synchronous plugin in 5.5+ • However, the holy grail of fully synchronous replication is not part of standard MySQL replication (yet?) • MariaDB Galera Cluster is built into MariaDB Server 10.1 23
  • 24. The logs • Binary log (binlog) - events that describe database changes • Relay log - events read from binlog on master, written by slave i/o thread • master_info_log - status/config info for slave’s connection to master • relay_log_info_log - status info about execution point in slave’s relay log 24
  • 25. Semi-synchronous replication • semi-sync capable slave acknowledges transaction event only after written to relay log & flushed to disk • timeout occurs? master reverts to async replication; resumes when slaves catch up • at scale, Facebook runs semi-sync: http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html 25
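A minimal setup sketch for the stock semi-sync plugins that ship with MySQL 5.5+; the timeout is in milliseconds and the values are illustrative:

    -- on the master
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;
    SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- fall back to async after 1s without an ack

    -- on each semi-sync capable slave
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;
    STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;  -- restart the I/O thread so it registers as semi-sync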
  • 26. Semi-sync II • nowadays, it's enhanced (COMMIT method): 1. prepare transaction in storage engine 2. write transaction to binlog, flush to disk 3. wait for at least one slave to ack binlog event 4. commit transaction to storage engine 26
  • 27. MySQL Replication in 5.6 • Global Transaction ID (GTID) • Server UUID • Ignore (master) server IDs (filtering) • Per-schema multi-threaded slave • Group commit in the binary log • Binary log (binlog) checksums • Crash safe binlog and relay logs • Time delayed replication • Parallel replication (per database) 27
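A hedged my.cnf sketch that switches several of these 5.6 features on together; the values are illustrative, not a recommendation:

    [mysqld]
    # GTID-based replication (5.6 requires binary logging and log_slave_updates on all nodes)
    gtid_mode                 = ON
    enforce_gtid_consistency  = 1
    log_bin                   = mysql-bin
    log_slave_updates         = 1
    binlog_checksum           = CRC32

    # crash-safe binlog and relay logs
    master_info_repository    = TABLE
    relay_log_info_repository = TABLE
    relay_log_recovery        = 1
    sync_binlog               = 1

    # per-schema multi-threaded slave
    slave_parallel_workers    = 4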
  • 28. Group commit in MariaDB 5.3 onwards • Do slow part of prepare() in parallel in InnoDB (first fsync(), InnoDB group commit) • Put transaction in queue, decide commit order 28
  • 29. • First in queue runs serial part for all, rest wait • Wait for access to the binlog • Write transactions into binlog, in order, then sync (second fsync()) • Run the fast part of commit() for all transactions in order 29
  • 30. • Finally, run the slow part of commit() in parallel (third fsync(), InnoDB group commit) • Only 2 context switches per thread (one sleep, one wakeup) • Note: MySQL 5.6, MariaDB 10 only does 2 fsyncs/group commit 30
  • 31. Group commit in MariaDB 10 • Remove commit in slow part of InnoDB commit (stage 4) • Reduce cost of crash-safe binlog • A binlog checkpoint is a point in the binlog where no crash recovery is needed before it. In InnoDB you wait for flush + fsync its redo log for commit 31
  • 32. crash-safe binlog • MariaDB 5.5 checkpoints after every commit —> expensive! • 5.5/5.6 stalls commits around binlog rotate, waiting for all prepared transactions to commit (since crash recovery can only scan latest binlog file) 32
  • 33. crash-safe binlog 10.0 • 10.0 makes binlog checkpoints asynchronous • A binlog can have no checkpoints at all • Ability to scan multiple binlogs during crash recovery • Remove stalls around binlog rotates 33
  • 34. group commit in 10.1 • Tricky locking issues hard to change without getting deadlocks sometimes • mysql#68251, mysql#68569 • New code? Binlog rotate in background thread (further reducing stalls). Split transactions across binlogs, so big transactions do not lead to big binlog files • Works with enhanced semi-sync replication (wait for slave before commit on the master rather than after commit) 34
  • 35. Replication: START TRANSACTION WITH CONSISTENT SNAPSHOT • Works with the binlog, possible to obtain the binlog position corresponding to a transactional snapshot of the database without blocking any other queries. • by-product of group commit in the binlog to view commit ordering • Used by the command mysqldump --single-transaction --master-data to do a fully non-blocking backup • Works consistently between transactions involving more than one storage engine • https://kb.askmonty.org/en/enhancements-for-start-transaction-with-consistent/ • Percona Server made it better, by session ID, and also introducing backup locks 35
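For example, a fully non-blocking logical backup that also records the master's binlog coordinates (the output file name and option set are illustrative):

    mysqldump --single-transaction --master-data=2 --routines --triggers \
      --all-databases > backup-$(date +%F).sql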
  • 36. Multi-source replication • Multi-source replication - (real-time) analytics, shard provisioning, backups, etc. • @@default_master_connection contains current connection name (used if connection name is not given) • All master/slave commands take a connection name now (like CHANGE MASTER “connection_name”, SHOW SLAVE “connection_name” STATUS, etc.) 36
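A minimal MariaDB multi-source sketch; the connection names, hosts and credentials below are placeholders:

    CHANGE MASTER 'master1' TO MASTER_HOST='10.0.0.1', MASTER_USER='repl',
      MASTER_PASSWORD='...', MASTER_USE_GTID=slave_pos;
    CHANGE MASTER 'master2' TO MASTER_HOST='10.0.0.2', MASTER_USER='repl',
      MASTER_PASSWORD='...', MASTER_USE_GTID=slave_pos;
    START ALL SLAVES;
    SHOW SLAVE 'master1' STATUS\G
    SET @@default_master_connection = 'master2';
    SHOW SLAVE STATUS\G   -- now refers to the 'master2' connection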
  • 37. Global Transaction ID (GTID) • Supports multi-source replication • GTID can be enabled or disabled independently and online for masters or slaves • Slaves using GTID do not have to have binary logging enabled. • (MariaDB) Supports multiple replication domains (independent binlog streams) • Queries in different domains can be run in parallel on the slave. 37
  • 38. Why is MariaDB GTID different compared to 5.6? • MySQL 5.6 GTID does not support multi-source replication • Supports --log-slave-updates=0 for efficiency • Enabled by default • Turn it on without having to restart the topology 38
  • 39. Crash-safe slave (w/InnoDB DML) • Replace non-transactional file relay_log.info with transactional mysql.rpl_slave_state • Changes to rpl_slave_state are transactionally recovered after crash along with user data. 39
  • 40. Crash-safe slaves in 5.6? • Not using GTID • you can put relay-log.info into InnoDB table, that gets updated along w/trxn • Using GTID • relay-log.info not used. Slave position stored in binlog on slave (--log-slave-updates required) • Using parallel replication • Uses a different InnoDB table for this use case 40
  • 41. Replication domains • Keep central concept that replication is just applying events in-order from a serial binlog stream. • Allow multi-source replication with multiple active masters • Lets the DBA configure multiple independent binlog streams (one per active master: mysqld --gtid-domain-id=#) • Events within one stream are ordered the same across entire replication topology • Events between different streams can be in different order on different servers • Binlog position is one ID per replication domain 41
  • 42. Parallel replication • Multi-source replication from different masters executed in parallel • Queries from different domains are executed in parallel • Queries that are run in parallel on the master are run in parallel on the slave (based on group commit). • Transactions modifying the same table can be updated in parallel on the slave! • Supports both statement based and row based replication. 42
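A rough MariaDB 10 slave configuration for this; the thread counts are illustrative and should be tuned to the workload:

    [mysqld]
    slave_parallel_threads        = 8   # pool of applier threads on the slave
    slave_domain_parallel_threads = 4   # cap per replication domain (multi-source)
    # MariaDB 10.1 adds slave_parallel_mode (conservative/optimistic/aggressive)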
  • 43. All in… sometimes it can get out of sync • Changed information on slave directly • Statement based replication • non-deterministic SQL (UPDATE/DELETE with LIMIT and without ORDER BY) • triggers & stored procedures • Master in MyISAM, slave in InnoDB (deadlocks) • --replicate-ignore-db with fully qualified queries • Binlog corruption on master • PURGE BINARY LOGS issued and not enough files to update slave • read_buffer_size larger than max_allowed_packet • Bugs? 43
  • 44. Replication Monitoring • Percona Toolkit is important • pt-slave-find: find slave information from master • pt-table-checksum: online replication consistency check • executes checksum queries on master • pt-table-sync: synchronise table data efficiently • changes data, so backups important 44
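Typical invocations look roughly like this; the hosts, credentials and checksum schema are placeholders:

    # run against the master; checksum queries replicate to the slaves
    pt-table-checksum --replicate=percona.checksums h=master1,u=checksum_user,p=secret

    # print (do not yet execute) the statements that would fix a drifted slave
    pt-table-sync --replicate percona.checksums h=master1,u=checksum_user,p=secret --print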
  • 46. Statement Based Replication Binlog $ mysqlbinlog mysql-bin.000001 # at 3134 #140721 13:59:57 server id 1 end_log_pos 3217 CRC32 0x974e3831 Query thread_id=9 exec_time=0 error_code=0 SET TIMESTAMP=1405943997/*!*/; BEGIN /*!*/; # at 3217 #140721 13:59:57 server id 1 end_log_pos 3249 CRC32 0x8de28161 Intvar SET INSERT_ID=2/*!*/; # at 3249 #140721 13:59:57 server id 1 end_log_pos 3370 CRC32 0x121ef29f Query thread_id=9 exec_time=0 error_code=0 SET TIMESTAMP=1405943997/*!*/; insert into auto (data) values ('a test 2') /*!*/; # at 3370 #140721 13:59:57 server id 1 end_log_pos 3401 CRC32 0x34354945 Xid = 414 COMMIT/*!*/; 46
  • 47. Dynamic replication variable control • SET GLOBAL binlog_format=‘STATEMENT’ | ‘ROW’ | ‘MIXED’ • Can also be set as a session level • Dynamic replication filtering variables on MariaDB 5.3+ 47
  • 48. Row based replication event > mysqlbinlog mysql-bin.* # at 3401 #140721 14:03:59 server id 1 end_log_pos 3477 CRC32 0xa37f424a Query thread_id=9 exec_time=0 error_code=0 SET TIMESTAMP=1405944239.559237/*!*/; BEGIN /*!*/; # at 3477 #140721 14:03:59 server id 1 end_log_pos 3529 CRC32 0xf4587de5 Table_map: `demo`.`auto` mapped to number 80 # at 3529 #140721 14:03:59 server id 1 end_log_pos 3585 CRC32 0xbfd73d98 Write_rows: table id 80 flags: STMT_END_F BINLOG ' rwHNUxMBAAAANAAAAMkNAAAAAFAAAAAAAAEABGRlbW8ABGF1dG8AAwMRDwMGZAAE5X1Y9A== rwHNUx4BAAAAOAAAAAEOAAAAAFAAAAAAAAEAAgAD//gDAAAAU80BrwiIhQhhIHRlc3QgM5g9178= '/*!*/; # at 3585 #140721 14:03:59 server id 1 end_log_pos 3616 CRC32 0x5f422fed Xid = 416 COMMIT/*!*/; 48
  • 49. mysqlbinlog versions • ERROR: Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 56, event_type: 30 • 5.6 ships with a “streaming binlog backup server” - v.3.4; MariaDB 10 doesn’t - v.3.3 (fixed in 10.2 - MDEV-8713) • GTID variances! 49
  • 50. GTID 50 # at 471 #140721 14:20:01 server id 1 end_log_pos 519 CRC32 0x209d8843 GTID [commit=yes] SET @@SESSION.GTID_NEXT= 'ff89bf58-105e-11e4-b2f1-448a5b5dd481:2'/*!*/; # at 519 #140721 14:20:01 server id 1 end_log_pos 602 CRC32 0x5c798741 Query thread_id=3 exec_time=0 error_code=0 SET TIMESTAMP=1405945201.329607/*!*/; BEGIN /*!*/; # at 602 # at 634 #140721 14:20:01 server id 1 end_log_pos 634 CRC32 0xa5005598 Intvar SET INSERT_ID=5/*!*/; #140721 14:20:01 server id 1 end_log_pos 760 CRC32 0x0b701850 Query thread_id=3 exec_time=0 error_code=0 SET TIMESTAMP=1405945201.329607/*!*/; insert into auto (data) values ('a test 5 gtid') /*!*/; # at 760 #140721 14:20:01 server id 1 end_log_pos 791 CRC32 0x497a23e0 Xid = 31 COMMIT/*!*/;
  • 51. SHOW SLAVE STATUS mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: server1 Master_User: repluser Master_Port: 3306 ... Master_Log_File: server1-binlog.000008 <- io_thread (read) Read_Master_Log_Pos: 436614719 <- io_thread (read) Relay_Log_File: server2-relaylog.000007 <- io_thread (write) Relay_Log_Pos: 236 <- io_thread (write) Relay_Master_Log_File: server1-binlog.000008 <- sql_thread Slave_IO_Running: Yes Slave_SQL_Running: Yes ... Exec_Master_Log_Pos: 436614719 <- sql_thread ... Seconds_Behind_Master: 0 51
  • 52. Slave prefetching • Replication Booster • https://github.com/yoshinorim/replication-booster-for-mysql • Prefetch MySQL relay logs to make the SQL thread faster • Tungsten has slave prefetch • Percona Server till 5.6 + MariaDB till 10.1 have InnoDB fake changes 52
  • 53. What replaces slave prefetching? • In Percona Server 5.7, slave prefetching has been replaced by doing intra-schema parallel replication • Feature removed from XtraDB • MariaDB Server 10.2 will also have this feature removed 53
  • 54. Tungsten Replicator • Replaces MySQL Replication layer • MySQL writes binlog, Tungsten reads it and uses its own replication protocol • Global Transaction ID • Per-schema multi-threaded slave • Heterogeneous replication: MySQL <-> MongoDB <-> PostgreSQL <-> Oracle • Multi-master replication • Multiple masters to single slave (multi-source replication) • Many complex topologies • Continuent Tungsten (Enterprise) vs Tungsten Replicator (Open Source) 54
  • 55. In today’s world, what does it offer? • opensource MySQL <-> Oracle replication to aid in your migration • automatic failover without MHA • multi-master with cloud topologies too • Oracle <-> Oracle replication (this is Golden Gate for FREE) • Replication from MySQL to MongoDB • Data loading into Hadoop 55
  • 56. Galera Cluster • Inside MySQL, a replication plugin (wsrep) • Replaces MySQL replication (but can work alongside it too) • True multi-master, active-active solution • Synchronous • WAN performance: 100-300ms/commit, works in parallel • No slave lag or integrity issues • Automatic node provisioning 56
  • 57. 57
  • 58. Percona XtraDB Cluster 5.7 • Engineering within Percona • Load balancing with ProxySQL (bundled) • PMM integration • Benefits of all the MySQL 5.7 feature-set 58
  • 59. Group replication • Fully synchronous replication (update everywhere), self-healing, with elasticity, redundancy • Single primary mode supported • MySQL InnoDB Cluster - a combination of group replication, Router, to make magic! 59
  • 60. MySQL NDBCLUSTER • 3 types of nodes: SQL, data and management • MySQL node provides interface to data. Alternate APIs available: LDAP, memcached, native NDBAPI, node.js • Data nodes (NDB storage) • different to InnoDB • transactions synchronously written to 2 nodes (or more) - replicas • transparent sharding: partitions = data nodes/replicas • automatic node provisioning, online re-partitioning • High performance: 1 billion updates / minute 60
  • 61. Summary of Replication Performance • SAN has "some" latency overhead compared to local disk. Can be great for throughput. • DRBD = 50% performance penalty • Replication, when implemented correctly, has no performance penalty • But MySQL replication with disk bound data set has single-threaded issues! • Semi-sync is poorer on WAN compared to async • Galera & NDB provide read/write scale-out, thus more performance 61
  • 62. Handling failure • How do we find out about failure? • Polling, monitoring, alerts... • Error returned to and handled in client side • What should we do about it? • Direct requests to the spare nodes (or DCs) • How to protect data integrity? • Master-slave is unidirectional: Must ensure there is only one master at all times. • DRBD and SAN have cold-standby: Must mount disks and start mysqld. • In all cases must ensure that 2 disconnected replicas cannot both commit independently. (split brain) 62
  • 63. Frameworks to handle failure • MySQL-MMM • Severalnines ClusterControl • Orchestrator • MySQL MHA • Percona Replication Manager • Tungsten Replicator • 5.6: mysqlfailover, mysqlrpladmin • (MariaDB) Replication Manager 63
  • 64. MySQL-MMM • You have to setup all nodes and replication manually • MMM gives Monitoring + Automated and manual failover on top • Architecture consists of Monitor and Agents • Typical topology: • 2 master nodes • Read slaves replicate from each master • If a master dies, all slaves connected to it are stale • http://mysql-mmm.org/ 64
  • 65. Severalnines ClusterControl • Started as automated deployment of MySQL NDB Cluster • now: 4 node cluster up and running in 5 min! • Now supports • MySQL replication and Galera • Semi-sync replication • Automated failover • Manual failovers, status check, start & stop of node, replication, full cluster... from single command line. • Monitoring • Topology: Pair of semi-sync masters, additional read-only slaves • Can move slaves to new master • http://severalnines.com/ 65
  • 66. ClusterControl II • Handles deployment: on-premise, EC2, or hybrid (Rackspace, etc.) • Adding HAProxy as a Galera load balancer • Hot backups, online software upgrades • Workload simulation • Monitoring (real-time), health reports 66
  • 67. Orchestrator • Reads replication topologies, keeps state, continuous polling • Modify your topology — move slaves around • Nice GUI, JSON API, CLI 67
  • 68. MySQL MHA • Like MMM, specialized solution for MySQL replication • Developed by Yoshinori Matsunobu at DeNA • Automated and manual failover options • Topology: 1 master, many slaves • Choose new master by comparing slave binlog positions • Can be used in conjunction with other solutions • http://code.google.com/p/mysql-master-ha/ 68
  • 69. Cluster suites • Heartbeat, Pacemaker, Red Hat Cluster Suite • Generic, can be used to cluster any server daemon • Usually used in conjunction with Shared Disk or Replicated Disk solutions (preferred) • Can be used with replication. • Robust, Node Fencing / STONITH 69
  • 70. Pacemaker • Heartbeat, Corosync, Pacemaker • Resource Agents, Percona-PRM • Percona Replication Manager - cluster, geographical disaster recovery options • Pacemaker agent specialised on MySQL replication • https://github.com/percona/percona-pacemaker-agents/ • Pacemaker Resource Agents 3.9.3+ include Percona Replication Manager (PRM) 70
  • 71. VM based failover • VMware, Oracle VM, etc. can migrate / fail over the entire VM guest • This isn't the focus of the talk 71
  • 72. Load Balancers for multi-master clusters • Synchronous multi-master clusters like Galera require load balancers • HAProxy • Galera Load Balancer (GLB) • MaxScale • ProxySQL 72
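A minimal HAProxy sketch for fronting a three-node Galera cluster, assuming the usual clustercheck health script is exposed over HTTP on port 9200 of each node (addresses are placeholders):

    listen galera_cluster
        bind 0.0.0.0:3306
        mode tcp
        balance leastconn
        option httpchk
        server node1 10.0.0.1:3306 check port 9200 inter 2000 rise 2 fall 3
        server node2 10.0.0.2:3306 check port 9200 inter 2000 rise 2 fall 3
        server node3 10.0.0.3:3306 check port 9200 inter 2000 rise 2 fall 3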
  • 73. MySQL Fabric • Framework to manage farms of MySQL servers • High availability + built-in sharding • Requires GTID + Fabric-aware connector (PHP, Java, Python, C in beta) 73
  • 74. MySQL Router • Routing between applications and any backend MySQL servers • Failover • Load Balancing • Pluggable architecture (connection routing, Fabric cache) 74
  • 75. MaxScale • “Pluggable router” that offers connection & statement based load balancing • MaxScale as binlog server @ Booking - to replace intermediate masters (downloads binlog from master, saves to disk, serves to slave as if served from master) • Possibilities are endless - use it for logging, writing to other databases (besides MySQL), preventing SQL injections via regex filtering, route via hints, query rewriting, have a binlog relay, etc. • Load balance your Galera clusters today! 75
  • 76. ProxySQL • High Performance MySQL proxy with a GPL license • Performance is a priority - the numbers prove it • Can query rewrite • Sharding by host/schema or both, with rule engine + modification to SQL + application logic 76
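A hedged sketch of a basic read/write split through the ProxySQL admin interface (port 6032); the hostgroup numbers and addresses are placeholders:

    -- writers in hostgroup 10, readers in hostgroup 20
    INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10, '10.0.0.1', 3306);
    INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (20, '10.0.0.2', 3306);

    -- route SELECTs to the readers, except SELECT ... FOR UPDATE
    INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
      VALUES (1, 1, '^SELECT.*FOR UPDATE', 10, 1);
    INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
      VALUES (2, 1, '^SELECT', 20, 1);

    LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
    LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;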
  • 77. JDBC/PHP drivers • JDBC - multi-host failover feature (just specify master/slave hosts in the properties) • true for MariaDB Java Connector too • PHP handles this too - mysqlnd_ms • Can handle read-write splitting, round robin or random host selection, and more 77
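For instance, Connector/J's replication URL lists the master first and the slaves after it; reads go to the slaves once the connection is marked read-only (hosts and database name are placeholders):

    jdbc:mysql:replication://master1:3306,slave1:3306,slave2:3306/appdb
    jdbc:mariadb:replication://master1:3306,slave1:3306,slave2:3306/appdb   # MariaDB Java Connector equivalent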
  • 78. Clustering: solution or part of problem? • "Causes of Downtime in Production MySQL Servers" whitepaper, Baron Schwartz, VividCortex • Human error • SAN • Clustering framework + SAN = more problems • Galera is replication based, has no false positives as there's no "failover" moment, you don't need a clustering framework (JDBC or PHP can load balance), and is relatively elegant overall 78
  • 79. InnoDB based? • Use InnoDB, continue using InnoDB, know workarounds to InnoDB • All solutions but NDB are InnoDB. NDB is great for telco/ session management for high bandwidth sites, but setup, maintenance, etc. is complex 79
  • 80. Replication type • Competence choices • Replication: MySQL DBA manages • DRBD: Linux admin manages • SAN: requires domain controller • Operations • DRBD (disk level) = cold standby = longer failover • Replication = hot standby = shorter failover • GTID helps tremendously • Performance • SAN has higher latency than local disk • DRBD has higher latency than local disk • Replication has little overhead • Redundancy • Shared disk = SPoF • Shared nothing = redundant 80
  • 81. SBR vs RBR? Async vs sync? • row based: deterministic • statement based: dangerous • GTID: easier setup & failover of complex topologies • async: data loss in failover • sync: best • multi-threaded slaves: scalability (hello 5.6+,Tungsten) 81
  • 82. Conclusions for choice • Simpler is better • MySQL replication > DRBD > SAN • Sync replication = no data loss • Async replication = no latency (WAN) • Sync multi-master = no failover required • Multi-threaded slaves help in disk-bound workloads • GTID increases operational usability • Galera provides all this with good performance & stability 82
  • 84. Why MHA needs coverage • High Performance MySQL, 3rd Edition • Published: March 16 2012 84
  • 85. Where did MHA come from? • DeNA won 2011 MySQL Community Contributor of the Year (April 2011) • MHA came in about 3Q/2011 • Written by Yoshinori Matsunobu, Oracle ACE Director 85
  • 86. What is MHA? • MHA for MySQL: Master High Availability Manager tools for MySQL • Goal: automating master failover & slave promotion with minimal downtime • Set of Perl scripts • http://code.google.com/p/mysql-master-ha/ 86
  • 87. Why MHA? • Automating monitoring of your replication topology for master failover • Scheduled online master switching to a different host for online maintenance • Switch back after OPTIMIZE/ALTER table, software or hardware upgrade • Schema changes without stopping services • pt-online-schema-change, oak-online-alter-table, Facebook OSC, Github gh-ost • Interactive/non-interactive master failover (just for failover, with detection of master failure +VIP takeover to Pacemaker) 87
  • 88. Why is master failover hard? • When master fails, no more writes till failover complete • MySQL replication is asynchronous (MHA works with async + semi-sync replication) • slave2 is latest, slave1+3 have missing events, MHA does: • copy id=10 from master if possible • apply all missing events 88
  • 89. MHA: Typical scenario • Monitor replication topology • If failure detected on master, immediately switch to a candidate master or the most current slave to become new master • MHA must fail to connect to master server three times • CHANGE MASTER for all slaves to new master • Print (stderr)/email report, stop monitoring 89
  • 90. So really, what does MHA do? 90
  • 91. Typical timeline • Usually no more than 10-30 seconds • 0-10s: Master failover detected in around 10 seconds • (optional) check connectivity via secondary network • (optional) 10-20s: 10 seconds to power off master • 10-20s: apply differential relay logs to new master • Practice: 4s @ DeNA, usually less than 10s 91
  • 92. How does MHA work? • Save binlog events from crashed master • Identify latest slave • Apply differential relay log to other slaves • Apply saved binlog events from master • Promote a slave to new master • Make other slaves replicate from new master 92
  • 93. Getting Started • MHA requires no changes to your application • Your application should of course write to a virtual IP (VIP) for the master • MHA does not build replication environments for you - that's DIY 93
  • 94. MHA Node • Download mha4mysql-node & install this on all machines: master, slaves, monitor • Packages (DEB, RPM) available • Manually, make sure you have DBD::mysql & ensure it knows the path of your MySQL 94
  • 95. MHA Manager server • Monitor server doesn’t have to be powerful at all, just remain up • This is a single-point-of-failure so monitor the manager server where MHA Manager gets installed • If MHA Manager isn’t running, your app still runs, but automated failover is now disabled 95
  • 96. MHA Manager • You must install mha4mysql-node then mha4mysql-manager • Manager server has many Perl dependencies: DBD::mysql, Config::Tiny, Log::Dispatch, Parallel::ForkManager, Time::HiRes • Package management fixes dependencies, else use CPAN 96
  • 97. Configuring MHA • Application configuration file: see samples/conf/app1.cnf • Place this in /etc/MHA/app1.cnf • Global configuration file: see /etc/MHA/masterha_default.cnf (see samples/conf/masterha_default.cnf) 97
  • 98. app1.cnf

    [server default]
    manager_workdir=/var/log/masterha/app1
    manager_log=/var/log/masterha/app1/manager.log

    # no need to specify the master - MHA auto-detects it
    [server1]
    hostname=host1

    # candidate_master sets priority, but doesn't necessarily mean it gets promoted
    # by default (say it's too far behind in replication); maybe this is a more
    # powerful box, or has a better setup
    [server2]
    hostname=host2
    candidate_master=1

    [server3]
    hostname=host3

    # will never be the master: RAID0 instead of RAID1+0? slave in another data centre?
    [server4]
    hostname=host4
    no_master=1

    98
  • 99. masterha_default.cnf

    [server default]
    user=root
    password=rootpass
    ssh_user=root
    master_binlog_dir=/var/lib/mysql,/var/log/mysql
    remote_workdir=/data/log/masterha
    ping_interval=3
    # secondary_check_script=masterha_secondary_check -s remote_host1 -s remote_host2
    # master_ip_failover_script=/script/masterha/master_ip_failover
    # shutdown_script=/script/masterha/power_manager
    # report_script=/script/masterha/send_report
    # master_ip_online_change_script=/script/masterha/master_ip_online_change

    (secondary_check_script checks master activity from manager -> remote_hostN -> master, using multiple hosts to ensure it's not just a network issue; shutdown_script is the STONITH hook)

    99
  • 100. MHA uses SSH • MHA uses SSH actively; passphraseless login • In theory, only require Manager SSH to all nodes • However, remember masterha_secondary_check •masterha_check_ssh --conf=/etc/MHA/app1.cnf 100
  • 101. Check replication • masterha_check_repl --conf=/etc/MHA/app1.cnf • If you don't see MySQL Replication Health is OK, MHA will fail • Common errors? Master binlog in different position, read privileges on binary/relay log not granted, using multi-master replication w/o read-only=1 set (only 1 writable master allowed) 101
  • 102. MHA Manager • masterha_manager --conf=/etc/MHA/app1.cnf • Logs are printed to stderr by default, set manager_log • Recommended running with nohup, or daemontools (preferred in production) • http://code.google.com/p/mysql-master-ha/wiki/Runnning_Background 102
  • 103. So, the MHA Playbook • Install MHA node, MHA manager •masterha_check_ssh --conf=/etc/app1.cnf •masterha_check_repl --conf=/etc/app1.cnf •masterha_manager --conf=/etc/app1.cnf • That’s it! 103
  • 104. master_ip_failover_script • Pacemaker can monitor & takeoverVIP if required • Can use a catalog database • map between application name + writer/reader IPs • SharedVIP is easy to implement with minimal changes to master_ip_failover itself (however, use shutdown_script to power off machine) 104
  • 105. master_ip_online_change • Similar to master_ip_failover script, but used for online maintenance • masterha_master_switch --master_state=alive • MHA executes FLUSH TABLES WITH READ LOCK after the writing freeze 105
  • 106. Test the failover •masterha_check_status --conf=/etc/MHA/ app1.cnf • Kill MySQL (kill -9, shutdown server, kernel panic) • MHA should go thru failover (stderr) • parse the log as well • Upon completion, it stops running 106
  • 107. masterha_master_switch • Manual failover • --master_state=dead • Scheduled online master switchover • Great for upgrades to server, etc. • masterha_master_switch --master_state=alive --conf=/etc/MHA/app1.cnf --new_master_host=host2 107
  • 108. Handling VIPs

    my $vip = "192.168.0.1/24";
    my $interface = "0";
    my $ssh_start_vip = "sudo /sbin/ifconfig eth0:$key $vip";
    my $ssh_stop_vip = "sudo /sbin/ifconfig eth0:$key down";
    ...
    sub start_vip() {
        `ssh $ssh_user@$new_master_host " $ssh_start_vip "`;
    }
    sub stop_vip() {
        `ssh $ssh_user@$orig_master_host " $ssh_stop_vip "`;
    }

    108
  • 109. Integration with other HA solutions • Pacemaker • on RHEL6, you need some HA add-on, just use the CentOS packages • /etc/ha.d/haresources to configure VIP • `masterha_master_switch --master_state=dead --interactive=0 --wait_on_failover_error=0 --dead_master_host=host1 --new_master_host=host2` • Corosync + Pacemaker works well 109
  • 110. What about replication delay? • By default, MHA checks to see if slave is behind master. By more than 100MB, it is never a candidate master • If you have candidate_master=1 set, consider setting check_repl_delay=0 • You can integrate it with pt-heartbeat from Percona Toolkit 110
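A rough pt-heartbeat pairing (schema, table and credentials are placeholders); the monitor then reports true delay from the heartbeat row instead of relying on Seconds_Behind_Master:

    # on the master: update a heartbeat row every second
    pt-heartbeat --database percona --table heartbeat --create-table --update --daemonize h=master1,u=hb,p=secret

    # on a slave (or from the MHA manager host): report the measured delay
    pt-heartbeat --database percona --table heartbeat --monitor h=slave1,u=hb,p=secret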
  • 111. MHA deployment tips • You really should install this as root • SSH needs to work across all hosts • If you don’t want plaintext passwords in config files, use init_conf_load_script • Each monitor can monitor multiple MHA pairs (hence app1, app2, etc.) • You can have a standby master, make sure its read-only • By default, master1->master2->slave3 doesn’t work • MHA manages master1->master2 without issue • Use multi_tier_slave=1 option • Make sure replication user exists on candidate master too! 111
  • 112. Consul • Service discovery & configuration. Distributed, highly available, data centre aware • Comes with its own built-in DNS server, KV storage with HTTP API • Raft Consensus Algorithm 112
  • 114. VIPs vs Consul • Previously, you handledVIPs and had to write to master_ip_online_change/master_ip_failover • system(“curl -X PUT -d ‘{”Node”:”master”}’ localhost:8500/ v1/catalog/deregister); • system(“curl -X PUT -d ‘{”Node”:”master”, ”Address”: ”$new_master_host”}’ localhost:8500/v1/catalog/register); 114
  • 115. mysqlfailover • mysqlfailover from mysql-utilities using GTID’s in 5.6 • target topology: 1 master, n-slaves • enable: log-slave-updates, report-host, report-port, master-info-table=TABLE • modes: elect (choose candidate from list), auto (default), fail • --discover-slaves-login for topology discovery • monitoring node: SPoF • Errant transactions prevent failover! • Restart node? Rejoins replication topology, as a slave. 115
  • 116. MariaDB 10
    • New slave:
      SET GLOBAL GTID_SLAVE_POS = BINLOG_GTID_POS("master-bin.00024", 1600);
      CHANGE MASTER TO master_host="10.2.3.4", master_use_gtid=slave_pos;
      START SLAVE;
    • Use GTID:
      STOP SLAVE;
      CHANGE MASTER TO master_use_gtid=current_pos;
      START SLAVE;
    • Change master:
      STOP SLAVE;
      CHANGE MASTER TO master_host="10.2.3.5";
      START SLAVE;
    116
  • 117. Where is MHA used • DeNA • Premaccess (Swiss HA hosting company) • Ireland’s national TV & radio service • Jetair Belgium (MHA + MariaDB!) • Samsung • SK Group • DAPA 117
  • 118. MHA 0.56 is current • Major release: MHA 0.56 April 1 2014 (0.55: December 12 2012) • http://code.google.com/p/mysql-master-ha/wiki/ReleaseNotes 118
  • 119. MHA 0.56 • 5.6 GTID: GTID + auto position enabled? Failover with GTID SQL syntax not relay log failover • MariaDB 10+ still needs work • MySQL 5.6 support for checksum in binlog events + multi-threaded slaves • mysqlbinlog and mysql in custom locations (configurable clients) • binlog streaming server supported 119
  • 120. MHA 0.56 • ping_type = INSERT (for master connectivity checks - assuming master isn’t accepting writes) 120
  • 121. (MariaDB) Replication Manager • Support for MariaDB Server GTIDs • Single, portable 12MB binary • Interactive GTID monitoring • Supports failover or switchover based on requests • Topology detection • Health checks 121
  • 122. Is fully automated failover a good idea? • False alarms • Can cause short downtime, restarting all write connections • Repeated failover • Problem not fixed? Master overloaded? • MHA ensures a failover doesn’t happen within 8h, unless --last_failover_minute=n is set • Data loss • id=103 is latest, relay logs are at id=101, loss • group commit means sync_binlog=1, innodb_flush_log_at_trx_commit=1 can be enabled! (just wait for master to recover) • Split brain • sometimes poweroff takes a long time 122
  • 123. Video resources • Yoshinori Matsunobu talking about High Availability & MHA at Oracle MySQL day • http://www.youtube.com/watch?v=CNCALAw3VpU • Alex Alexander (AccelerationDB) talks about MHA, with an example of failover, and how it compares to Tungsten • http://www.youtube.com/watch?v=M9vVZ7jWTgw • Consul & MHA failover in action • https://www.youtube.com/watch?v=rA4hyJ-pccU 123
  • 124. References • Design document • http://www.slideshare.net/matsunobu/automated-master-failover • Configuration parameters • http://code.google.com/p/mysql-master-ha/wiki/Parameters • JetAir MHA use case • http://www.percona.com/live/mysql-conference-2012/sessions/case-study-jetair-dramatically-increasing-uptime-mha • MySQL binary log • http://dev.mysql.com/doc/internals/en/binary-log.html 124
  • 125. 125
  • 126. Service Level Agreements (SLA) • AWS - 99.95% in a calendar month • Rackspace - 99.9% in a calendar month • Google - 99.95% in a calendar month • SLAs exclude “scheduled maintenance” • AWS is 30 minutes/week, so really 99.65% 126
  • 127. RDS: Multi-AZ • Provides enhanced durability (synchronous data replication) • Increased availability (automatic failover) • Warning: can be slow (1-10 mins+) • Easy GUI administration • Doesn’t give you another usable “read-replica” though 127
  • 128. External replication • MySQL 5.6 you can do RDS -> Non-RDS • enable backup retention, you now have binlog access • You still can’t replicate INTO RDS • use Tungsten Replicator • also supports going from RDS to Rackspace/etc. 128
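The binlog access mentioned above relies on backup retention being switched on; binlog retention can then be stretched with the RDS stored procedures, roughly:

    -- keep binary logs long enough for the external slave to fetch them
    CALL mysql.rds_set_configuration('binlog retention hours', 144);
    CALL mysql.rds_show_configuration;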
  • 129. High Availability • Plan for node failures • Don’t assume node provisioning is quick • Backup, backup, backup! • “Bad” nodes exist • HA is not equal across options - RDS wins so far 129
  • 130. Unsupported features • AWS: GTIDs, InnoDB Cache Warming, InnoDB transportable tablespaces, authentication plugins, semi-sync replication • Google: UDFs, replication, LOAD DATA INFILE, INSTALL PLUGIN, SELECT ... INTO OUTFILE 130
  • 131. Warning: automatic upgrades • Regressions happen even with a minor version upgrade in the MySQL world • InnoDB update that modifies a row's PK triggers recursive behaviour until all disk space is exceeded? 5.5.24->5.5.25 (fixed: 5.5.25a) • Using query cache for partitioned tables? Disabled since 5.5.22->5.5.23! 131
  • 132. Can you configure MySQL? • You don’t access my.cnf naturally • In AWS you have parameter groups which allow configuration of MySQL 132 source: http://www.mysqlperformanceblog.com/2013/08/21/amazon-rds-with-mysql-5-6-configuration-variables/
  • 133. 133
  • 134. Sharding solutions • Not all data lives in one place • hash records to partitions • partition alphabetically? put n-users/shard? organise by postal codes? 134
  • 135. Horizontal vs vertical 135 192.168.0.1 User id int(10) username char(15) password char(15) email char(50) 192.168.0.2 User id int(10) username char(15) password char(15) email char(50) 192.168.0.3 User id int(10) username char(15) password char(15) email char(50) 192.168.0.1 User id int(10) username char(15) password char(15) email char(50) 192.168.0.2 UserInfo login datetime md5 varchar(32) guid varchar(32) Better if INSERT heavy and there’s less frequently changed data
  • 136. How do you shard? • Use your own sharding framework • write it in the language of your choice • simple hashing algorithm that you can devise yourself • SPIDER • Tungsten Replicator • Tumblr JetPants • Google Vitess • ShardQuery 136
  • 137. SPIDER • storage engine to vertically partition tables 137
  • 138. Tungsten Replicator (OSS) • Each transaction tagged with a Shard ID • controlled in a file: shard.list, exposed via JMX MBean API • primary use? geographic replication • in application, requires changes to use the API to specify shards used 138
  • 139. Tumblr JetPants • clone replicas, rebalance shards, master promotions (can also use MHA for master promotions) • Ruby library, range-based sharding scheme • https://github.com/tumblr/jetpants • Uses MariaDB as an aggregator node (multi-source replication) 139
  • 140. Google (YouTube) vitess • Servers & tools to scale MySQL for web written in Go • Has MariaDB support too (*) • Python client interface • DML annotation, connection pooling, shard management, workflow management, zero downtime restarts • Become super easy to use: http://vitess.io/ (with the help of Kubernetes) 140
  • 141. 141
  • 142. Conclusion • MySQL replication is amazing if you know it (and monitor it) well enough • Large sites run just fine with semi-sync + tooling for automated failover • Galera Cluster is great for fully synchronous replication • Don’t forget the need for a load balancer: ProxySQL is nifty 142
  • 145. Q&A / Thanks colin.charles@percona.com / byte@bytebot.net @bytebot on Twitter | http://bytebot.net/blog/ slides: slideshare.net/bytebot 145