SlideShare una empresa de Scribd logo
1 de 14
Descargar para leer sin conexión
The two little bugs that almost
brought down Booking.com
Jean-François Gagné (System Engineer)
jeanfrancois DOT gagne AT booking.com
April 25, 2017 – Percona Live Santa Clara 2017
2
For a, b and c relatively small:
Consequence(a + b + c) is much bigger than
Conseq.(a) + Conseq.(b) + Conseq.(c)
MySQL/MariaDB replication at Booking.com
● Typical Booking.com MySQL/MariaDB replication deployment:
+---+
| M |
+---+
|
+------+-- ... --+---------------+-------- ...
| | | |
+---+ +---+ +---+ +---+
| S1| | S2| | Sn| | M1|
+---+ +---+ +---+ +---+
|
+-- ... --+
| |
+---+ +---+
| T1| | Tm|
+---+ +---+
3
Impacted setup (simplified)
+---+
| M |
+---+
|
+---------- .... ----------+--------------+
| | |
+---+ +---+ +---+
| M1| | Mi| | Mj|
+---+ +---+ +---+
| | |
+-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+
| | | | | | | | | |
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
| S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo|
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
4
Upgrade from 5.5 to new major version
+---+
| M |
+---+
|
+---------- .... ----------+--------------+
| | |
+---+ +---+ +---+
| M1| | Mi| | Mj|
+---+ +---+ +---+
| | |
+-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+
| | | | | | | | | |
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
| S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo|
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
5
Bad transaction on the master
+---+
| M | <<-- “bad transaction”
+---+
|
+---------- .... ----------+--------------+
| | |
+---+ +---+ +---+
| M1| | Mi| | Mj|
+---+ +---+ +---+
| | |
+-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+
| | | | | | | | | |
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
| S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo|
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
6
Oups ! M2 runs OOM and is killed
+---+
| M | <<-- “bad transaction”
+---+
|
+---------- .... ----------+--------------+
| | |
+---+ +-/+ +---+
| M1| | Mi| | Mj|
+---+ +/-+ +---+
| | |
+-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+
| | | | | | | | | |
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
| S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo|
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
7
Oups2 ! all “blue” run OOM and are killed
+---+
| M | <<-- “bad transaction”
+---+
|
+---------- .... ----------+--------------+
| | |
+---+ +-/+ +---+
| M1| | Mi| | Mj|
+---+ +/-+ +---+
| | |
+-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+
| | | | | | | | | |
+-/+ +---+ +-/+ +---+ +---+ +---+ +---+ +-/+ +---+ +-/+
| S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo|
+/-+ +---+ +/-+ +---+ +---+ +---+ +---+ +/-+ +---+ +/-+
8
What is the “bad” transaction ?
● DELETE FROM TABLE WHERE …lot of rows…;
● Transaction of ~2 GB in the binary logs (RBR)
● Obviously a bug in the application
(but it should not have triggered an OOM)
9
What needs to be done next ?
● Reminder: 5.5 is not replication crash safe
● Next version is crash safe, but can’t…
● Crashed slaves either OOM again or are corrupted
● We need to re-clone all crashed slaves !
10
What saved us ?
● Engaged team of skilled DBAs: all joined to help
● Data not too sensitive on replication delay
● Data not too sensitive on “skipping transactions”
● pt-slave-restart
● IDEMPOTENT mode
● A torrenting cloning tool
11
What could have helped us ?
● A “working” torrenting cloning tool…
● Not used often enough, so we did not know it was broken
(fixed in less than 2 hours)
● An AUTO-FIX/AUTO-REPAIR mode (RBR)
● Instead of skipping transaction (and make data diverge)
should repair (fix) slave drift (and make data converge)
https://bugs.mysql.com/bug.php?id=54250
http://blog.wl0.org/2016/05/the-differences-between-idempotent-
and-my-suggested-auto-repair-mode/
12
● We are hiring !
● MySQL Engineer / DBA
● System Administrator
● System Engineer
● Site Reliability Engineer
● Developer / Designer
● Technical Team Lead
● Product Owner
● Data Scientist
● And many more…
● https://workingatbooking.com/
Want to know more…
Thanks
Jean-François Gagné
jeanfrancois DOT gagne AT booking.com

Más contenido relacionado

Similar a The two little bugs that almost brought down Booking.com

M|18 User Defined Function
M|18 User Defined FunctionM|18 User Defined Function
M|18 User Defined FunctionMariaDB plc
 
Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...
Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...
Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...Puppet
 
Fosscon 2012 firewall workshop
Fosscon 2012 firewall workshopFosscon 2012 firewall workshop
Fosscon 2012 firewall workshopjvehent
 
Assembly programming on the nand2tetris architecture
Assembly programming on the nand2tetris architectureAssembly programming on the nand2tetris architecture
Assembly programming on the nand2tetris architectureAmy Hanlon
 
Spring MVC - Wiring the different layers
Spring MVC -  Wiring the different layersSpring MVC -  Wiring the different layers
Spring MVC - Wiring the different layersIlio Catallo
 
Microservices Tutorial Session at JavaOne 2016
Microservices Tutorial Session at JavaOne 2016Microservices Tutorial Session at JavaOne 2016
Microservices Tutorial Session at JavaOne 2016Jason Swartz
 
Functional Database Strategies at Scala Bay
Functional Database Strategies at Scala BayFunctional Database Strategies at Scala Bay
Functional Database Strategies at Scala BayJason Swartz
 
Discovering and querying temporal data
Discovering and querying temporal dataDiscovering and querying temporal data
Discovering and querying temporal dataMariaDB plc
 
A tale of queues — from ActiveMQ over Hazelcast to Disque - Philipp Krenn
A tale of queues — from ActiveMQ over Hazelcast to Disque - Philipp KrennA tale of queues — from ActiveMQ over Hazelcast to Disque - Philipp Krenn
A tale of queues — from ActiveMQ over Hazelcast to Disque - Philipp Krenndistributed matters
 
16 MySQL Optimization #burningkeyboards
16 MySQL Optimization #burningkeyboards16 MySQL Optimization #burningkeyboards
16 MySQL Optimization #burningkeyboardsDenis Ristic
 
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendscodecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendsDataStax Academy
 
1Hippeus - zerocopy messaging по законам Спарты, Леонид Юрьев (ПЕТЕР-СЕРВИС)
1Hippeus -  zerocopy messaging по законам Спарты, Леонид Юрьев (ПЕТЕР-СЕРВИС)1Hippeus -  zerocopy messaging по законам Спарты, Леонид Юрьев (ПЕТЕР-СЕРВИС)
1Hippeus - zerocopy messaging по законам Спарты, Леонид Юрьев (ПЕТЕР-СЕРВИС)Ontico
 
Workshop 20140522 BigQuery Implementation
Workshop 20140522   BigQuery ImplementationWorkshop 20140522   BigQuery Implementation
Workshop 20140522 BigQuery ImplementationSimon Su
 
CoreOS in anger : firing up wordpress across a 3 machine CoreOS cluster
CoreOS in anger : firing up wordpress across a 3 machine CoreOS cluster CoreOS in anger : firing up wordpress across a 3 machine CoreOS cluster
CoreOS in anger : firing up wordpress across a 3 machine CoreOS cluster Shaun Domingo
 
IP Addresses
IP AddressesIP Addresses
IP Addressesadil raja
 
Homework Assignment 3 Chapter 3 St. Clair & Visick, Putting you.docx
Homework Assignment 3 Chapter 3 St. Clair & Visick, Putting you.docxHomework Assignment 3 Chapter 3 St. Clair & Visick, Putting you.docx
Homework Assignment 3 Chapter 3 St. Clair & Visick, Putting you.docxfideladallimore
 
ANALYZE for Statements - MariaDB's hidden gem
ANALYZE for Statements - MariaDB's hidden gemANALYZE for Statements - MariaDB's hidden gem
ANALYZE for Statements - MariaDB's hidden gemSergey Petrunya
 
Advanced Query Optimizer Tuning and Analysis
Advanced Query Optimizer Tuning and AnalysisAdvanced Query Optimizer Tuning and Analysis
Advanced Query Optimizer Tuning and AnalysisMYXPLAIN
 

Similar a The two little bugs that almost brought down Booking.com (20)

M|18 User Defined Function
M|18 User Defined FunctionM|18 User Defined Function
M|18 User Defined Function
 
Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...
Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...
Cloudy with a Chance of Fireballs: Provisioning and Certificate Management in...
 
Fosscon 2012 firewall workshop
Fosscon 2012 firewall workshopFosscon 2012 firewall workshop
Fosscon 2012 firewall workshop
 
Assembly programming on the nand2tetris architecture
Assembly programming on the nand2tetris architectureAssembly programming on the nand2tetris architecture
Assembly programming on the nand2tetris architecture
 
Spring MVC - Wiring the different layers
Spring MVC -  Wiring the different layersSpring MVC -  Wiring the different layers
Spring MVC - Wiring the different layers
 
Microservices Tutorial Session at JavaOne 2016
Microservices Tutorial Session at JavaOne 2016Microservices Tutorial Session at JavaOne 2016
Microservices Tutorial Session at JavaOne 2016
 
Functional Database Strategies at Scala Bay
Functional Database Strategies at Scala BayFunctional Database Strategies at Scala Bay
Functional Database Strategies at Scala Bay
 
Explain
ExplainExplain
Explain
 
Discovering and querying temporal data
Discovering and querying temporal dataDiscovering and querying temporal data
Discovering and querying temporal data
 
A tale of queues — from ActiveMQ over Hazelcast to Disque - Philipp Krenn
A tale of queues — from ActiveMQ over Hazelcast to Disque - Philipp KrennA tale of queues — from ActiveMQ over Hazelcast to Disque - Philipp Krenn
A tale of queues — from ActiveMQ over Hazelcast to Disque - Philipp Krenn
 
16 MySQL Optimization #burningkeyboards
16 MySQL Optimization #burningkeyboards16 MySQL Optimization #burningkeyboards
16 MySQL Optimization #burningkeyboards
 
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backendscodecentric AG: Using Cassandra and Clojure for Data Crunching backends
codecentric AG: Using Cassandra and Clojure for Data Crunching backends
 
1Hippeus - zerocopy messaging по законам Спарты, Леонид Юрьев (ПЕТЕР-СЕРВИС)
1Hippeus -  zerocopy messaging по законам Спарты, Леонид Юрьев (ПЕТЕР-СЕРВИС)1Hippeus -  zerocopy messaging по законам Спарты, Леонид Юрьев (ПЕТЕР-СЕРВИС)
1Hippeus - zerocopy messaging по законам Спарты, Леонид Юрьев (ПЕТЕР-СЕРВИС)
 
Workshop 20140522 BigQuery Implementation
Workshop 20140522   BigQuery ImplementationWorkshop 20140522   BigQuery Implementation
Workshop 20140522 BigQuery Implementation
 
CoreOS in anger : firing up wordpress across a 3 machine CoreOS cluster
CoreOS in anger : firing up wordpress across a 3 machine CoreOS cluster CoreOS in anger : firing up wordpress across a 3 machine CoreOS cluster
CoreOS in anger : firing up wordpress across a 3 machine CoreOS cluster
 
IP Addresses
IP AddressesIP Addresses
IP Addresses
 
Homework Assignment 3 Chapter 3 St. Clair & Visick, Putting you.docx
Homework Assignment 3 Chapter 3 St. Clair & Visick, Putting you.docxHomework Assignment 3 Chapter 3 St. Clair & Visick, Putting you.docx
Homework Assignment 3 Chapter 3 St. Clair & Visick, Putting you.docx
 
ANALYZE for Statements - MariaDB's hidden gem
ANALYZE for Statements - MariaDB's hidden gemANALYZE for Statements - MariaDB's hidden gem
ANALYZE for Statements - MariaDB's hidden gem
 
8 congestion-ipv6
8 congestion-ipv68 congestion-ipv6
8 congestion-ipv6
 
Advanced Query Optimizer Tuning and Analysis
Advanced Query Optimizer Tuning and AnalysisAdvanced Query Optimizer Tuning and Analysis
Advanced Query Optimizer Tuning and Analysis
 

Más de Jean-François Gagné

MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)
MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)
MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)Jean-François Gagné
 
Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
Almost Perfect Service Discovery and Failover with ProxySQL and OrchestratorAlmost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
Almost Perfect Service Discovery and Failover with ProxySQL and OrchestratorJean-François Gagné
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyJean-François Gagné
 
The consequences of sync_binlog != 1
The consequences of sync_binlog != 1The consequences of sync_binlog != 1
The consequences of sync_binlog != 1Jean-François Gagné
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyJean-François Gagné
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyJean-François Gagné
 
The Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialThe Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialJean-François Gagné
 
MySQL Parallel Replication by Booking.com
MySQL Parallel Replication by Booking.comMySQL Parallel Replication by Booking.com
MySQL Parallel Replication by Booking.comJean-François Gagné
 
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...Jean-François Gagné
 
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsMySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
How Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagHow Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagJean-François Gagné
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsJean-François Gagné
 
MySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsMySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsJean-François Gagné
 
Riding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamRiding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamJean-François Gagné
 

Más de Jean-François Gagné (17)

MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)
MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)
MySQL Parallel Replication: All the 5.7 and 8.0 Details (LOGICAL_CLOCK)
 
Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
Almost Perfect Service Discovery and Failover with ProxySQL and OrchestratorAlmost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash Safety
 
The consequences of sync_binlog != 1
The consequences of sync_binlog != 1The consequences of sync_binlog != 1
The consequences of sync_binlog != 1
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash Safety
 
Demystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash SafetyDemystifying MySQL Replication Crash Safety
Demystifying MySQL Replication Crash Safety
 
The Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialThe Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication Tutorial
 
MySQL Parallel Replication by Booking.com
MySQL Parallel Replication by Booking.comMySQL Parallel Replication by Booking.com
MySQL Parallel Replication by Booking.com
 
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
MySQL Parallel Replication (LOGICAL_CLOCK): all the 5.7 (and some of the 8.0)...
 
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitationsMySQL/MariaDB Parallel Replication: inventory, use-case and limitations
MySQL/MariaDB Parallel Replication: inventory, use-case and limitations
 
How Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lagHow Booking.com avoids and deals with replication lag
How Booking.com avoids and deals with replication lag
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitations
 
MySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitationsMySQL Parallel Replication: inventory, use-case and limitations
MySQL Parallel Replication: inventory, use-case and limitations
 
MySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitationsMySQL Parallel Replication: inventory, use-cases and limitations
MySQL Parallel Replication: inventory, use-cases and limitations
 
Riding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication StreamRiding the Binlog: an in Deep Dissection of the Replication Stream
Riding the Binlog: an in Deep Dissection of the Replication Stream
 

Último

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

The two little bugs that almost brought down Booking.com

  • 1. The two little bugs that almost brought down Booking.com Jean-François Gagné (System Engineer) jeanfrancois DOT gagne AT booking.com April 25, 2017 – Percona Live Santa Clara 2017
  • 2. 2 For a, b and c relatively small: Consequence(a + b + c) is much bigger than Conseq.(a) + Conseq.(b) + Conseq.(c)
  • 3. MySQL/MariaDB replication at Booking.com ● Typical Booking.com MySQL/MariaDB replication deployment: +---+ | M | +---+ | +------+-- ... --+---------------+-------- ... | | | | +---+ +---+ +---+ +---+ | S1| | S2| | Sn| | M1| +---+ +---+ +---+ +---+ | +-- ... --+ | | +---+ +---+ | T1| | Tm| +---+ +---+ 3
  • 4. Impacted setup (simplified) +---+ | M | +---+ | +---------- .... ----------+--------------+ | | | +---+ +---+ +---+ | M1| | Mi| | Mj| +---+ +---+ +---+ | | | +-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+ | | | | | | | | | | +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo| +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 4
  • 5. Upgrade from 5.5 to new major version +---+ | M | +---+ | +---------- .... ----------+--------------+ | | | +---+ +---+ +---+ | M1| | Mi| | Mj| +---+ +---+ +---+ | | | +-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+ | | | | | | | | | | +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo| +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 5
  • 6. Bad transaction on the master +---+ | M | <<-- “bad transaction” +---+ | +---------- .... ----------+--------------+ | | | +---+ +---+ +---+ | M1| | Mi| | Mj| +---+ +---+ +---+ | | | +-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+ | | | | | | | | | | +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo| +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 6
  • 7. Oups ! M2 runs OOM and is killed +---+ | M | <<-- “bad transaction” +---+ | +---------- .... ----------+--------------+ | | | +---+ +-/+ +---+ | M1| | Mi| | Mj| +---+ +/-+ +---+ | | | +-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+ | | | | | | | | | | +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo| +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 7
  • 8. Oups2 ! all “blue” run OOM and are killed +---+ | M | <<-- “bad transaction” +---+ | +---------- .... ----------+--------------+ | | | +---+ +-/+ +---+ | M1| | Mi| | Mj| +---+ +/-+ +---+ | | | +-----+- .. -+-----+ +- .. -+ +-----+-- .. -+-----+ | | | | | | | | | | +-/+ +---+ +-/+ +---+ +---+ +---+ +---+ +-/+ +---+ +-/+ | S1| | S2| | S.| | Sn| | T.| | Tm| | U1| | U2| | U.| | Uo| +/-+ +---+ +/-+ +---+ +---+ +---+ +---+ +/-+ +---+ +/-+ 8
  • 9. What is the “bad” transaction ? ● DELETE FROM TABLE WHERE …lot of rows…; ● Transaction of ~2 GB in the binary logs (RBR) ● Obviously a bug in the application (but it should not have triggered an OOM) 9
  • 10. What needs to be done next ? ● Reminder: 5.5 is not replication crash safe ● Next version is crash safe, but can’t… ● Crashed slaves either OOM again or are corrupted ● We need to re-clone all crashed slaves ! 10
  • 11. What saved us ? ● Engaged team of skilled DBAs: all joined to help ● Data not too sensitive on replication delay ● Data not too sensitive on “skipping transactions” ● pt-slave-restart ● IDEMPOTENT mode ● A torrenting cloning tool 11
  • 12. What could have helped us ? ● A “working” torrenting cloning tool… ● Not used often enough, so we did not know it was broken (fixed in less than 2 hours) ● An AUTO-FIX/AUTO-REPAIR mode (RBR) ● Instead of skipping transaction (and make data diverge) should repair (fix) slave drift (and make data converge) https://bugs.mysql.com/bug.php?id=54250 http://blog.wl0.org/2016/05/the-differences-between-idempotent- and-my-suggested-auto-repair-mode/ 12
  • 13. ● We are hiring ! ● MySQL Engineer / DBA ● System Administrator ● System Engineer ● Site Reliability Engineer ● Developer / Designer ● Technical Team Lead ● Product Owner ● Data Scientist ● And many more… ● https://workingatbooking.com/ Want to know more…