Surviving the Grand National
Kevin Bowman, Head of Operations, Sky Betting & Gaming
For one of the UK's largest online betting websites, the Grand National breaks records by every metric. We have a complex technical platform, involving in-house and third-party software, serving a highly dynamic website on the busiest day of the sports betting year.
I'll tell the story "from the trenches" of how we planned for and ran the day, and what lessons we learned to make next year's even better. I'll cover technical details (keeping our website running while normal customer traffic would look to most companies like a DDoS), how we organised our responsibilities on the day itself, and the various challenges we faced and how we overcame them to have our most successful Grand National yet.
DevOps Enterprise Summit London 2016
Some Grand National facts…
• Started in 1839
• Nearly 7km long (4 miles 514 yards)
• 30 fences jumped, over 2 laps
• The most valuable jump race in Europe
• Up to 40 horses
• The favourite has only won 9 times in the last 70 years
• Highest bet traffic of the year
The Grand National is the biggest horse race of the UK horse racing season.
It's part of the Aintree festival, which isn't the largest festival (it's not as big as Cheltenham), but it's the one race in the UK that everyone bets on, even people who don't otherwise bet on horse races.
It's unique, it's an anomaly, and it generates the most traffic a betting website sees all year.
Notice the horse racing “spikiness” from 2pm – 9pm ish.
Now we’ll scale that down…
Here’s what a regular Saturday’s traffic looks like.
Notice the football start and end, and then the late football match start and end.
And here’s the Grand National.
Notice the football shape still happening, but uplifted a lot more.
Also note the sharp drop as the Grand National itself starts, then the progressive recovery of traffic as we controlled the post-race thundering herd.
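That controlled recovery curve is what managing a thundering herd looks like. As a rough sketch (not our actual mechanism), the standard client-side control is exponential backoff with full jitter, so retries spread out instead of arriving in synchronised waves:

```python
import random

def retry_delays(base=1.0, cap=60.0, attempts=5, rng=random.random):
    """Exponential backoff with full jitter: each client waits a random
    time in [0, min(cap, base * 2**attempt)) before retrying, so a crowd
    of clients de-synchronises instead of reconnecting all at once."""
    return [rng() * min(cap, base * (2 ** attempt)) for attempt in range(attempts)]

# Each simulated client gets a different, spread-out retry schedule.
delays = retry_delays()
```

The `base`/`cap` numbers here are illustrative; the point is the shape, a ceiling on the wait plus randomness to flatten the spike.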
Who am I? Let’s understand a little about how SB&G is organised…
SB&G is arranged into several areas
I work in the "Bet Tribe", where a "Tribe" is something like a self-contained business unit.
The Bet Tribe is not only Technology – it includes Product and Marketing as well.
Within Technology, we are grouped into 3 areas…
… although delivery teams cut across all areas.
Moving forwards, we’re trying to change it from just “delivery” into “ownership” – note that Operations is included here.
Within the Operations world, we have 4 teams
We look after the 24x7 site.
SB&G is a rapidly growing business, and has been for several years now.
We used to be wholly owned by Sky, but just over a year ago they sold 80% of us to a private equity firm.
That gave us a lot of autonomy in how we operate, but it also made our first Grand National as an independent company a very high profile event in the eyes of our new owners.
Architecturally, our betting site works in a traditional multi-tiered setup.
The website runs on PHP, behind proxies which do no caching.
Behind the PHP there are a few databases: MySQL and Mongo, as well as Redis for session and betslip storage (our betslip is a bit like an ecommerce shopping cart).
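As a rough illustration of the betslip-as-cart idea (an in-memory dict stands in here for the Redis hash, and all the names and stakes are made up):

```python
# Illustrative betslip keyed by session id, mirroring how a Redis hash
# per session (e.g. HSET betslip:<session> <selection> <stake>) can back
# an ecommerce-style cart. A plain dict stands in for Redis here.
class Betslip:
    def __init__(self):
        self._store = {}  # session_id -> {selection: stake}

    def add(self, session_id, selection, stake):
        # Last write wins for a given selection, like overwriting a hash field.
        self._store.setdefault(session_id, {})[selection] = stake

    def total_stake(self, session_id):
        return sum(self._store.get(session_id, {}).values())

slip = Betslip()
slip.add("sess-42", "horse-7", 5.00)
slip.add("sess-42", "horse-12", 10.00)
```

Keeping this in Redis rather than in PHP session files is what lets any web server handle any request for the same customer.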
If you want to place a bet or do anything transactional, we call APIs into another system. We don't develop that system ourselves, but we host it as part of our technology stack, and we manage and scale it ourselves alongside the website.
Notice the single Informix database behind it all. Scaling yay!
We also run a whole replication system, which uses NodeJS and RabbitMQ to get the catalog of sporting events and betting odds out of the backend and into the website's databases.
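The shape of that replication flow can be sketched like this, with a plain Python queue standing in for RabbitMQ and a dict standing in for the website-side databases (the real system is NodeJS; this is purely illustrative):

```python
import queue

def replicate(messages, website_db):
    """Drain a queue of odds updates (stand-in for a RabbitMQ consumer)
    and upsert them into the website-side store (stand-in for MySQL/Mongo).
    Last write wins per (event, selection) key, so the website always
    shows the latest odds the backend published."""
    while True:
        try:
            msg = messages.get_nowait()
        except queue.Empty:
            return website_db
        website_db[(msg["event"], msg["selection"])] = msg["odds"]

q = queue.Queue()
q.put({"event": "grand-national", "selection": "horse-7", "odds": "33/1"})
q.put({"event": "grand-national", "selection": "horse-7", "odds": "28/1"})
db = replicate(q, {})
```

The key design point is that the website reads its own local copy of the catalogue, so a backend slowdown doesn't directly slow page rendering.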
Preparation generally starts after Christmas, with a series of iterative cycles:
• Load test key areas
• Identify potential improvements
• Plan and implement the improvements
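One iteration of that cycle, sketched very roughly in Python: a concurrent driver hammering a key area and reporting p95 latency. The target here is a stub callable; a real test would wrap an HTTP call to the area under test.

```python
import concurrent.futures
import time

def p95(samples):
    """95th-percentile of a list of latency samples."""
    return sorted(samples)[int(len(samples) * 0.95) - 1]

def load_test(target, requests=200, workers=20):
    """Fire `requests` calls at `target` from a thread pool and return
    the p95 latency in seconds. `target` is any zero-arg callable."""
    def timed(_):
        start = time.perf_counter()
        target()
        return time.perf_counter() - start
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return p95(list(pool.map(timed, range(requests))))

# Stub standing in for a real endpoint under test.
latency = load_test(lambda: time.sleep(0.001))
```

Even a driver like this only exercises the paths you point it at, which is exactly the caveat that follows.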
HOWEVER… your load tests are lying to you.
DON’T BELIEVE A LOAD TEST.
You will never ever ever replicate real traffic.
You need to use your experience to identify "key areas", and never think that the numbers you achieve in a load-test are representative of real life.
Use your experience in every step of the cycle.
Some things we found whilst load-testing…
However, not everything is predictable…
More unexpected events
But worse than these failing, is…
… them slowing down.
For additional excitement…
… this year they chose to change the start-time of the race, introducing a database-killing danger zone into the mix.
To throw another unknown into the mix, last year we also made a lot of changes to our Accounts / SSO team; here’s Phil to tell you more.
On the day itself, we did two things which were particularly important to us surviving.
We knew that the plan wouldn't be stuck to, but it was useful to have a list of the immovable parts of the day:
• horse race starts
• football kick-offs and half- and full-times
• expected settlement times, etc.
Basically, a paper list of things which I could tick off as the day unfolded, and quickly reference if we saw any unexpected behaviour.
Please let your teams do their jobs.
More important, though, was having clusters of small autonomous teams spread throughout the office, all within a sight-line (and hearing range) of a single central point (which was me).
Obviously, Sheffield weren't within hearing range of Leeds, but Phil acted as a proxy and was constantly on a video-call with the rest of the Account team in Sheffield, so as far as I was concerned Phil "was" the Sheffield office.
It's important to call out the autonomous nature of the teams.
We had teams of 2-4 people looking after things like
I also had assistants for:
• Service management and non-technical business communication
• Recording everything which was happening, decisions made, incidents etc.
• Taking over the coordination from me when I needed a break
Remember when I said that load-testing doesn't work...?
A couple of things very nearly went wrong on the day, but our teams noticed the potential problems and got out in front of them to put a fix in place.
For example, one of our risk management systems (which lets our traders see a filtered list of all bets going through the system) nearly took out our ability to take bets.
We have 2 versions of this risk management system – a new one (“Orwell”) …
… and an old one which isn’t cool enough to have a name.
Traders should be using Orwell, but in the run-up to the 3pm football kick-offs, Orwell hiccuped with the volume of bets going through it.
As a result, all of the traders launched the old system, which was the known fallback for Orwell.
Unfortunately, the way the old system works means that the more people who launch it during high-volume bet-placement periods, the more it starts to crumble.
To make matters worse, one of the reasons why we've replaced the old system is because it's directly in line with bet-placement.
If it slows down, so does our bet placement capability.
The team looking after the trading systems noticed this, and realised that the Orwell hiccup had caused lots of people to launch the old system. After a quick conversation with me and our infrastructure team, they quickly scaled-up the Orwell platform …
… opened the firewall to the new servers, put them into the load-balancer, and told the traders to go back to using Orwell instead of the old system.
Bet placement remained unaffected.
It was some of the best business we did all day.
Zooming in on the danger zone…
… there are 3 uplifts, and each of these is related to our phased approach to football settlement, which is how we protected our database.
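The phased approach can be sketched as settling in capped batches, so the database sees a plateau of work rather than one spike. This shows only the partitioning; batch sizes and pacing are made up, and in production each phase would wait for database load to recover before the next begins.

```python
def settle_in_phases(bet_ids, batch_size=1000):
    """Yield capped batches of bet settlements. Committing one batch at
    a time (with a pause between phases) turns one huge settlement spike
    into several smaller, survivable uplifts."""
    for i in range(0, len(bet_ids), batch_size):
        yield bet_ids[i:i + batch_size]

batches = list(settle_in_phases(list(range(2500)), batch_size=1000))
```

Three phases of settlement are exactly the three uplifts visible in the danger-zone graph.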
One of our largest competitors tweeted afterwards that they’d seen a record-breaking 9,779 bets/minute on the day – our management of the danger zone pushed us well above that for 20 mins or so.
Crucially, there were a small number of people in the decision-making chain, and we had the right people in the right place to enact changes.
Also... everyone went home at the end of the day feeling good about themselves.