@jewelia
Scaling Slack
Infrastructure
🚀
Julia Grace
Senior Director of Engineering
@jewelia
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/slack-scaling-infrastructure/
Presented at QCon New York
www.qconnewyork.com
Phase 0: 2015
~2.5M Daily Active Users
Phase 1: 2016
~4M Daily Active Users
Phase 1: 2016
Slack was originally designed for teams of fewer than 150 people.
You make very different architectural decisions when you’re building for a team of
100 people vs 500,000.
Before August 2016 we had no Infra team.
Original infrastructure built for Glitch worked very well in 2014/2015.
~150 Engineers total.
Infrastructure investments would come secondary to feature work.
Things were starting to break in strange, unusual ways.
Phase 1: 2016
Example: User Presence
Green dot indicating online/away/offline.
Very few people notice it, unless it’s broken (people expect it to “just work”).
Apps and bots are always online.
Phase 1: 2016
User Presence
Initially broadcast every presence change (e.g. “Julia Grace is away”) to every user in the workspace: O(n^2).
Presence was ~80% of all web socket traffic.
Peak volume in late 2016: 16 million messages/minute over web socket.
Presence messages: 13 million/minute.
Rapid transition from broadcast to publish/subscribe.
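A minimal sketch of the difference between the two models. The class and function names are hypothetical, and the idea that a client subscribes only to the users it cares about is an assumption for illustration, not a description of Slack's actual design.

```python
from collections import defaultdict


def broadcast_deliveries(workspace_size):
    """Deliveries if every user changes presence once and each change
    is sent to every other user in the workspace: n * (n - 1), i.e. O(n^2)."""
    return workspace_size * (workspace_size - 1)


class PresencePubSub:
    """Toy publish/subscribe hub (hypothetical): a presence change is
    delivered only to users who subscribed to that specific user."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # target user -> watchers
        self.delivered = []                   # (watcher, user, status)

    def subscribe(self, watcher, target):
        self._subscribers[target].add(watcher)

    def publish(self, user, status):
        """Deliver one change; returns the number of deliveries made."""
        watchers = self._subscribers.get(user, set())
        for watcher in sorted(watchers):
            self.delivered.append((watcher, user, status))
        return len(watchers)


# Share of traffic from the figures above: 13M of 16M messages/minute.
presence_share = 13 / 16  # 0.8125, consistent with "~80%"
```

With broadcast, a 1,000-person workspace in which everyone changes state once produces `broadcast_deliveries(1000)` = 999,000 deliveries. With the subscription model, the cost of a change depends on how many users are watching, not on workspace size.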
There were many organizational challenges as well.
Phase 1: 2016
How to build an engineering-led org in a product-led company?
Would we be able to get headcount, budget?
How to communicate the value of what we are doing to non-technical audiences?
How do we interface with sales?
Infrastructure as a competitive advantage.
https://www.flickr.com/photos/pocheco/14833391966
Phase 1: 2016
Start internal evangelism on day #1.
I went on an internal PR campaign: why our work was important, why we needed to continually invest in infrastructure. Make work very visible to execs in other functions.
Followed existing company process.
We did planning, status reporting, etc. at the same cadence and in the same meetings as product engineering. Don’t try to start a new group and invent new process.
Identify executive sponsor.
Phase 2: 2017
Phase 2: 2017
Technology landscape.
Hack/PHP monolith on backend, JavaScript with no libraries on frontend.
1 service: presence and real-time messaging.
Building a second service: Go caching service.
These bespoke services each had to handle rate limiting, traffic management, deployment.
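To make "each service had to handle rate limiting" concrete, here is a small token-bucket limiter of the kind a bespoke service might carry itself. The slides do not say which algorithm was used; token bucket is assumed here only because it is a common choice, and the class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity` and
    refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so tests can fake time
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Every service that reimplements logic like this also has to test, tune, and deploy it, which is the duplication cost the slide points at.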
Phase 2: 2017
It was time to change our DB sharding strategy.
MySQL sharded by team/workspace to Vitess sharded by various keys.
Worked great! Until we hit scaling limits, significant hotspots.
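A rough illustration of how a hotspot arises when the sharding key is the workspace. The data is synthetic, the shard count is arbitrary, and using a channel identifier as the alternative key is an assumption (the slide says only "various keys"). This is not how Vitess routes queries; it only shows the effect of key choice on load distribution.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(key):
    """Map a sharding key to a shard using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def load_by_shard(rows, key_fn):
    """Count how many rows land on each shard for a given choice of key."""
    return Counter(shard_for(key_fn(row)) for row in rows)


# Synthetic rows of (workspace, channel): one very large workspace
# and one hundred small ones.
rows = [("big-corp", f"channel-{i}") for i in range(10_000)]
rows += [(f"team-{t}", f"channel-{t}-{i}") for t in range(100) for i in range(10)]

by_workspace = load_by_shard(rows, key_fn=lambda r: r[0])
by_channel = load_by_shard(rows, key_fn=lambda r: r[1])

print("sharded by workspace:", dict(by_workspace))
print("sharded by channel:  ", dict(by_channel))
```

Keyed by workspace, all 10,000 rows of the large workspace land on a single shard no matter how many shards exist. Keyed by a finer-grained value, the same rows spread across shards.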


[Diagram: Monolith, Service A, Service B]
Who owns this?
Communication Risk
The more technically complex,
nuanced a problem is…
The higher the communication risk.
Phase 2: 2017
Immense pressure to hire engineers.
Many human SPOFs (single points of failure) because the team was so small.
Everyone was overextended and overcommitted.
We had to figure out how to hire Infra engineers.
All our hiring processes were optimized around hiring generalists: frontend, backend, iOS, Android, Ops.
What skills do we need and value? How do we test for those skills?
Phase 2: 2017
Decided to hire Infra engineering generalists.
Created a take-home coding exercise designed to test:
1. An understanding of servers, networking, and protocols.
2. An understanding of concurrency, performance, and resource constraints, and an ability to anticipate future issues and implement solutions.
3. An ability to write clear, easy-to-understand code, communicate your approach, and reason about tradeoffs that you have made.
Phase 2: 2017
I wore so many hats. Too many hats.
Similar to my days as a startup CTO!
I was the Engineering Director and
Forming strategy, hiring managers and ICs, evangelizing the org.
…Product Manager and
Internal interface to Product Engineering/PMs building features, external interface to customers with questions about the integrity of our infrastructure.
…Program Manager.
Running cross-functional initiatives.
Phase 3: 2018
Phase 3: 2018
“0 to 1” was over. Now time for “1 to ∞”.
Reactive to Proactive.
Transition from few teams to an org in 3 offices.
Team nearly 100 engineers by end of year.
Now included Data, Machine Learning, Search Infrastructure.
Many orders of magnitude better performance.
Things were not breaking all the time.
Phase 3: 2018
Services model matured significantly.
SLAs for services, consistent deployment processes, etc.
Mature incident response process.
Dividing into sub-teams made sense.
Data Stores & Cache Infra, Service Mesh & Web Serving, Distributed Messaging.
Phase 3: 2018
Hired Director Specialists…
Had to quickly learn how to hire senior leaders whose jobs you haven’t done before.
How to do this well: talk to a lot of people who currently do the job you’re trying to hire for, deeply understand the talent market.
and Product Managers…
and did an acquisition.
Phase 3: 2018
Challenge: coherency across a large organization.
Example: overlap between Machine Learning and Frontend Infra was NULL.
Difficult to have a unified vision.
Stakeholders were different for each part of the org; the Data Infra organization worked closely with G&A (finance), Search Infra did not.
I should have done more re-orgs!
[Slides labeled 2016, 2017, and 2018]
Today
Infra has been around for ~3 years
400M async jobs processed/day to 2.5B
3M DAU (daily active users) to 10M DAU
1M simultaneously connected users to 7.5M
10 to ~100 engineers in SF, NYC, YVR
Generalist (ICs, Managers) to specialists
1 amazing team
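For a quick sense of scale, the before/after pairs above can be turned into growth multiples. The numbers are copied from the slide; the engineer count uses the approximate "~100" figure, so that multiple is approximate too.

```python
# (before, after) pairs taken from the "Today" slide.
growth = {
    "async jobs processed/day": (400e6, 2.5e9),
    "daily active users": (3e6, 10e6),
    "simultaneously connected users": (1e6, 7.5e6),
    "engineers (approximate)": (10, 100),
}


def multiple(before, after):
    """How many times larger `after` is than `before`."""
    return after / before


for name, (before, after) in growth.items():
    print(f"{name}: {multiple(before, after):.2f}x")
```

Async jobs grew 6.25x and simultaneous connections 7.5x, both faster than the roughly 3.3x growth in daily active users over the same period.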
Thank You!

Scaling Infrastructure Engineering at Slack
