Billions of Hits:
Scaling Twitter
John Adams
Twitter Operations
#chirpscale
John Adams                            @netik
•   Early Twitter employee (mid-2008)

•   Lead engineer: Outward Facing Services (Apache,
    Unicorn, SMTP), Auth, Security

•   Keynote Speaker: O’Reilly Velocity 2009

•   O’Reilly Web 2.0 Speaker (2008, 2010)

•   Previous companies: Inktomi, Apple, c|net

•   Working on Web Operations book with John Allspaw
    (Flickr, Etsy), out in June
Growth.
752%
                       2008 Growth
source: comscore.com - (based only on www traffic, not API)
1358%
                       2009 Growth
source: comscore.com - (based only on www traffic, not API)
12th
                    most popular
source: alexa.com
55M
                   Tweets per day
                  (640 TPS average, 1000 TPS peak)
source: twitter.com internal
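As a quick sanity check on the figure above, 55M tweets per day averages out to a little under 640 tweets per second. A minimal sketch of the arithmetic (the variable names are illustrative, not from the talk):

```python
# Average tweets per second implied by the daily volume on the slide.
TWEETS_PER_DAY = 55_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

average_tps = TWEETS_PER_DAY / SECONDS_PER_DAY
print(f"{average_tps:.1f} tweets per second on average")
```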
600M
                   Searches/Day
source: twitter.com internal
Traffic split: 25% Web, 75% API
Operations
•   What do we do?

    •   Site Availability

    •   Capacity Planning (metrics-driven)

    •   Configuration Management

    •   Security

    •   Much more than basic Sysadmin
What have we done?
•   Improved response time, reduced latency

•   Fewer errors during deploys (Unicorn!)

•   Faster performance

•   Lower MTTD (Mean time to Detect)

•   Lower MTTR (Mean time to Recovery)
Operations Mantra
•   Find Weakest Point: Metrics + Logs + Science = Analysis

•   Take Corrective Action: Process

•   Move to Next Weakest Point: Repeatability
Make an attack plan.
Symptom          Bottleneck   Vector         Solution
Bandwidth        Network      HTTP Latency   Servers++
Timeline Delay   Database     Update Delay   Better algorithm
Status Growth    Database     Delays         Flock, Cassandra
Updates          Algorithm    Latency        Algorithms
Finding Weakness
•   Metrics + Graphs

    •   Individual metrics are irrelevant

    •   We aggregate metrics to find knowledge

•   Logs

•   SCIENCE!
Monitoring
•   Twitter graphs and reports critical metrics in
    as near real time as possible

•   If you build tools against our API, you should
    too.

    •   RRD, other Time-Series DB solutions

    •   Ganglia + custom gmetric scripts

•   dev.twitter.com - API availability
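A custom gmetric script usually boils down to computing one number and handing it to Ganglia's gmetric command. The sketch below only builds the command line and does not run it; the metric name whales_per_minute and the helper gmetric_command are hypothetical, and the flags shown (--name, --value, --type, --units) are the common gmetric options, so check them against your installed version.

```python
import shlex

def gmetric_command(name, value, metric_type="uint32", units=""):
    """Build (but do not execute) a gmetric invocation for one custom metric."""
    cmd = ["gmetric", f"--name={name}", f"--value={value}", f"--type={metric_type}"]
    if units:
        cmd.append(f"--units={units}")
    return cmd

# Hypothetical metric: number of HTTP 503s seen in the last minute.
cmd = gmetric_command("whales_per_minute", 3, units="errors")
print(shlex.join(cmd))
```

Passing the list to subprocess.run from a cron job or a small loop would publish the value; that step is left out here so the sketch runs without Ganglia installed.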
Analyze
•   Turn data into information

    •   Where is the code base going?

    •   Are things worse than they were?

        •   Understand the impact of the last software
            deploy

        •   Run check scripts during and after deploys

•   Capacity Planning, not Fire Fighting!
Data Analysis
•   Instrumenting the world pays off.

•   “Data analysis, visualization, and other
    techniques for seeing patterns in data are
    going to be an increasingly valuable skill set.
    Employers take notice!”
          “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
Forecasting
•   Curve-fitting for capacity planning
    (R, fityk, Mathematica, CurveFit)

[Chart: status_id over time with a fitted curve (r² = 0.99), crossing the signed 32-bit int limit and later the unsigned 32-bit int limit, each marked "Twitpocalypse"]
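The forecasting idea can be sketched in a few lines: fit a curve to the highest status_id seen each day, then solve for the day the curve reaches a 32-bit limit. The example below assumes exponential growth and uses synthetic data, not Twitter's numbers; the functions fit_exponential and day_reaching are illustrative names, and the talk's own fits were done in tools such as R or fityk.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of ln(y) = a + b*x. Returns (a, b)."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxy / sxx
    a = mean_l - b * mean_x
    return a, b

def day_reaching(limit, a, b):
    """Day on which the fitted curve exp(a + b*x) reaches `limit`."""
    return (math.log(limit) - a) / b

# Synthetic data (an assumption for illustration): 1% growth per day.
days = list(range(100))
max_status_id = [5e8 * 1.01 ** d for d in days]

a, b = fit_exponential(days, max_status_id)
signed_day = day_reaching(2**31 - 1, a, b)    # signed 32-bit limit
unsigned_day = day_reaching(2**32 - 1, a, b)  # unsigned 32-bit limit
print(f"signed limit around day {signed_day:.0f}, unsigned around day {unsigned_day:.0f}")
```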
Internal Dashboard
External API Dashboard
   http://dev.twitter.com/status
What’s a Robot?
•   Actual error in the Rails stack (HTTP 500)

•   Uncaught Exception

•   Code problem, or failure / nil result

•   Increases our exception count

•   Shows up in Reports
What’s a Whale?
•   HTTP Error 502, 503

•   Twitter has a hard-and-fast five-second timeout

•   We’d rather fail fast than block on requests

•   We also kill long-running queries (mkill)

•   Timeout
Whale Watcher
•   Simple shell script

    •   MASSIVE WIN by @ronpepsi

•   Whale = HTTP 503 (timeout)

•   Robot = HTTP 500 (error)

•   Examines last 60 seconds of
    aggregated daemon / www logs

•   “Whales per Second” > Wthreshold

    •   Thar be whales! Call in ops.
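The check described above fits in a few lines. The original is a shell script over aggregated logs; the Python sketch below is only an illustration and assumes the log has already been parsed into (epoch_seconds, http_status) tuples, a hypothetical format. The function names and the threshold values are also assumptions.

```python
def whales_per_second(entries, now, window=60):
    """Rate of HTTP 503 responses over the last `window` seconds.

    entries: iterable of (epoch_seconds, http_status) tuples.
    """
    recent = [status for ts, status in entries if now - window < ts <= now]
    whales = sum(1 for status in recent if status == 503)
    return whales / window

def should_page_ops(entries, now, threshold):
    """True when the whale rate exceeds the alert threshold."""
    return whales_per_second(entries, now) > threshold

now = 1_000_000
entries = [
    (now - 5, 200),
    (now - 10, 503),   # whale
    (now - 20, 503),   # whale
    (now - 30, 500),   # robot, not counted here
    (now - 120, 503),  # outside the 60-second window
]
rate = whales_per_second(entries, now)
print(f"{rate:.3f} whales per second")
```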
Deploy Watcher
Sample window: 300.0 seconds

First start time:
Mon Apr 5 15:30:00 2010 (Mon Apr   5 08:30:00 PDT 2010)
Second start time:
Tue Apr 6 02:09:40 2010 (Mon Apr   5 19:09:40 PDT 2010)

PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB0049 CANARY APACHE: ALL OK
WEB0049 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY OTHER: ALL OK
Feature “Darkmode”
•   Specific site controls to enable and disable
    computationally or I/O-heavy site functions

•   The “Emergency Stop” button

•   Changes logged and reported to all teams

•   Around 60 switches we can throw

•   Static / Read-only mode
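A set of switches like this can be modeled as a small registry that records every change. The sketch below is an in-memory illustration only; the class name Darkmode, the switch names, and the audit-log shape are assumptions, not Twitter's implementation. A real version would keep the flags somewhere durable and notify teams on each change.

```python
class Darkmode:
    """Toy registry of on/off switches with a change log."""

    def __init__(self, switches):
        self._enabled = dict(switches)
        self.audit_log = []  # (who, switch, enabled) tuples, oldest first

    def set(self, name, enabled, who):
        if name not in self._enabled:
            raise KeyError(f"unknown switch: {name}")
        self._enabled[name] = enabled
        self.audit_log.append((who, name, enabled))

    def is_enabled(self, name):
        return self._enabled[name]

# Hypothetical switches for two expensive features.
flags = Darkmode({"search": True, "trends": True})
flags.set("trends", False, who="ops-oncall")

if flags.is_enabled("trends"):
    print("compute trends")
else:
    print("trends disabled, serving without them")
```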
request flow
           Load Balancers

         Apache mod_proxy

           Rails (Unicorn)

 Flock      memcached        Kestrel

         MySQL      Cassandra

             Daemons
Servers
•   Co-located, dedicated machines at NTT America

•   No clouds for serving; we use them only for monitoring

    •   Need raw processing power, latency too high
        in existing cloud offerings

•   Frees us to deal with real, intellectual, computer
    science problems.

•   Moving to our own data center soon
unicorn
•   A single socket Rails application Server (Rack)

•   Zero Downtime Deploys (!)

    •   Controlled, shuffled transfer to new code

•   Less memory, 30% less CPU

•   Shift from mod_proxy_balancer to
    mod_proxy_pass

    •   HAProxy and Nginx weren’t any better. Really.
Rails
•   Mostly only for front-end.

•   Back end mostly Scala and pure Ruby

•   Not to blame for our issues. Analysis found:

    •   Caching + Cache invalidation problems

    •   Bad queries generated by ActiveRecord, resulting in
        slow queries against the db

    •   Queue Latency

    •   Replication Lag
memcached
•   memcached isn’t perfect.

    •   Memcached SEGVs hurt us early on.

•   Evictions make the cache unreliable for
    important configuration data
    (loss of darkmode flags, for example)

•   Network Memory Bus isn’t infinite

•   Segmented into pools for better performance
Loony
•   Central machine database (MySQL)

    •   Python, Django, Paramiko SSH

        •   Paramiko - Twitter OSS (@robey)

    •   Ties into LDAP groups

•   When data center sends us email, machine
    definitions built in real-time
Murder
•   @lg rocks!

•   BitTorrent-based replication for deploys

•   ~30-60 seconds to update >1k machines

•   P2P - Legal, valid, Awesome.
Kestrel
•   @robey

•   Works like memcache (same protocol)

•   SET = enqueue | GET = dequeue

•   No strict ordering of jobs

•   No shared state between servers

•   Written in Scala.
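The queue semantics above can be illustrated with an in-memory model. This is not Kestrel's code or its wire protocol: ToyQueueServer and ToyClient are hypothetical stand-ins, and the random choice of server is an assumption used to show why independent servers with no shared state give only loose ordering across the cluster.

```python
import random
from collections import defaultdict, deque

class ToyQueueServer:
    """One server: each named queue is FIFO, nothing is shared with peers."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def set(self, queue_name, item):   # SET = enqueue
        self._queues[queue_name].append(item)

    def get(self, queue_name):         # GET = dequeue, None when empty
        q = self._queues[queue_name]
        return q.popleft() if q else None

class ToyClient:
    """Spreads operations over servers, so global order is not preserved."""

    def __init__(self, servers, rng=None):
        self.servers = servers
        self.rng = rng or random.Random()

    def enqueue(self, queue_name, item):
        self.rng.choice(self.servers).set(queue_name, item)

    def dequeue(self, queue_name):
        for server in self.rng.sample(self.servers, len(self.servers)):
            item = server.get(queue_name)
            if item is not None:
                return item
        return None

client = ToyClient([ToyQueueServer() for _ in range(3)])
for job_id in range(10):
    client.enqueue("jobs", job_id)
drained = [client.dequeue("jobs") for _ in range(10)]
print(drained)  # every job appears once, in no guaranteed order
```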
Asynchronous Requests
•   Inbound traffic consumes a unicorn worker

•   Outbound traffic consumes a unicorn worker

•   The request pipeline should not be used to
    handle 3rd party communications or
    back-end work.

•   Reroute traffic to daemons
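The pattern above is to have the request handler enqueue the slow work and return immediately, while a separate daemon drains the queue. A minimal sketch using Python's standard library as a stand-in for the real queue and daemons; handle_request, notification_daemon, and the "202 Accepted" return value are illustrative assumptions.

```python
import queue
import threading

jobs = queue.Queue()
delivered = []

def handle_request(tweet):
    """Runs in the web worker: enqueue and return without doing slow work."""
    jobs.put(tweet)
    return "202 Accepted"

def notification_daemon():
    """Runs outside the request pipeline and performs the slow work."""
    while True:
        tweet = jobs.get()
        if tweet is None:          # sentinel: shut down
            jobs.task_done()
            break
        delivered.append(f"notified followers about: {tweet}")
        jobs.task_done()

worker = threading.Thread(target=notification_daemon, daemon=True)
worker.start()

status = handle_request("hello world")
jobs.put(None)   # ask the daemon to stop once the backlog is drained
jobs.join()      # wait until every queued item has been processed
print(status, delivered)
```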
Daemons
•   Daemons touch every tweet

•   Many different daemon types at Twitter

•   Old way: One daemon per type (Rails)

    •   New way: Fewer Daemons (Pure Ruby)

•   Daemon Slayer - A Multi Daemon that could
    do many different jobs, all at once.
Disk is the new Tape.
•   Social Networking application profile has
    many O(n^y) operations.

•   Page requests have to happen in < 500 ms or
    users start to notice. Goal: 250-300 ms

•   Web 2.0 isn’t possible without lots of RAM

•   SSDs? What to do?
Caching
•   We’re the real-time web, but lots of caching
    opportunity. You should cache what you get from us.

•   Most caching strategies rely on long TTLs (>60 s)

•   Separate memcache pools for different data types to
    prevent eviction

•   Optimize Ruby Gem to libmemcached + FNV Hash
    instead of Ruby + MD5

•   Twitter now largest contributor to libmemcached
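Two of the points above, separate pools per data type and a cheap FNV hash for key distribution, can be sketched together. The slide does not say which FNV variant was used; the example assumes 32-bit FNV-1a. The pool names, server names, and the simple modulo distribution are also assumptions for illustration.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5                         # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, kept to 32 bits
    return h

# Hypothetical pools: config data never competes with timelines for memory.
POOLS = {
    "timelines": ["tl-cache-1", "tl-cache-2", "tl-cache-3"],
    "config": ["cfg-cache-1", "cfg-cache-2"],
}

def server_for(pool, key):
    """Pick a server inside one pool by hashing the key."""
    servers = POOLS[pool]
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]

print(server_for("timelines", "timeline:12345"))
print(server_for("config", "darkmode:search"))
```

FNV needs only an XOR and a multiply per byte, which is far less work per key than MD5. Hash quality matters less here than speed, since the hash only spreads keys across servers.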
MySQL
•   Sharding large volumes of data is hard

•   Replication delay and cache eviction produce
    inconsistent results to the end user.

•   Locks create resource contention for popular
    data
MySQL Challenges
•   Replication Delay

    •   Single threaded. Slow.

•   Social Networking not good for RDBMS

    •   N x N relationships and social graph / tree
        traversal

    •   Disk issues (FS Choice, noatime, scheduling
        algorithm)
Relational Databases: not a Panacea
•   Good for:

    •   Users, Relational Data, Transactions

•   Bad for:

    •   Queues. Polling operations. Social Graph.

•   You don’t need ACID for everything.
Database Replication
•   Major issues around users and statuses tables

•   Multiple functional masters (FRP, FWP)

•   Make sure your code reads and writes to the
    right DBs. Reading from master = slow death

    •   Monitor the DB. Find slow / poorly designed
        queries

•   Kill long running queries before they kill you
    (mkill)
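The talk does not show how mkill works. As a rough sketch of the idea, the example below takes rows shaped like MySQL's SHOW PROCESSLIST output (Id, Command, Time, Info), picks queries that have run longer than a limit, and produces the matching KILL statements. The helper names, sample rows, and the five-second limit are assumptions; nothing is sent to a database.

```python
def queries_to_kill(processlist, max_seconds):
    """Ids of running queries older than max_seconds.

    Only rows with Command == "Query" qualify, so idle connections
    and replication threads are left alone.
    """
    return [
        row["Id"]
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > max_seconds
    ]

def kill_statements(processlist, max_seconds):
    """SQL statements an operator or script could then execute."""
    return [f"KILL {pid}" for pid in queries_to_kill(processlist, max_seconds)]

# Hypothetical snapshot of SHOW PROCESSLIST.
snapshot = [
    {"Id": 11, "Command": "Query", "Time": 42, "Info": "SELECT * FROM statuses"},
    {"Id": 12, "Command": "Sleep", "Time": 900, "Info": None},
    {"Id": 13, "Command": "Query", "Time": 1, "Info": "SELECT 1"},
]
statements = kill_statements(snapshot, max_seconds=5)
print(statements)
```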
Flock
•   Scalable Social Graph Store

•   Sharding via Gizzard

•   MySQL backend (many)

•   13 billion edges,
    100K reads/second

•   Open Source!

[Diagram: Flock -> Gizzard -> multiple MySQL instances]
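To make "sharding" concrete, here is a generic range-based routing sketch: a sorted table of lower bounds maps each user id to one backend. This illustrates the general technique only and is not Gizzard's implementation; the bounds, backend names, and shard_for are hypothetical.

```python
import bisect

# Hypothetical routing table: ids >= LOWER_BOUNDS[i] (and below the next
# bound) live on SHARDS[i].
LOWER_BOUNDS = [0, 1_000_000, 2_000_000]
SHARDS = ["mysql-a", "mysql-b", "mysql-c"]

def shard_for(user_id):
    """Return the backend that stores edges whose source is user_id."""
    index = bisect.bisect_right(LOWER_BOUNDS, user_id) - 1
    if index < 0:
        raise ValueError(f"no shard covers id {user_id}")
    return SHARDS[index]

for uid in (42, 1_000_000, 5_000_000):
    print(uid, "->", shard_for(uid))
```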
Cassandra
•   Originally written by Facebook

•   Distributed Data Store

•   @rk’s changes to Cassandra Open Sourced

•   Currently double-writing into it

•   Transitioning to 100% soon.
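Double-writing during a migration usually means every write goes to both stores while reads still come from the old one, so the new store can be checked before cutover. A minimal sketch with plain dictionaries standing in for the two data stores; the DualWriter class and its failure handling are assumptions, not a description of Twitter's code.

```python
class DualWriter:
    """Write to both stores; keep serving reads from the primary."""

    def __init__(self, primary, secondary):
        self.primary = primary        # existing store, source of truth
        self.secondary = secondary    # new store being validated
        self.failed = []              # (key, error) pairs to repair later

    def write(self, key, value):
        self.primary[key] = value
        try:
            self.secondary[key] = value
        except Exception as exc:      # never fail the request on the new store
            self.failed.append((key, exc))

    def read(self, key):
        return self.primary[key]

old_store, new_store = {}, {}
store = DualWriter(old_store, new_store)
store.write("status:1", "hello world")
print(store.read("status:1"), new_store == old_store)
```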
Lessons Learned
•   Instrument everything. Start graphing early.

•   Cache as much as possible

•   Start working on scaling early.

•   Don’t rely on memcache, and don’t rely on the
    database

•   Don’t use mongrel. Use Unicorn.
Join Us!
@jointheflock
Q&A
Thanks!
•   @jointheflock

•   http://twitter.com/jobs

•   Download our work

    •   http://twitter.com/about/opensource

Chirp 2010: Scaling Twitter

  • 1.
  • 2. Billions of Hits: Scaling Twitter John Adams Twitter Operations
  • 4. John Adams @netik • Early Twitter employee (mid-2008) • Lead engineer: Outward Facing Services (Apache, Unicorn, SMTP), Auth, Security • Keynote Speaker: O’Reilly Velocity 2009 • O’Reilly Web 2.0 Speaker (2008, 2010) • Previous companies: Inktomi, Apple, c|net • Working on Web Operations book with John Alspaw (flickr, etsy), out in June
  • 6. 752% 2008 Growth source: comscore.com - (based only on www traffic, not API)
  • 7. 1358% 2009 Growth source: comscore.com - (based only on www traffic, not API)
  • 8. 12 th most popular source: alexa.com
  • 9. 55M Tweets per day (640 TPS/sec, 1000 TPS/sec peak) source: twitter.com internal
  • 10. 600M Searches/Day source: twitter.com internal
  • 11. 25% Web API 75%
  • 12. Operations • What do we do? • Site Availability • Capacity Planning (metrics-driven) • Configuration Management • Security • Much more than basic Sysadmin
  • 13. What have we done? • Improved response time, reduced latency • Less errors during deploys (Unicorn!) • Faster performance • Lower MTTD (Mean time to Detect) • Lower MTTR (Mean time to Recovery)
  • 14. Operations Mantra Move to Find Take Next Weakest Corrective Weakest Point Action Point Metrics + Logs + Science = Process Repeatability Analysis
  • 15. Make an attack plan. Symptom Bottleneck Vector Solution HTTP Bandwidth Network Servers++ Latency Timeline Update Better Database Delay Delay algorithm Status Flock Database Delays Growth Cassandra Updates Algorithm Latency Algorithms
  • 16. Finding Weakness • Metrics + Graphs • Individual metrics are irrelevant • We aggregate metrics to find knowledge • Logs • SCIENCE!
  • 17. Monitoring • Twitter graphs and reports critical metrics in as near real time as possible • If you build tools against our API, you should too. • RRD, other Time-Series DB solutions • Ganglia + custom gmetric scripts • dev.twitter.com - API availability
  • 18. Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
  • 19. Data Analysis • Instrumenting the world pays off. • “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!” “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
  • 20. Forecasting Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) unsigned int (32 bit) Twitpocolypse status_id signed int (32 bit) Twitpocolypse r2=0.99
  • 22. External API Dashbord http://dev.twitter.com/status
  • 23. What’s a Robot ? • Actual error in the Rails stack (HTTP 500) • Uncaught Exception • Code problem, or failure / nil result • Increases our exception count • Shows up in Reports
  • 24. What’s a Whale ? • HTTP Error 502, 503 • Twitter has a hard and fast five second timeout • We’d rather fail fast than block on requests • We also kill long-running queries (mkill) • Timeout
  • 25. Whale Watcher • Simple shell script, • MASSIVE WIN by @ronpepsi • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines last 60 seconds of aggregated daemon / www logs • “Whales per Second” > Wthreshold • Thar be whales! Call in ops.
  • 26. Deploy Watcher Sample window: 300.0 seconds First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010) Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010) PRODUCTION APACHE: ALL OK PRODUCTION OTHER: ALL OK WEB0049 CANARY APACHE: ALL OK WEB0049 CANARY BACKEND SERVICES: ALL OK DAEMON0031 CANARY BACKEND SERVICES: ALL OK DAEMON0031 CANARY OTHER: ALL OK
  • 27. Feature “Darkmode” • Specific site controls to enable and disable computationally or IO-Heavy site function • The “Emergency Stop” button • Changes logged and reported to all teams • Around 60 switches we can throw • Static / Read-only mode
  • 28. request flow Load Balancers Apache mod_proxy Rails (Unicorn) Flock memcached Kestrel MySQL Cassandra Daemons
  • 29. Servers • Co-located, dedicated machines at NTT America • No clouds; Only for monitoring, not serving • Need raw processing power, latency too high in existing cloud offerings • Frees us to deal with real, intellectual, computer science problems. • Moving to our own data center soon
  • 30. unicorn • A single socket Rails application Server (Rack) • Zero Downtime Deploys (!) • Controlled, shuffled transfer to new code • Less memory, 30% less CPU • Shift from mod_proxy_balancer to mod_proxy_pass • HAProxy, Ngnix wasn’t any better. really.
  • 31. Rails • Mostly only for front-end. • Back end mostly Scala and pure ruby • Not to blame for our issues. Analysis found: • Caching + Cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the db • Queue Latency • Replication Lag
  • 32. memcached • memcached isn’t perfect. • Memcached SEGVs hurt us early on. • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Network Memory Bus isn’t infinite • Segmented into pools for better performance
  • 33. Loony • Central machine database (MySQL) • Python, Django, Paramiko SSH • Paramiko - Twitter OSS (@robey) • Ties into LDAP groups • When data center sends us email, machine definitions built in real-time
  • 34. Murder • @lg rocks! • BitTorrent-based replication for deploys • ~30-60 seconds to update >1k machines • P2P - Legal, valid, Awesome.
  • 35. Kestrel • @robey • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala.
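The queue semantics on this slide (SET enqueues, GET dequeues, no shared state, no strict global ordering) can be illustrated with an in-memory sketch. This is not Kestrel's code or wire protocol; `QueueServer` and `QueueClient` are hypothetical names, and the random server choice is an assumption about how a client might spread load.

```python
import random
from collections import defaultdict, deque


class QueueServer:
    """One independent server: each named queue is FIFO locally."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def set(self, queue_name, item):
        # SET = enqueue
        self._queues[queue_name].append(item)

    def get(self, queue_name):
        # GET = dequeue; None signals an empty queue
        queue = self._queues[queue_name]
        return queue.popleft() if queue else None


class QueueClient:
    """Picks a server at random, so ordering across servers is not strict."""

    def __init__(self, servers, rng=None):
        self._servers = list(servers)
        self._rng = rng or random.Random()

    def enqueue(self, queue_name, item):
        self._rng.choice(self._servers).set(queue_name, item)

    def dequeue(self, queue_name):
        return self._rng.choice(self._servers).get(queue_name)
```

Because servers never coordinate, adding capacity is just a matter of adding servers; the price is that jobs must tolerate being processed out of order.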
  • 36. Asynchronous Requests • Inbound traffic consumes a unicorn worker • Outbound traffic consumes a unicorn worker • The request pipeline should not be used to handle 3rd party communications or back-end work. • Reroute traffic to daemons
  • 37. Daemons • Daemons touch every tweet • Many different daemon types at Twitter • Old way: One daemon per type (Rails) • New way: Fewer Daemons (Pure Ruby) • Daemon Slayer - A Multi Daemon that could do many different jobs, all at once.
  • 38. Disk is the new Tape. • Social Networking application profile has many O(n^y) operations. • Page requests have to happen in < 500 ms or users start to notice. Goal: 250-300 ms • Web 2.0 isn’t possible without lots of RAM • SSDs? What to do?
  • 39. Caching • We’re the real-time web, but lots of caching opportunity. You should cache what you get from us. • Most caching strategies rely on long TTLs (>60 s) • Separate memcache pools for different data types to prevent eviction • Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5 • Twitter now largest contributor to libmemcached
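To make the hashing and pooling points concrete, here is a small sketch. The slide only says "FNV Hash", so the 32-bit FNV-1a variant is an assumption, as are the pool names, the server addresses, and the simple modulo placement (a real client may use consistent hashing instead).

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


# Hypothetical pools: separate servers per data type, so heavy churn in one
# kind of data cannot evict another kind.
POOLS = {
    "timelines": ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"],
    "users": ["10.0.1.1:11211", "10.0.1.2:11211"],
}


def pick_server(pool_name: str, key: str) -> str:
    """Map a cache key to one server inside the named pool."""
    servers = POOLS[pool_name]
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]
```

FNV needs only one XOR and one multiply per byte, which is why it is attractive compared with computing MD5 for every cache key.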
  • 40. MySQL • Sharding large volumes of data is hard • Replication delay and cache eviction produce inconsistent results to the end user. • Locks create resource contention for popular data
  • 41. MySQL Challenges • Replication Delay • Single threaded. Slow. • Social Networking not good for RDBMS • N x N relationships and social graph / tree traversal • Disk issues (FS Choice, noatime, scheduling algorithm)
  • 42. Relational Databases not a Panacea • Good for: • Users, Relational Data, Transactions • Bad: • Queues. Polling operations. Social Graph. • You don’t need ACID for everything.
  • 43. Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the right DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long running queries before they kill you (mkill)
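The advice about reading and writing to the appropriate databases can be sketched as a tiny router. This is an illustrative assumption, not Twitter's code: `QueryRouter`, the host names, and the rule "SELECT goes to a replica, everything else to the master" are all hypothetical simplifications (for example, reads that must see their own writes would need the master).

```python
import itertools


class QueryRouter:
    """Send reads to replicas (round-robin) and writes to the master."""

    def __init__(self, master, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        # Look at the first keyword only; good enough for a sketch.
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._replicas)
        return self.master
```

Keeping the routing decision in one place makes it harder for a stray read to land on the master, which the slide warns leads to "slow death".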
  • 44. Flock • Scalable Social Graph Store • Sharding via Gizzard • MySQL backend (many) • 13 billion edges, 100K reads/second • Open Source!
  • 45. Cassandra • Originally written by Facebook • Distributed Data Store • @rk’s changes to Cassandra Open Sourced • Currently double-writing into it • Transitioning to 100% soon.
  • 46. Lessons Learned • Instrument everything. Start graphing early. • Cache as much as possible • Start working on scaling early. • Don’t rely on memcache, and don’t rely on the database • Don’t use mongrel. Use Unicorn.
  • 48. Q&A
  • 49. Thanks! • @jointheflock • http://twitter.com/jobs • Download our work • http://twitter.com/about/opensource