SCALING HIGH-AVAILABILITY
INFRASTRUCTURE IN THE CLOUD

OCT 11, 2011, WEB 2.0
twilio, CLOUD COMMUNICATIONS
EVAN COOKE, CO-FOUNDER & CTO
High-Availability
 Sounds good, we need that!

 Yummmm Technical Meat!
Availability = Uptime / (Uptime + Downtime)
High-Availability
            Sounds good, we need that!


  Availability %         Downtime/yr     Downtime/mo

99.9% ("three nines")      8.76 hours     43.2 minutes

99.99% ("four nines")    52.56 minutes    4.32 minutes

99.999% ("five nines")     5.26 minutes    25.9 seconds

99.9999% ("six nines")    31.5 seconds    2.59 seconds
Can’t rely on a human to respond within a 5-minute window!
Must use automation.
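The downtime figures in the table follow directly from the availability formula. A minimal sketch of that arithmetic, assuming a 365-day year and a 30-day month (the function name `downtime_budget` is made up for illustration):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month, which matches the table


def downtime_budget(availability_pct):
    """Return the allowed downtime in seconds as (per_year, per_month)."""
    unavailable = 1.0 - availability_pct / 100.0
    return unavailable * SECONDS_PER_YEAR, unavailable * SECONDS_PER_MONTH


for pct in (99.9, 99.99, 99.999, 99.9999):
    per_year, per_month = downtime_budget(pct)
    print(f"{pct}%: {per_year / 60:.2f} min/yr, {per_month:.2f} s/mo")
```

At five nines the yearly budget is a little over five minutes, which is why the slide argues for automation over human response.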
Happens to the best

2.5 Hours Down (September 23, 2010)
“...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”

11 Hours Down (October 4, 2010)
“...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”

Hours (November 14, 2010)
“...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good.”
Causes of Downtime
Lack of best practice change control
Lack of best practice monitoring of the relevant components
Lack of best practice requirements and procurement
Lack of best practice operations
Lack of best practice avoidance of network failures
Lack of best practice avoidance of internal application failures
Lack of best practice avoidance of external services that fail
Lack of best practice physical environment
Lack of best practice network redundancy
Lack of best practice technical solution of backup
Lack of best practice process solution of backup
Lack of best practice physical location
Lack of best practice infrastructure redundancy
Lack of best practice storage architecture redundancy
                    E. Marcus and H. Stern, Blueprints for high availability, second edition.
                                   Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.
Cloud / Non-Cloud

Data Persistence
  - storage architecture redundancy
  - technical solution of backup
  - process solution of backup
Change Control
  - change control
Operations
  - monitoring of the relevant components
  - requirements and procurement
  - operations
  - avoidance of internal app failures
  - avoidance of external services that fail
Datacenter
  - avoidance of network failures
  - physical environment
  - network redundancy
  - physical location
  - infrastructure redundancy
Happens to the best

2.5 Hours Down (September 23, 2010): Database
11 Hours Down (October 4, 2010): Database
Hours (November 14, 2010): Database, Change Control
Today
  Data Persistence
  Change Control
  lessons learned @twilio
Twilio provides web service APIs to
  automate Voice and SMS communications

Carriers, Developer, End User
  - Voice: Inbound Calls, Outbound Calls, Mobile/Browser VoIP
  - SMS: Send To/From Phone Numbers, Short Codes
  - Phone Numbers: Dynamically Buy Phone Numbers
[Chart: 2009, 2010, 2011; data labels 3, 6, 20, 70+]
100x Growth in Tx/Day over 1 Year
[Chart: log scale from X through 10X to 100X, over 1 Year]
2009: 10 Servers
2010: 10’s of Servers
2011: 100’s of Servers
2011
• 100’s of prod hosts in continuous
  operation
• 80+ service types running in prod
• 50+ prod database servers
• Prod deployments several times/day
  across 7 engineering teams
2011
• Frameworks
 - PHP for frontend components
 - Python Twisted & gevent for async network
   services
 - Java for backend services
• Storage technology
 - MySQL for core DB services
 - Redis for queuing and messaging
Data persistence is hard
(especially in the cloud)

Data persistence is the hardest technical problem most scalable SaaS businesses face
What is data persistence?
     Stuff that looks like this:
        Databases
        Queues
        Files
Incoming Requests → LB

Tier 1: A, A (each with a queue, Q), plus SQL
Tier 2: B, B, B, B
Tier 3: C, C, D, D, plus Files and K/V

Data Persistence! (the Qs, SQL, Files and K/V)
Why is persistence so hard?
• Difficult to change structure
  - Huge inertia e.g., large schema migrations
• Painful to recover from disk/node failures
  - “just boot a new node” doesn’t work
• Woeful performance/scalability
  - I/O is a huge bottleneck in modern servers (e.g. EC2)
• Freakin’ complex!!!
  - Atomic transactions/rollback, ACID, blah blah blah
Difficult to Change Structure

ALTER TABLE names
DROP COLUMN Value

Before:
Id   Name    Value
 1   Bob     12
 2   Jane    78
 3   Steve   56
 ...
500 million rows

After (HOURS later...):
Id   Name
 1   Bob
 2   Jane
 3   Steve

‣ You live with data decisions for a long time
Painful to Recover from Failures

[Diagram: W, R, R traffic against a Primary DB and a Secondary DB]
  - Data on secondary?
  - How much data?
  - R/W consistency?

‣ Because of complexity, failover is a human process
Woeful Performance/Scalability
EC2 m1.xlarge, RAID 0 over 4x ephemeral disks
Device:   rrqm/s   wrqm/s   r/s   w/s     rMB/s    wMB/s avgrq-sz avgqu-sz    await svctm %util
sda1        0.00     0.00 0.00 0.00        0.00     0.00     0.00     0.00     0.00  0.00  0.00
sdb       169.31   111.88 57.43 469.31      0.90     2.25    12.24     2.29     4.36  1.12 59.01
sdc       178.22   110.89 59.41 396.04      0.93     1.98    13.08     1.58     3.50  1.18 53.56
sdd       145.54   102.97 50.50 384.16      0.78     1.90    12.63     1.00     2.34  1.03 44.85
sde       166.34    95.05 54.46 337.62      0.85     1.69    13.27     1.12     2.84  1.22 47.92
md0       0.00     0.00 880.20 2007.92      3.44     7.82     7.99     0.00     0.00  0.00  0.00




                                         ~10 MB/s write


‣ Poor I/O on cloud today, 100x slower than real HW
Woeful Performance/Scalability



             DB DB DB DB DB DB




  ‣ Difficult to horizontally scale in the cloud
@!#$%^&* Complex
• Incredibly complex configuration
  - Billion knobs and buttons
  - Whole companies exist just to tune DBs
• Lots of consistency/transactional models
• Multi-region data is unsolved - Facebook and Google struggle

(MySQL InnoDB status output, cut off at the right edge on the slide)
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 11655168000; in
Internal hash tables (constant factor
    Adaptive hash index 223758224 (179
    Page hash           11248264
    Dictionary cache    45048690 (449
    File system         84400 (82672
    Lock system         28180376 (281
    Recovery system     0 (0 + 0)
    Threads             428608    (406
Dictionary memory allocated 57346
Buffer pool size        693759
Buffer pool size, bytes 11366547456
Free buffers            1
Database pages          691085
Old database pages      255087
Modified db pages       326490
Pending reads 0
Pending writes: LRU 0, flush list 0, s
Pages made young 497782847, not young
24.78 youngs/s, 0.00 non-youngs/s
Pages read 447257683, created 16982810
24.82 reads/s, 1.14 creates/s, 33.36 w
Buffer pool hit rate 993 / 1000, young
Deep breath, step back
 Think about each problem
 (use @twilio examples)

 • Software that runs in the cloud
 • Open source
1
    Difficult to Change Structure
     • Don’t have structure
       - key/value databases (SimpleDB, Cassandra)
       - document-oriented databases (CouchDB, MongoDB)
     • Don’t store a lot of data...
1
           Don’t Store Stuff
    • Outsource data as much as possible
    • But NOT to your customers
1
           Don’t Store Stuff
    • Aggressively archive and move data offline
      [Diagram: ~500M Rows (keep indices in memory) → S3/SimpleDB]
    • Build UX that supports longer/restricted
      access times to older data
1
           Don’t Store Stuff
    • Avoid stateful systems/architectures where
      possible

    [Diagram: Browser (Cookie: SessionID) → Web, Web, Web → Session DB]
1
           Don’t Store Stuff
    • Avoid stateful systems/architectures where
      possible
    • Store state in client browser

    [Diagram: Browser (Cookie: enc($session)) → Web, Web, Web; Session DB]
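One way to picture "store state in client browser" is a cookie that any web node can verify without a session DB lookup. The sketch below is an illustration only, not Twilio's implementation. The slide writes `enc($session)`, which suggests encryption; this sketch uses only the Python standard library, so it signs the payload with HMAC (tamper-proof, but readable by the client). Real encryption would need a third-party library. `SECRET_KEY`, `encode_session` and `decode_session` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared secret; every web node would need the same value.
SECRET_KEY = b"replace-with-a-real-secret"


def encode_session(session):
    """Serialize the session dict and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(session, sort_keys=True).encode())
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig


def decode_session(cookie):
    """Return the session dict, or None if the cookie is malformed or tampered with."""
    try:
        payload, sig = cookie.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; the payload is only parsed after it verifies.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode()))


cookie = encode_session({"user_id": 42})
print(decode_session(cookie))
```

Because each web node can validate the cookie on its own, losing a node loses no session state.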
2
 Painful to Recover from Failures
   • Avoid single points of failure
     -   E.g., master-master (active/active)
     -   Complex to set up, complex failure modes
     -   Sometimes it’s the only solution
     -   Lots of great docs on web

   • Minimize number of stateful nodes, separate
     stateful & stateless components...
2
    Separate Stateful and Stateless
            Components

    Req → App A → App B → App C

    On failure (App B), even if we boot a
    replacement, we lose data
2
    Separate Stateful and Stateless
            Components

    Req → App A → App B → App C, with a Queue at each app

    On failure (App B and its Queue), even if we boot a
    replacement, we lose data
2
    Separate Stateful and Stateless
            Components
    Keep connection open for whole app path!
    (hint: use evented framework)

    Req → App A → App B → App C (several instances of each app)

    Twilio’s SMS stack uses this approach

    On failure, we don’t lose a single request
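A rough sketch of the idea of keeping the request open along the whole path, written with Python's `asyncio` as a stand-in for "an evented framework". It is not Twilio's code. The caller is answered only after every tier has succeeded, so no intermediate node holds state that a crash could lose, and a failed instance is simply skipped in favor of a replica. `NodeDown` and the `app_*` functions are simulated, hypothetical services.

```python
import asyncio


class NodeDown(Exception):
    """Raised by a (simulated) app instance that has failed."""


async def call_with_failover(instances, payload):
    """Try each redundant instance in turn; raise only if all of them fail."""
    last_error = None
    for instance in instances:
        try:
            return await instance(payload)
        except NodeDown as err:
            last_error = err  # this instance is gone; try the next replica
    raise last_error or NodeDown("no instances configured")


async def handle_request(payload, tiers):
    """Hold the request open while it flows through every tier (A -> B -> C)."""
    for instances in tiers:
        payload = await call_with_failover(instances, payload)
    return payload  # the caller is answered only once the whole path succeeded


# Simulated stateless app instances (hypothetical stand-ins for real services).
async def app_a(path):
    return path + ["A"]


async def app_b_dead(path):
    raise NodeDown("App B instance 1 is down")


async def app_b(path):
    return path + ["B"]


async def app_c(path):
    return path + ["C"]


result = asyncio.run(handle_request([], [[app_a], [app_b_dead, app_b], [app_c]]))
print(result)
```

The design choice is that stateless tiers never acknowledge work they have not finished, so a retry against another instance is always safe, provided the downstream operations are idempotent.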
2
 Painful to Recover from Failures
   • Build a data change control process to
     avoid mistakes and errors...
Components deployed at different frequencies:
 Partially Continuous Deployment

Deployment Frequency (Risk), log scale, 4 buckets:
  1000x  Website Content (CMS)
  100x   Website Code (PHP/Ruby etc.)
  10x    REST API (Python/Java etc.)
  1x     Big DB Schema (SQL)
Deployment Processes
  Website Content:  One Click
  Website Code:     CI Tests, One Click
  REST API:         CI Tests, Human Sign-off, One Click
  Big DB Schema:    CI Tests, Human Sign-off, Human Assisted Click
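The four deployment processes can be read as a table of gates per risk bucket. A minimal sketch under that reading follows. The gate names mirror the slide, while `DEPLOY_GATES` and `can_deploy` are hypothetical names and do not describe Twilio's actual tooling.

```python
# Gates per component, in order; the last entry is the action that ships the change.
DEPLOY_GATES = {
    "website_content": ["one_click"],
    "website_code": ["ci_tests", "one_click"],
    "rest_api": ["ci_tests", "human_signoff", "one_click"],
    "big_db_schema": ["ci_tests", "human_signoff", "human_assisted_click"],
}


def can_deploy(component, completed):
    """Return (ok, missing): ok is True once every gate before the final click is done."""
    *prerequisites, _final_click = DEPLOY_GATES[component]
    missing = [gate for gate in prerequisites if gate not in completed]
    return (not missing, missing)


print(can_deploy("website_content", set()))
print(can_deploy("rest_api", {"ci_tests"}))
```

Riskier components get more gates, and the riskiest one (a big DB schema change) never ships on an unattended click.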
3
    Woeful Performance/Scalability
      • If disk I/O is poor, avoid disk
        - Tune tune tune. Keep your indices in memory
        - Use an in-memory datastore e.g., Redis and
          configure replication such that if you have a master
          failure, you can always promote a slave

      • When disk I/O saturates, shard
        - Lots of sharding info on the web
        - Method of last resort, single point of failure
          becomes multiple single points of failure
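As a minimal illustration of sharding, the sketch below routes each key to one of several databases by hashing it. The shard names and the `shard_for` helper are hypothetical. Plain modulo placement is the simplest scheme, not necessarily the one used at Twilio, and it forces data to move whenever the shard count changes.

```python
import hashlib

# Hypothetical shard names; each one is still its own single point of failure.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(key, shards=SHARDS):
    """Map a key to a shard deterministically via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


for account in ("account-1", "account-2", "account-3"):
    print(account, "->", shard_for(account))
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the mapping identical on every node.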
4
                @#$%^&* Complex

    • Bring the simplest tool to the job
      - Use a strictly consistent store only if you need it
      - If you don’t need HA, don’t add the complexity
    • There is no magic database. Decompose requirements,
      mix-and-match datastores as needed...

    [Image: Magic Database. Magic Database does it all. Consistency,
    Availability, Partition-tolerance, it's got all three.]
4
            Twilio Data Lifecycle


      CREATE        UPDATE        UPDATE

      name:foo      name:foo      name:foo     name:foo
    status:INIT   status:QUEUED status:GOING status:DONE
       ret:0         ret:0         ret:0        ret:42

          Twilio Examples: Call, SMS, Conference
          Other Examples: Order, Workflow, $
4
            Twilio Data Lifecycle
    In-Flight: status INIT, QUEUED, GOING
    Post-Flight: status DONE
4
           Twilio Data Lifecycle
                      Applications
    In-Flight
    • Atomically update part of a workflow
    Post-Flight
    • Billing
    • Log Access
    • Analytics
    • Reporting
4
           Twilio Data Lifecycle
                       Properties
    In-Flight
    • High-Availability
    • Strict Consistency
    • Key/Value
    • ~20ms
    Post-Flight
    • Eventual Consistency
    • Range Queries w/ Filters
    • ~200ms
4
         Twilio Data Lifecycle
     Systems with very different access semantics
    In-Flight: Data Store A
    Post-Flight: Data Store B
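The lifecycle split can be sketched as a small state machine that keeps a record in one store while it is in flight and hands it to a second store once it is DONE. This is an illustration only: the two in-memory containers stand in for Data Store A and Data Store B, and `LifecycleRouter` is a hypothetical name.

```python
# Status order taken from the slides: INIT -> QUEUED -> GOING -> DONE.
TRANSITIONS = {"INIT": "QUEUED", "QUEUED": "GOING", "GOING": "DONE"}


class LifecycleRouter:
    def __init__(self):
        self.in_flight = {}    # stand-in for Data Store A (strictly consistent key/value)
        self.post_flight = []  # stand-in for Data Store B (eventually consistent, queryable)

    def create(self, name):
        self.in_flight[name] = {"name": name, "status": "INIT", "ret": 0}

    def advance(self, name, ret=0):
        """Move a record to its next status; a DONE record leaves the in-flight store."""
        record = self.in_flight[name]
        record["status"] = TRANSITIONS[record["status"]]
        if record["status"] == "DONE":
            record["ret"] = ret
            self.post_flight.append(self.in_flight.pop(name))
        return record


router = LifecycleRouter()
router.create("foo")
router.advance("foo")          # QUEUED
router.advance("foo")          # GOING
print(router.advance("foo", ret=42))
```

Only the small in-flight set needs strict consistency and low latency. Everything that has finished can live in a store tuned for range queries and volume.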
4
    In-Flight
      - Strict Consistency
      - Key/Value
      - ~20ms
      - 10k-1M
    Post-Flight (each fed through a queue, Q)
      - Logs (REST API): eventual consistency, range queries, filtered queries, ~200ms, Billions
      - Reporting: eventual consistency, arbitrary queries, high latency, Billions
      - Billing: idempotent aggregation, key/value, Billions
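The billing path is described as idempotent aggregation. A minimal sketch of what that property means, assuming each event carries a unique id (the class and field names are hypothetical): replaying an event from the queue must not change the total.

```python
class IdempotentBillingAggregator:
    def __init__(self):
        self.totals = {}   # account -> accumulated amount in cents
        self.seen = set()  # ids of events that were already applied

    def apply(self, event_id, account, cents):
        """Apply an event once; return False if it was a duplicate delivery."""
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        self.totals[account] = self.totals.get(account, 0) + cents
        return True


agg = IdempotentBillingAggregator()
agg.apply("call-1", "acct-A", 150)
agg.apply("call-1", "acct-A", 150)  # redelivered by the queue; ignored
agg.apply("sms-7", "acct-A", 10)
print(agg.totals)
```

With this property the queue in front of billing can redeliver after a failure without risk of double charging. A real system would keep the seen-id set in durable storage.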
4
    In-Flight: MySQL, PostgreSQL, Redis, NDB
    Post-Flight (each fed through a queue, Q)
      - Logs (REST API): SQL Sharded, Cassandra/Acunu, MongoDB, Riak, CouchDB
      - Reporting: Hadoop
      - Billing: SQL Sharded, Redis
Why is persistence so hard?
• Difficult to change structure
  - Huge inertia e.g., large schema migrations
  ‣ Don’t store stuff! Go schema-less
• Painful to recover from disk/node failures
  - “just boot a new node” doesn’t work
  ‣ Separate stateful/stateless. Change control processes
• Woeful performance/scalability
  - I/O is a huge bottleneck in modern servers (e.g. EC2)
  ‣ Memory FTW. Shard
• Freakin’ complex!!!
  - Atomic transactions/rollback, ACID, blah blah blah
  ‣ Decompose data lifecycle. Minimize complexity
Incoming Requests → LB (idempotent request path)

Tier 1: A, A
  - Aggregate into HA queues (Q, Q)
  - SQL → Master-Master MySQL
Tier 2: B, B, B, B
Tier 3: C, C, D, D
  - Move file store to S3
  - Move K/V to SimpleDB w/ local cache
HA is Hard
SCALING HIGH-AVAILABILITY
INFRASTRUCTURE IN THE CLOUD


     Focus on data
       How you store it
      Where you store it
     When you can delete it
     Control changes to it
Open Problems...
In-Flight
  - Simple multi-AZ, multi-region consistent K/V
HA queue (Q) between In-Flight and Post-Flight
Post-Flight
  - Logs (REST API): massively scalable range queries, filterable, ~200ms
  - Reporting: simple HA Hadoop
  - Billing: massively scalable aggregator
twilio
http://www.twilio.com
      @emcooke

Más contenido relacionado

La actualidad más candente

SOC Lessons from DevOps and SRE by Anton Chuvakin
SOC Lessons from DevOps and SRE by Anton ChuvakinSOC Lessons from DevOps and SRE by Anton Chuvakin
SOC Lessons from DevOps and SRE by Anton ChuvakinAnton Chuvakin
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)Hussain Mansoor
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at ScaleMongoDB
 
DevSecOps Implementation Journey
DevSecOps Implementation JourneyDevSecOps Implementation Journey
DevSecOps Implementation JourneyDevOps Indonesia
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureKai Wähner
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Secure your Azure and DevOps in a smart way
Secure your Azure and DevOps in a smart waySecure your Azure and DevOps in a smart way
Secure your Azure and DevOps in a smart wayEficode
 
From Mobile to MongoDB: Store your app's data using Realm
From Mobile to MongoDB: Store your app's data using RealmFrom Mobile to MongoDB: Store your app's data using Realm
From Mobile to MongoDB: Store your app's data using RealmDiego Freniche Brito
 
Azure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | Edureka
Azure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | EdurekaAzure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | Edureka
Azure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | EdurekaEdureka!
 
Accountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact CheckingAccountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact CheckingIsabelle Augenstein
 
End-to-End CI/CD at scale with Infrastructure-as-Code on AWS
End-to-End CI/CD at scale with Infrastructure-as-Code on AWSEnd-to-End CI/CD at scale with Infrastructure-as-Code on AWS
End-to-End CI/CD at scale with Infrastructure-as-Code on AWSBhuvaneswari Subramani
 
Automating a PostgreSQL High Availability Architecture with Ansible
Automating a PostgreSQL High Availability Architecture with AnsibleAutomating a PostgreSQL High Availability Architecture with Ansible
Automating a PostgreSQL High Availability Architecture with AnsibleEDB
 
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of KubernetesDevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of KubernetesDevOps.com
 
Devops Devops Devops
Devops Devops DevopsDevops Devops Devops
Devops Devops DevopsKris Buytaert
 
Neo4j Bloom for Project Teams: Browser-Based and Multi-User Enabled
Neo4j Bloom for Project Teams: Browser-Based and Multi-User EnabledNeo4j Bloom for Project Teams: Browser-Based and Multi-User Enabled
Neo4j Bloom for Project Teams: Browser-Based and Multi-User EnabledNeo4j
 
MongoDB Performance Tuning
MongoDB Performance TuningMongoDB Performance Tuning
MongoDB Performance TuningPuneet Behl
 

La actualidad más candente (20)

SOC Lessons from DevOps and SRE by Anton Chuvakin
SOC Lessons from DevOps and SRE by Anton ChuvakinSOC Lessons from DevOps and SRE by Anton Chuvakin
SOC Lessons from DevOps and SRE by Anton Chuvakin
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 
DevSecOps Implementation Journey
DevSecOps Implementation JourneyDevSecOps Implementation Journey
DevSecOps Implementation Journey
 
SRE vs DevOps
SRE vs DevOpsSRE vs DevOps
SRE vs DevOps
 
DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019DevSecOps: What Why and How : Blackhat 2019
DevSecOps: What Why and How : Blackhat 2019
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Secure your Azure and DevOps in a smart way
Secure your Azure and DevOps in a smart waySecure your Azure and DevOps in a smart way
Secure your Azure and DevOps in a smart way
 
From Mobile to MongoDB: Store your app's data using Realm
From Mobile to MongoDB: Store your app's data using RealmFrom Mobile to MongoDB: Store your app's data using Realm
From Mobile to MongoDB: Store your app's data using Realm
 
Azure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | Edureka
Azure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | EdurekaAzure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | Edureka
Azure DevOps Tutorial | Developing CI/ CD Pipelines On Azure | Edureka
 
Accountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact CheckingAccountable and Robust Automatic Fact Checking
Accountable and Robust Automatic Fact Checking
 
DevOps & SRE at Google Scale
DevOps & SRE at Google ScaleDevOps & SRE at Google Scale
DevOps & SRE at Google Scale
 
End-to-End CI/CD at scale with Infrastructure-as-Code on AWS
End-to-End CI/CD at scale with Infrastructure-as-Code on AWSEnd-to-End CI/CD at scale with Infrastructure-as-Code on AWS
End-to-End CI/CD at scale with Infrastructure-as-Code on AWS
 
Automating a PostgreSQL High Availability Architecture with Ansible
Automating a PostgreSQL High Availability Architecture with AnsibleAutomating a PostgreSQL High Availability Architecture with Ansible
Automating a PostgreSQL High Availability Architecture with Ansible
 
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of KubernetesDevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
DevOps vs. Site Reliability Engineering (SRE) in Age of Kubernetes
 
Devops Devops Devops
Devops Devops DevopsDevops Devops Devops
Devops Devops Devops
 
Implementing DevSecOps
Implementing DevSecOpsImplementing DevSecOps
Implementing DevSecOps
 
Neo4j Bloom for Project Teams: Browser-Based and Multi-User Enabled
Neo4j Bloom for Project Teams: Browser-Based and Multi-User EnabledNeo4j Bloom for Project Teams: Browser-Based and Multi-User Enabled
Neo4j Bloom for Project Teams: Browser-Based and Multi-User Enabled
 
MongoDB Performance Tuning
MongoDB Performance TuningMongoDB Performance Tuning
MongoDB Performance Tuning
 


High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

  • 1. SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD OCT 11, 2011, WEB 2.0 twilio CLOUD COMMUNICATIONS EVAN COOKE CO-FOUNDER & CTO
  • 2. High-Availability: Sounds good, we need that! Yummmm Technical Meat!
  • 3. High-Availability: Sounds good, we need that! Availability = Uptime / (Uptime + Downtime)
  • 4. High-Availability: Sounds good, we need that!
      Availability %             Downtime/yr      Downtime/mo
      99.9% ("three nines")      8.76 hours       43.2 minutes
      99.99% ("four nines")      52.56 minutes    4.32 minutes
      99.999% ("five nines")     5.26 minutes     25.9 seconds
      99.9999% ("six nines")     31.5 seconds     2.59 seconds
  • 5. High-Availability: Sounds good, we need that! (Same availability table as slide 4.) Can’t rely on a human to respond within a 5-minute window! Must use automation.
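The downtime figures in the table follow directly from the availability formula. A minimal sketch of that arithmetic, assuming a 365-day year and a 30-day month (the function and constant names are illustrative, not from the talk):

```python
from decimal import Decimal

# Assumptions: non-leap year and 30-day month, which reproduce the table.
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_MONTH = 30 * 24 * 3600


def downtime_seconds(availability_pct: str, period_seconds: int) -> Decimal:
    """Seconds of downtime permitted in a period at the given availability."""
    # Rearranged from availability = uptime / (uptime + downtime).
    unavailable_fraction = 1 - Decimal(availability_pct) / 100
    return unavailable_fraction * period_seconds


if __name__ == "__main__":
    for pct in ("99.9", "99.99", "99.999", "99.9999"):
        per_year = downtime_seconds(pct, SECONDS_PER_YEAR)
        per_month = downtime_seconds(pct, SECONDS_PER_MONTH)
        print(f"{pct}%: {per_year / 60:.2f} min/yr, {per_month:.2f} s/mo")
```

Decimal is used instead of float so that targets such as 99.999% are represented exactly.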
  • 6. Happens to the best
      - 2.5 Hours Down, September 23, 2010: “...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
      - 11 Hours Down, October 4, 2010: “...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”
      - Hours, November 14, 2010: “...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good.”
  • 7. Causes of Downtime: lack of best practice in
      - change control
      - monitoring of the relevant components
      - requirements and procurement
      - operations
      - avoidance of network failures
      - avoidance of internal application failures
      - avoidance of external services that fail
      - physical environment
      - network redundancy
      - technical solution of backup
      - process solution of backup
      - physical location
      - infrastructure redundancy
      - storage architecture redundancy
      Source: E. Marcus and H. Stern, Blueprints for high availability, second edition. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.
  • 8. Cloud / Non-Cloud: causes of downtime grouped into four categories
      - Data Persistence: storage architecture redundancy; technical solution of backup; process solution of backup
      - Change Control: change control
      - Operations: monitoring of the relevant components; requirements; procurement; operations; avoidance of internal app failures; avoidance of external services that fail
      - Datacenter: avoidance of network failures; physical environment; network redundancy; physical location; infrastructure redundancy
  • 9. Happens to the best (the three incidents from slide 6, labeled by cause)
      - 2.5 Hours Down, September 23, 2010: Database
      - 11 Hours Down, October 4, 2010: Database
      - Hours, November 14, 2010: Database, Change Control
  • 10. Data Persistence / Change Control / Operations / Datacenter (same grid as slide 8). Today: Data Persistence and Change Control, lessons learned @twilio
  • 11.
  • 12. Twilio provides web service APIs to automate Voice and SMS communications
      - Voice: Inbound Calls, Outbound Calls, Mobile/Browser VoIP
      - SMS: Send To/From Phone Numbers, Short Codes
      - Phone Numbers: Dynamically Buy Phone Numbers
      - Diagram labels: Carriers, Developer, End User
  • 13. 2011 2010 2009 3 6 20 70+
  • 14. 100x Growth in Tx/Day over 1 Year (chart: vertical axis X, 10X, 100X; horizontal axis 1 Year)
  • 15. 2009: 10 Servers. 2010: 10’s of Servers. 2011: 100’s of Servers.
  • 16. 2011 • 100’s of prod hosts in continuous operation • 80+ service types running in prod • 50+ prod database servers • Prod deployments several times/day across 7 engineering teams
  • 17. 2011 • Frameworks - PHP for frontend components - Python Twisted & gevent for async network services - Java for backend services • Storage technology - MySQL for core DB services - Redis for queuing and messaging
  • 18. Data persistence is hard (especially in the cloud)
  • 19. Data persistence is hard Data persistence is the hardest technical problem most scalable SaaS businesses face
  • 20. What is data persistence? Stuff that looks like this
  • 21. What is data persistence? Databases Queues Files
  • 22. Incoming Requests LB A A Tier 1 Data Q Q Persistence! SQL Tier 2 B B B B Files C C D D K/V Tier 3
  • 23. Why is persistence so hard? • Difficult to change structure - Huge inertia e.g., large schema migrations • Painful to recover from disk/node failures - “just boot a new node” doesn’t work • Woeful performance/scalability - I/O is huge bottleneck in modern servers (e.g. EC2) • Freak’in complex!!! - Atomic transactions/rollback, ACID, blah blah blah
  • 24. Difficult to Change Structure
      ALTER TABLE names DROP COLUMN Value
      Before:  Id  Name   Value        After:  Id  Name
               1   Bob    12                   1   Bob
               2   Jane   78                   2   Jane
               3   Steve  56                   3   Steve
               ...                             ...
      500 million rows: HOURS later...
      ‣ You live with data decisions for a long time
  • 25. Painful to Recover from Failures
      Diagram: DB Primary and DB Secondary, with write (W) and read (R) traffic. Data on secondary? How much data? R/W consistency?
      ‣ Because of complexity, failover is a human process
  • 26. Woeful Performance/Scalability
      ec2 m1.xlarge raid0 4x ephemeral
      Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
      sda1       0.00    0.00    0.00     0.00   0.00   0.00      0.00      0.00   0.00   0.00   0.00
      sdb      169.31  111.88   57.43   469.31   0.90   2.25     12.24      2.29   4.36   1.12  59.01
      sdc      178.22  110.89   59.41   396.04   0.93   1.98     13.08      1.58   3.50   1.18  53.56
      sdd      145.54  102.97   50.50   384.16   0.78   1.90     12.63      1.00   2.34   1.03  44.85
      sde      166.34   95.05   54.46   337.62   0.85   1.69     13.27      1.12   2.84   1.22  47.92
      md0        0.00    0.00  880.20  2007.92   3.44   7.82      7.99      0.00   0.00   0.00   0.00
      ~10 MB/s write
      ‣ Poor I/O on cloud today, 100x slower than real HW
  • 27. Woeful Performance/Scalability DB DB DB DB DB DB ‣ Difficult to horizontally scale in the cloud
  • 28. @!#$%^&* Complex
      • Incredibly complex
        - Billion knobs and buttons
        - Whole companies exist just to tune DB’s
      • Lots of consistency/transactional models
      • Multi-region data is unsolved
        - Facebook and Google struggle
      Database status output shown alongside (lines truncated on the slide):
        BUFFER POOL AND MEMORY
        ----------------------
        Total memory allocated 11655168000; in configuration
        Internal hash tables (constant factor
            Adaptive hash index 223758224 (179
            Page hash           11248264
            Dictionary cache    45048690 (449
            File system         84400 (82672
            Lock system         28180376 (281
            Recovery system     0 (0 + 0)
            Threads             428608 (406
        Dictionary memory allocated 57346
        Buffer pool size        693759
        Buffer pool size, bytes 11366547456
        Free buffers            1
        Database pages          691085
        Old database pages      255087
        Modified db pages       326490
        Pending reads 0
        Pending writes: LRU 0, flush list 0, s
        Pages made young 497782847, not young
        24.78 youngs/s, 0.00 non-youngs/s
        Pages read 447257683, created 16982810
        24.82 reads/s, 1.14 creates/s, 33.36 w
        Buffer pool hit rate 993 / 1000, young
  • 29. Deep breath, step back Think about each problem (use @twilio examples) • Software that runs in the cloud • Open source
  • 30. 1 Difficult to Change Structure • Don’t have structure - key/value databases (SimpleDB, Cassandra) - document-oriented databases (CouchDB, MongoDB) • Don’t store a lot of data...
  • 31. 1 Don’t Store Stuff • Outsource data as much as possible • But NOT to your customers
  • 32. 1 Don’t Store Stuff • Aggressively archive and move data offline S3/SimpleDB ~500M Rows (keep indices in memory) Build UX that supports longer/restricted access times to older data
  • 33. 1 Don’t Store Stuff • Avoid stateful systems/architectures where possible. Diagram: Browser → Web servers → Session DB; Cookie: SessionID
  • 34. 1 Don’t Store Stuff • Avoid stateful systems/architectures where possible. Store state in client browser. Diagram: Browser, Web servers, Session DB; Cookie: enc($session)
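One way to keep session state in the browser is to make the cookie tamper-evident so that web servers need no session lookup. A minimal sketch, assuming HMAC signing with Python's standard library: this signs the session but does not encrypt it, so it is weaker than the enc($session) on the slide, and the contents stay readable by the client. SECRET_KEY and both function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical server-side secret; in practice it would come from configuration.
SECRET_KEY = b"replace-with-a-real-secret"


def encode_session(session: dict) -> str:
    """Serialize the session and append an HMAC so tampering is detectable."""
    payload = base64.urlsafe_b64encode(
        json.dumps(session, sort_keys=True).encode("utf-8")
    ).decode("ascii")
    signature = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return payload + "." + signature


def decode_session(cookie: str):
    """Return the session dict, or None if the cookie is malformed or tampered with."""
    payload, separator, signature = cookie.rpartition(".")
    if not separator:
        return None
    expected = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Any web server holding the same key can validate the cookie, which removes the session database from the request path. A real deployment would also want encryption for confidentiality and an expiry timestamp inside the payload.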
  • 35. 2 Painful to Recover from Failures • Avoid single points of failure - E.g., master-master (active/active) - Complex to set up, complex failure modes - Sometimes it’s the only solution - Lots of great docs on web • Minimize number of stateful nodes, separate stateful & stateless components...
  • 36. 2 Separate Stateful and Stateless Components. Diagram: Req → App A → App B → App C. On failure, even if we boot a replacement App B, we lose data
  • 37. 2 Separate Stateful and Stateless Components. Diagram: Req → App A → App B → App C, each app with its own Queue. On failure, even if we boot a replacement App B and Queue, we lose data
  • 38. 2 Separate Stateful and Stateless Components. Keep connection open for whole app path! (hint: use evented framework) Diagram: Req → App A → App B → App C, with multiple instances of each app. Twilio’s SMS stack uses this approach. On failure, we don’t lose a single request
  • 39. 2 Painful to Recover from Failures • Avoid single points of failure - E.g., master-master (active/active) - Complex to set up, complex failure modes - Sometimes it’s the only solution - Lots of great blog docs on web • Minimize number of stateful nodes, separate stateful & stateless components • Build a data change control process to avoid mistakes and errors...
  • 40. • 100’s of prod hosts in continuous operation • 80+ service types running in prod • 50+ prod database servers • Prod deployments several times/day across 7 engineering teams Components deployed at different frequencies: Partially Continuous Deployment
  • 41. Deployment Frequency (Risk): 4 buckets, log scale
      - Website Content (CMS): 1000x
      - Website Code (PHP/Ruby etc.): 100x
      - REST API (Python/Java etc.): 10x
      - Big DB Schema (SQL): 1x
  • 42. Deployment Processes
      - Website Content: One Click
      - Website Code: CI Tests, One Click
      - REST API: CI Tests, Human Sign-off, One Click
      - Big DB Schema: CI Tests, Human Sign-off, Human Assisted Click
  • 43. 3 Woeful Performance/Scalability • If disk I/O is poor, avoid disk - Tune tune tune. Keep your indices in memory - Use an in-memory datastore e.g., Redis and configure replication such that if you have a master failure, you can always promote a slave • When disk I/O saturates, shard - LOTs of sharding info on web - Method of last resort, single point of failure becomes multiple single points of failure
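Sharding means routing each key to one of several database nodes. A minimal sketch of hash-based routing, assuming a fixed shard list (the shard names and the shard_for function are hypothetical, and this is one common scheme, not necessarily the one Twilio used):

```python
import hashlib

# Hypothetical shard identifiers; real ones would be connection settings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a key to a shard using a stable hash, so every host agrees on the route."""
    # hashlib is used instead of the built-in hash(), which is salted per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for account in ("account-1", "account-2", "account-3"):
        print(account, "->", shard_for(account))
```

Each shard is still a single point of failure for its own keys, which is the slide's caveat. Plain modulo routing also remaps most keys when the shard count changes, so resizing requires a data migration.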
  • 44. 4 @#$%^&* Complex
      • Bring the simplest tool to the job
        - Use a strictly consistent store only if you need it
        - If you don’t need HA, don’t add the complexity
      • There is no magic database. Decompose requirements, mix-and-match datastores as needed...
      Image caption: Magic Database. “Magic Database does it all. Consistency, Availability, Partition-tolerance, it's got all three.”
  • 45. 4 Twilio Data Lifecycle
      CREATE: name:foo status:INIT ret:0
      UPDATE: name:foo status:QUEUED ret:0
      UPDATE: name:foo status:GOING ret:0
      Final:  name:foo status:DONE ret:42
      Twilio Examples: Call, SMS, Conference. Other Examples: Order, Workflow, $
  • 46. 4 Twilio Data Lifecycle (same record sequence as slide 45: INIT → QUEUED → GOING → DONE), divided into In-Flight and Post-Flight phases
  • 47. 4 Twilio Data Lifecycle: Applications
      - In-Flight: Atomically update part of a workflow
      - Post-Flight: Billing, Log Access, Analytics, Reporting
  • 48. 4 Twilio Data Lifecycle: Properties (High-Availability)
      - In-Flight: Strict Consistency, Key/Value, ~20ms
      - Post-Flight: Eventual Consistency, Range Queries w/ Filters, ~200ms
  • 49. 4 Twilio Data Lifecycle Systems with very different access semantics Data Store A Data Store B In-Flight Post-Flight
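The in-flight/post-flight split can be illustrated with two in-memory stand-ins. A minimal sketch, assuming records move to the second store when they reach DONE (that hand-off point, the strict transition order, and the LifecycleStore class are assumptions, not details given in the slides):

```python
from dataclasses import dataclass, field

# Status order taken from the lifecycle slides: INIT -> QUEUED -> GOING -> DONE.
NEXT_STATUS = {"INIT": "QUEUED", "QUEUED": "GOING", "GOING": "DONE"}


@dataclass
class LifecycleStore:
    # Stand-in for "Data Store A": key/value access to a small set of live records.
    in_flight: dict = field(default_factory=dict)
    # Stand-in for "Data Store B": append-only history for logs, billing, reporting.
    post_flight: list = field(default_factory=list)

    def create(self, name: str) -> None:
        if name in self.in_flight:
            raise KeyError(f"{name} already in flight")
        self.in_flight[name] = {"name": name, "status": "INIT", "ret": 0}

    def update(self, name: str, status: str, ret: int = 0) -> None:
        record = self.in_flight[name]
        if NEXT_STATUS.get(record["status"]) != status:
            raise ValueError(f"illegal transition {record['status']} -> {status}")
        record["status"] = status
        record["ret"] = ret
        if status == "DONE":
            # Finished records leave the in-flight store, keeping it small and fast.
            self.post_flight.append(self.in_flight.pop(name))


if __name__ == "__main__":
    store = LifecycleStore()
    store.create("foo")
    store.update("foo", "QUEUED")
    store.update("foo", "GOING")
    store.update("foo", "DONE", ret=42)
    print(store.in_flight, store.post_flight)
```

Because the in-flight store only ever holds unfinished work, it can be a strictly consistent key/value system, while the much larger post-flight history can live in an eventually consistent store built for range queries.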
  • 50. 4 In-Flight / Post-Flight
      - In-Flight store: Strict Consistency, Key/Value, ~20ms, 10k-1M
      - Post-Flight, fed through queues (Q):
        - Logs (REST API): Eventual consistency, Range queries, Filtered queries, ~200ms, Billions
        - Reporting: Eventual consistency, Arbitrary queries, High Latency, Billions
        - Billing: Idempotent Aggregation, Key/Value, Billions
  • 51. 4 In-Flight / Post-Flight: candidate systems
      - In-Flight store: MySQL, PostgreSQL, Redis, NDB
      - Post-Flight, fed through queues (Q):
        - Logs (REST API): SQL Sharded, Cassandra/Acunu, MongoDb, Riak, CouchDb
        - Reporting: Hadoop
        - Billing: SQL Sharded, Redis
  • 52. Data
  • 53. Why is persistence so hard?
      • Difficult to change structure - Huge inertia e.g., large schema migrations
        Answer: Don’t store stuff! Go schema-less
      • Painful to recover from disk/node failures - “just boot a new node” doesn’t work
        Answer: Separate stateful/stateless. Change control processes
      • Woeful performance/scalability - I/O is huge bottleneck in modern servers (e.g. EC2)
        Answer: Memory FTW. Shard
      • Freak’in complex!!! - Atomic transactions/rollback, ACID, blah blah blah
        Answer: Decompose data lifecycle. Minimize complexity
  • 54. Incoming Requests LB A A Tier 1 Q Q SQL Tier 2 B B B B Files C C D D K/V Tier 3
  • 55. Incoming Requests → LB → Tier 1 (A) → Tier 2 (Q, SQL, B) → Tier 3 (C, D), annotated with the changes made:
      - Idempotent request path
      - Aggregate into HA queues
      - Master-Master MySQL
      - Move K/V to SimpleDB w/ local cache
      - Move file store to S3
  • 56. Data Persistence / Change Control / Operations / Datacenter (same grid as slide 8). HA is Hard
  • 57. SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD: Focus on data
      - How you store it
      - Where you store it
      - When you can delete it
      - Control changes to it
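An idempotent request path means a retried request has the same effect as a single delivery. A minimal sketch, assuming each request carries a caller-supplied unique ID (the IdempotentHandler class and the in-memory result table are hypothetical; a real system would keep that table in a durable, highly available store):

```python
class IdempotentHandler:
    """Apply each request at most once, replaying the stored result on retries."""

    def __init__(self, apply):
        self._apply = apply   # the side-effecting operation
        self._results = {}    # request_id -> result of the first successful apply

    def handle(self, request_id: str, payload):
        if request_id in self._results:
            # Duplicate delivery: return the earlier result, do not re-apply.
            return self._results[request_id]
        result = self._apply(payload)
        self._results[request_id] = result
        return result


if __name__ == "__main__":
    balance = {"total": 0}

    def charge(amount):
        balance["total"] += amount
        return balance["total"]

    handler = IdempotentHandler(charge)
    handler.handle("req-1", 5)
    handler.handle("req-1", 5)   # retry after a timeout, charged only once
    handler.handle("req-2", 7)
    print(balance["total"])
```

With this property, an upstream tier can safely resend a request when a downstream node fails mid-flight, which is what allows stateless nodes to be replaced without losing or double-counting work.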
  • 58. Open Problems... In-Flight Post-Flight Massively scalable HA Logs Q range queries queue (REST API) filterable ~200ms Simple multi-AZ Simple multi-region Q Reporting HA Hadoop consistent Hadoop K/V Massively Q Billing scalable aggregator