Fault Tolerance in a
High Volume, Distributed System
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




Dozens of dependencies.

   One going down takes everything down.


(99.99%)^30 = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures

            2+ hours downtime/month
even if all dependencies have excellent uptime.

          Reality is generally worse.
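
The arithmetic on this slide can be reproduced with a few lines of Java. This is a minimal sketch, assuming that every request touches all 30 dependencies, that each dependency is available 99.99% of the time, and that failures are independent. The class and method names are hypothetical.

```java
final class CompoundAvailability {
    /** Availability of a request that needs every one of `dependencies` calls to succeed. */
    static double combined(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);
        double failureRate = 1.0 - uptime;
        System.out.printf("combined uptime: %.2f%%%n", uptime * 100);
        System.out.printf("failures per 1 billion requests: %,.0f%n", failureRate * 1e9);
        System.out.printf("downtime per 30-day month: %.2f hours%n", failureRate * 30 * 24);
    }
}
```

The slide rounds the failure rate to 0.3%, which gives 3,000,000 failures per billion requests; the unrounded figure is slightly under 3 million.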


No single dependency should
 take down the entire app.

          Fail fast.
         Fail silent.
         Fallback.

        Shed load.



Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker



Semaphores (Tryable): Limited Concurrency


TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
        return executeCommand();
    } finally {
        executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}
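
The same pattern can be tried outside Netflix's code base with the standard `java.util.concurrent.Semaphore`. The sketch below is an illustration only: `SemaphoreGuard` is a hypothetical helper, not the `TryableSemaphore` from the slide, and it leaves out the circuit-breaker bookkeeping.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: caps concurrent calls and returns a fallback when no permit is free. */
final class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T call(Supplier<T> command, Supplier<T> fallback) {
        // tryAcquire() returns immediately; it never blocks the caller.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return command.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SemaphoreGuard guard = new SemaphoreGuard(1);
        // The outer call holds the only permit, so the nested call is rejected.
        Supplier<String> inner = () -> guard.call(() -> "inner ran", () -> "inner fallback");
        String result = guard.call(inner, () -> "outer fallback");
        System.out.println(result); // prints "inner fallback"
    }
}
```

Because the caller's own thread runs the command, a semaphore limits concurrency but cannot abandon a call that hangs; that is what the separate-thread option addresses.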




Separate Threads: Limited Concurrency


try {
    if (!threadPool.isQueueSpaceAvailable()) {
        // We are at the property-defined max, so throw RejectedExecutionException to simulate
        // reaching the real max and go through the same code path and behavior.
        throw new RejectedExecutionException(
                "Rejected command because thread-pool queueSize is at rejection threshold.");
    }
    // ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}
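
A self-contained way to see this rejection path is a `ThreadPoolExecutor` with a bounded queue. This is a sketch under assumptions: it relies on the executor's own `AbortPolicy` rejection rather than the property-defined threshold check on the slide, and `BoundedPoolDemo` and `submitOrFallback` are hypothetical names.

```java
import java.util.concurrent.*;

final class BoundedPoolDemo {
    /** Submits to the pool; returns an already-completed fallback future when the pool rejects. */
    static Future<String> submitOrFallback(ExecutorService pool, Callable<String> command, String fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            return CompletableFuture.completedFuture(fallback);
        }
    }

    public static void main(String[] args) throws Exception {
        // One worker thread and a queue of one: a third concurrent command is rejected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch release = new CountDownLatch(1);
        Callable<String> slow = () -> { release.await(); return "real"; };

        Future<String> first = submitOrFallback(pool, slow, "fallback");  // runs on the worker
        Future<String> second = submitOrFallback(pool, slow, "fallback"); // waits in the queue
        Future<String> third = submitOrFallback(pool, slow, "fallback");  // rejected
        System.out.println(third.get()); // prints "fallback"

        release.countDown();
        System.out.println(first.get() + " " + second.get()); // prints "real real"
        pool.shutdown();
    }
}
```

Keeping the pool and queue small is the point: a slow dependency can fill at most its own pool, and callers get a fallback instead of queuing without limit.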



Separate Threads: Timeout

                                  Override of Future.get()

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout =
            getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(
            System.currentTimeMillis() - startTime);

        // retrieve the fallback
        return getFallback();
    }
}
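
The timeout behaviour can be sketched with a plain `Future` from the JDK. The helper below is hypothetical and makes one choice the slide does not show: it cancels the timed-out task so the worker thread is interrupted. Metrics reporting is omitted.

```java
import java.util.concurrent.*;

final class TimeoutDemo {
    /** Waits at most timeoutMs for the result; on timeout cancels the task and returns the fallback. */
    static <T> T getOrFallback(Future<T> future, long timeoutMs, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it does not keep running
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> slow = pool.submit(() -> { Thread.sleep(5_000); return "real"; });
        System.out.println(getOrFallback(slow, 100, "fallback")); // prints "fallback"
        pool.shutdownNow();
    }
}
```

The caller gives up after the timeout regardless of what the client library is doing, which is only possible because the work runs on a separate thread.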




Circuit Breaker




if (circuitBreaker.allowRequest()) {
    return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}
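
The slide shows only the call site, not how `allowRequest()` decides. One simple policy is sketched below as an assumption, not as Netflix's implementation: trip after a fixed number of consecutive failures, then let requests through again once a sleep window has passed. All names and thresholds are hypothetical, and the clock is injected so the behaviour can be exercised without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical minimal breaker: trips after N consecutive failures, retries after a sleep window. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMs;
    private final LongSupplier clock;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;
    private boolean open = false;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMs, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        // Closed: allow. Open: allow trial requests only once the sleep window has passed.
        return !open || clock.getAsLong() - openedAt >= sleepWindowMs;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong(); // (re)start the sleep window
        }
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuit: do not touch the dependency
        }
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        long[] now = {0L};
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1_000, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("dependency down"); };
        breaker.execute(failing, () -> "fallback");
        breaker.execute(failing, () -> "fallback"); // second failure trips the breaker
        System.out.println(breaker.allowRequest()); // prints "false"
        now[0] = 1_000;                             // sleep window elapsed
        System.out.println(breaker.execute(() -> "real", () -> "fallback")); // prints "real"
    }
}
```

This sketch lets every request through once the sleep window ends; a production breaker would typically admit a single trial request and base the trip decision on an error percentage rather than a consecutive count.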




Netflix uses all 4 in combination




Tryable semaphores for “trusted” clients and fallbacks

       Separate threads for “untrusted” clients

 Aggressive timeouts on threads and network calls
             to “give up and move on”

       Circuit breakers as the “release valve”



Benefits of Separate Threads

         Protection from client libraries

    Lower risk to accept new/updated clients

           Quick recovery from failure

             Client misconfiguration

Client service performance characteristic changes

              Built-in concurrency
Drawbacks of Separate Threads

    Some computational overhead

Load on machine can be pushed too far

                 ...

    Benefits outweigh drawbacks
    when clients are “untrusted”


Visualizing Circuits in Realtime
      (generally sub-second latency)




      Video available at
https://vimeo.com/33576628




Rolling 10 second counter – 1 second granularity

      Median Mean 90th 99th 99.5th

       Latent Error Timeout Rejected

                 Error Percentage =
            (error + timeout + rejected) /
(success + latent success + error + timeout + rejected)
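
A rolling 10-second counter with 1-second granularity can be approximated with ten buckets that are reused as time advances. The sketch below is an assumption about one way to build it, not the implementation behind the dashboard: it collapses the outcome types into two groups, and the class name is hypothetical. Timestamps are passed in explicitly and are expected to be non-decreasing.

```java
/** Hypothetical rolling counter: 10 one-second buckets, each reused once its data is stale. */
final class RollingErrorRate {
    private static final int BUCKETS = 10;
    private final long[] bucketSecond = new long[BUCKETS]; // which second each bucket holds
    private final long[] good = new long[BUCKETS];         // success + latent success
    private final long[] bad = new long[BUCKETS];          // error + timeout + rejected

    private int bucketFor(long nowMs) {
        long second = nowMs / 1000;
        int i = (int) (second % BUCKETS);
        if (bucketSecond[i] != second) { // bucket still holds counts from an older second
            bucketSecond[i] = second;
            good[i] = 0;
            bad[i] = 0;
        }
        return i;
    }

    synchronized void record(long nowMs, boolean failed) {
        int i = bucketFor(nowMs);
        if (failed) bad[i]++; else good[i]++;
    }

    /** Failed outcomes divided by all outcomes over the last 10 seconds, as a percentage. */
    synchronized double errorPercentage(long nowMs) {
        long second = nowMs / 1000;
        long g = 0, b = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (second - bucketSecond[i] < BUCKETS) { // bucket is still inside the window
                g += good[i];
                b += bad[i];
            }
        }
        long total = g + b;
        return total == 0 ? 0.0 : 100.0 * b / total;
    }

    public static void main(String[] args) {
        RollingErrorRate stats = new RollingErrorRate();
        for (int i = 0; i < 3; i++) stats.record(0, false);
        stats.record(0, true);
        System.out.println(stats.errorPercentage(5_000));  // prints "25.0"
        System.out.println(stats.errorPercentage(10_000)); // prints "0.0" (window has moved on)
    }
}
```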

Netflix DependencyCommand Implementation





              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response




Rolling Number
Realtime Stats and
 Decision Making




Request Collapsing
Take advantage of resiliency to improve efficiency




Fail fast.
Fail silent.
Fallback.

Shed load.




Questions & More Information


Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html



         Making the Netflix API More Resilient
   http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html




                        Ben Christensen
                          @benjchristensen
                 http://www.linkedin.com/in/benjchristensen



                                                                              49

Más contenido relacionado

La actualidad más candente

Writing Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason LeeWriting Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason Leejaxconf
 
你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能maclean liu
 
Xen server 6.0 xe command reference (1.1)
Xen server 6.0 xe command reference (1.1)Xen server 6.0 xe command reference (1.1)
Xen server 6.0 xe command reference (1.1)Timote Lima
 
pstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databasepstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databaseRiyaj Shamsudeen
 
A close encounter_with_real_world_and_odd_perf_issues
A close encounter_with_real_world_and_odd_perf_issuesA close encounter_with_real_world_and_odd_perf_issues
A close encounter_with_real_world_and_odd_perf_issuesRiyaj Shamsudeen
 
Performance tuning a quick intoduction
Performance tuning   a quick intoductionPerformance tuning   a quick intoduction
Performance tuning a quick intoductionRiyaj Shamsudeen
 
The Next Step in AS3 Framework Evolution
The Next Step in AS3 Framework EvolutionThe Next Step in AS3 Framework Evolution
The Next Step in AS3 Framework EvolutionFITC
 
Tracing Parallel Execution (UKOUG 2006)
Tracing Parallel Execution (UKOUG 2006)Tracing Parallel Execution (UKOUG 2006)
Tracing Parallel Execution (UKOUG 2006)Doug Burns
 
370410176 moshell-commands
370410176 moshell-commands370410176 moshell-commands
370410176 moshell-commandsnanker phelge
 
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321maclean liu
 
Introduction to Parallel Execution
Introduction to Parallel ExecutionIntroduction to Parallel Execution
Introduction to Parallel ExecutionDoug Burns
 
Riyaj real world performance issues rac focus
Riyaj real world performance issues rac focusRiyaj real world performance issues rac focus
Riyaj real world performance issues rac focusRiyaj Shamsudeen
 
Groovy 2.0 - Devoxx France 2012
Groovy 2.0 - Devoxx France 2012Groovy 2.0 - Devoxx France 2012
Groovy 2.0 - Devoxx France 2012Guillaume Laforge
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档YUCHENG HU
 
Groovy update - S2GForum London 2011 - Guillaume Laforge
Groovy update - S2GForum London 2011 - Guillaume LaforgeGroovy update - S2GForum London 2011 - Guillaume Laforge
Groovy update - S2GForum London 2011 - Guillaume LaforgeGuillaume Laforge
 
A deep dive about VIP,HAIP, and SCAN
A deep dive about VIP,HAIP, and SCAN A deep dive about VIP,HAIP, and SCAN
A deep dive about VIP,HAIP, and SCAN Riyaj Shamsudeen
 
JavaPerformanceChapter_3
JavaPerformanceChapter_3JavaPerformanceChapter_3
JavaPerformanceChapter_3Saurav Basu
 
Varnish presentation for the Symfony Zaragoza user group
Varnish presentation for the Symfony Zaragoza user groupVarnish presentation for the Symfony Zaragoza user group
Varnish presentation for the Symfony Zaragoza user groupJorge Nerín
 

La actualidad más candente (20)

Virtual machine re building
Virtual machine re buildingVirtual machine re building
Virtual machine re building
 
Writing Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason LeeWriting Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason Lee
 
Rac introduction
Rac introductionRac introduction
Rac introduction
 
你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能
 
Xen server 6.0 xe command reference (1.1)
Xen server 6.0 xe command reference (1.1)Xen server 6.0 xe command reference (1.1)
Xen server 6.0 xe command reference (1.1)
 
pstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle databasepstack, truss etc to understand deeper issues in Oracle database
pstack, truss etc to understand deeper issues in Oracle database
 
A close encounter_with_real_world_and_odd_perf_issues
A close encounter_with_real_world_and_odd_perf_issuesA close encounter_with_real_world_and_odd_perf_issues
A close encounter_with_real_world_and_odd_perf_issues
 
Performance tuning a quick intoduction
Performance tuning   a quick intoductionPerformance tuning   a quick intoduction
Performance tuning a quick intoduction
 
The Next Step in AS3 Framework Evolution
The Next Step in AS3 Framework EvolutionThe Next Step in AS3 Framework Evolution
The Next Step in AS3 Framework Evolution
 
Tracing Parallel Execution (UKOUG 2006)
Tracing Parallel Execution (UKOUG 2006)Tracing Parallel Execution (UKOUG 2006)
Tracing Parallel Execution (UKOUG 2006)
 
370410176 moshell-commands
370410176 moshell-commands370410176 moshell-commands
370410176 moshell-commands
 
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
 
Introduction to Parallel Execution
Introduction to Parallel ExecutionIntroduction to Parallel Execution
Introduction to Parallel Execution
 
Riyaj real world performance issues rac focus
Riyaj real world performance issues rac focusRiyaj real world performance issues rac focus
Riyaj real world performance issues rac focus
 
Groovy 2.0 - Devoxx France 2012
Groovy 2.0 - Devoxx France 2012Groovy 2.0 - Devoxx France 2012
Groovy 2.0 - Devoxx France 2012
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档
 
Groovy update - S2GForum London 2011 - Guillaume Laforge
Groovy update - S2GForum London 2011 - Guillaume LaforgeGroovy update - S2GForum London 2011 - Guillaume Laforge
Groovy update - S2GForum London 2011 - Guillaume Laforge
 
A deep dive about VIP,HAIP, and SCAN
A deep dive about VIP,HAIP, and SCAN A deep dive about VIP,HAIP, and SCAN
A deep dive about VIP,HAIP, and SCAN
 
JavaPerformanceChapter_3
JavaPerformanceChapter_3JavaPerformanceChapter_3
JavaPerformanceChapter_3
 
Varnish presentation for the Symfony Zaragoza user group
Varnish presentation for the Symfony Zaragoza user groupVarnish presentation for the Symfony Zaragoza user group
Varnish presentation for the Symfony Zaragoza user group
 

Similar a Fault Tolerance in a High Volume, Distributed System

Fork and join framework
Fork and join frameworkFork and join framework
Fork and join frameworkMinh Tran
 
Java Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsJava Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsCarol McDonald
 
Concurrent Programming in Java
Concurrent Programming in JavaConcurrent Programming in Java
Concurrent Programming in JavaRuben Inoto Soto
 
Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.wellD
 
Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Michele Giacobazzi
 
Java - Concurrent programming - Thread's advanced concepts
Java - Concurrent programming - Thread's advanced conceptsJava - Concurrent programming - Thread's advanced concepts
Java - Concurrent programming - Thread's advanced conceptsRiccardo Cardin
 
Virtualizing Java in Java (jug.ru)
Virtualizing Java in Java (jug.ru)Virtualizing Java in Java (jug.ru)
Virtualizing Java in Java (jug.ru)aragozin
 
Java synchronizers
Java synchronizersJava synchronizers
Java synchronizersts_v_murthy
 
Se lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
Se lancer dans l'aventure microservices avec Spring Cloud - Julien RoySe lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
Se lancer dans l'aventure microservices avec Spring Cloud - Julien Royekino
 
Nanocloud cloud scale jvm
Nanocloud   cloud scale jvmNanocloud   cloud scale jvm
Nanocloud cloud scale jvmaragozin
 
10 Typical Problems in Enterprise Java Applications
10 Typical Problems in Enterprise Java Applications10 Typical Problems in Enterprise Java Applications
10 Typical Problems in Enterprise Java ApplicationsEberhard Wolff
 
Java 5 concurrency
Java 5 concurrencyJava 5 concurrency
Java 5 concurrencypriyank09
 

Similar a Fault Tolerance in a High Volume, Distributed System (20)

JavaCro'15 - Spring @Async - Dragan Juričić
JavaCro'15 - Spring @Async - Dragan JuričićJavaCro'15 - Spring @Async - Dragan Juričić
JavaCro'15 - Spring @Async - Dragan Juričić
 
Fork and join framework
Fork and join frameworkFork and join framework
Fork and join framework
 
Curator intro
Curator introCurator intro
Curator intro
 
Java Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and TrendsJava Concurrency, Memory Model, and Trends
Java Concurrency, Memory Model, and Trends
 
Concurrent Programming in Java
Concurrent Programming in JavaConcurrent Programming in Java
Concurrent Programming in Java
 
Resilience mit Hystrix
Resilience mit HystrixResilience mit Hystrix
Resilience mit Hystrix
 
Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.
 
Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.Repetition is bad, repetition is bad.
Repetition is bad, repetition is bad.
 
Concurrency-5.pdf
Concurrency-5.pdfConcurrency-5.pdf
Concurrency-5.pdf
 
Java - Concurrent programming - Thread's advanced concepts
Java - Concurrent programming - Thread's advanced conceptsJava - Concurrent programming - Thread's advanced concepts
Java - Concurrent programming - Thread's advanced concepts
 
Virtualizing Java in Java (jug.ru)
Virtualizing Java in Java (jug.ru)Virtualizing Java in Java (jug.ru)
Virtualizing Java in Java (jug.ru)
 
Java Concurrency
Java ConcurrencyJava Concurrency
Java Concurrency
 
Java synchronizers
Java synchronizersJava synchronizers
Java synchronizers
 
Resilience with Hystrix
Resilience with HystrixResilience with Hystrix
Resilience with Hystrix
 
Se lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
Se lancer dans l'aventure microservices avec Spring Cloud - Julien RoySe lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
Se lancer dans l'aventure microservices avec Spring Cloud - Julien Roy
 
Nanocloud cloud scale jvm
Nanocloud   cloud scale jvmNanocloud   cloud scale jvm
Nanocloud cloud scale jvm
 
Fault tolerance made easy
Fault tolerance made easyFault tolerance made easy
Fault tolerance made easy
 
10 Typical Problems in Enterprise Java Applications
10 Typical Problems in Enterprise Java Applications10 Typical Problems in Enterprise Java Applications
10 Typical Problems in Enterprise Java Applications
 
Celery
CeleryCelery
Celery
 
Java 5 concurrency
Java 5 concurrencyJava 5 concurrency
Java 5 concurrency
 

Último

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Último (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Fault Tolerance in a High Volume, Distributed System

  • 1. Fault Tolerance in a High Volume, Distributed System. Ben Christensen, Software Engineer – API Platform at Netflix. @benjchristensen, http://www.linkedin.com/in/benjchristensen
  • 2. Dozens of dependencies. One going down takes everything down. 99.99%^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse.
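
The arithmetic on slide 2 can be reproduced with a few lines of Java. This is a minimal sketch that assumes 30 dependencies failing independently, each with 99.99% uptime, and a 30-day month; the class and method names are hypothetical and not part of the deck.

```java
// Assumption: 30 independent dependencies, each available 99.99% of the time.
final class CompoundAvailability {

    /** Uptime of a request path that needs every one of the dependencies to be up. */
    static double combinedUptime(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    /** Expected number of failed requests out of totalRequests at the given uptime. */
    static long expectedFailures(double uptime, long totalRequests) {
        return Math.round((1.0 - uptime) * totalRequests);
    }

    /** Expected downtime in hours over a month with the given number of days. */
    static double downtimeHoursPerMonth(double uptime, int daysInMonth) {
        return (1.0 - uptime) * daysInMonth * 24;
    }

    public static void main(String[] args) {
        double uptime = combinedUptime(0.9999, 30);
        System.out.printf("combined uptime: %.2f%%%n", uptime * 100);
        System.out.println("failures per 1 billion requests: "
                + expectedFailures(uptime, 1_000_000_000L));
        System.out.printf("downtime: %.2f hours/month%n",
                downtimeHoursPerMonth(uptime, 30));
    }
}
```

The results land close to the rounded figures on the slide: roughly 99.7% combined uptime, about 3 million failures per billion requests, and a little over 2 hours of downtime per month.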
  • 6. No single dependency should take down the entire app. Fail fast. Fail silent. Fallback. Shed load.
  • 7. Options: Aggressive Network Timeouts; Semaphores (Tryable); Separate Threads; Circuit Breaker
  • 10. Semaphores (Tryable): Limited Concurrency

        TryableSemaphore executionSemaphore = getExecutionSemaphore();
        // acquire a permit
        if (executionSemaphore.tryAcquire()) {
            try {
                return executeCommand();
            } finally {
                executionSemaphore.release();
            }
        } else {
            circuitBreaker.markSemaphoreRejection();
            // permit not available so return fallback
            return getFallback();
        }
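
The tryable-semaphore pattern can be sketched with the standard `java.util.concurrent.Semaphore`. `TryableSemaphore` on the slide is Netflix's own type; the `SemaphoreGuard` class below is a hypothetical stand-in that only illustrates the non-blocking acquire, the release in `finally`, and the immediate fallback. It omits the circuit-breaker bookkeeping (`markSemaphoreRejection()`).

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical wrapper: limits concurrent executions of a command and
// returns a fallback immediately when no permit is available.
final class SemaphoreGuard<T> {
    private final Semaphore permits;
    private final Supplier<T> command;
    private final Supplier<T> fallback;

    SemaphoreGuard(int maxConcurrent, Supplier<T> command, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.command = command;
        this.fallback = fallback;
    }

    T execute() {
        // tryAcquire() never blocks: it returns false when all permits are taken.
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                // Always hand the permit back, even if the command throws.
                permits.release();
            }
        }
        // Permit not available, so shed load by returning the fallback.
        return fallback.get();
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because the caller's own thread runs the command, this approach adds very little overhead, but it cannot time out a command that hangs.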
  • 14. Separate Threads: Limited Concurrency

        try {
            if (!threadPool.isQueueSpaceAvailable()) {
                // we are at the property defined max so want to throw the RejectedExecutionException to simulate
                // reaching the real max and go through the same codepath and behavior
                throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold.");
            }
            // ... define Callable that performs executeCommand() ...
            // submit the work to the thread-pool
            return threadPool.submit(command);
        } catch (RejectedExecutionException e) {
            circuitBreaker.markThreadPoolRejection();
            // rejected so return fallback
            return getFallback();
        }
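
A similar effect can be sketched with a plain `ThreadPoolExecutor` and a bounded queue. This is not the deck's implementation: `isQueueSpaceAvailable()` on the slide is a Netflix-specific check against a configurable threshold, whereas the hypothetical `ThreadPoolGuard` below simply relies on the executor's built-in `AbortPolicy` to throw `RejectedExecutionException` once the threads and queue are full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper: one bounded thread pool per dependency.
final class ThreadPoolGuard {
    private final ThreadPoolExecutor pool;

    ThreadPoolGuard(int threads, int queueSize) {
        // Fixed number of threads plus a bounded queue; AbortPolicy rejects
        // work with RejectedExecutionException when both are full.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            // Submit the work to the thread pool.
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            // Rejected, so return an already-completed future holding the fallback.
            return CompletableFuture.completedFuture(fallback);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With one thread and a queue of one, a third concurrent submission is rejected and receives the fallback without waiting.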
  • 17. Separate Threads: Timeout. Override of Future.get()

        public K get() throws CancellationException, InterruptedException, ExecutionException {
            try {
                long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
                return get(timeout, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // report timeout failure
                circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);
                // retrieve the fallback
                return getFallback();
            }
        }
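
The same "give up and move on" behavior can be sketched as a standalone helper around any `Future`. The `TimedCall` class and its method are hypothetical names. Cancelling the timed-out future is an addition that the slide does not show, and the timeout metric (`markTimeout`) is left out.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: wait a bounded time for a result, then fall back.
final class TimedCall {

    static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            // Block for at most timeoutMillis waiting for the worker thread.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Assumption: ask the worker to stop; the slide code does not do this.
            future.cancel(true);
            // The caller moves on with the fallback instead of waiting.
            return fallback;
        }
    }
}
```

This only works when the command runs on a separate thread, which is why the deck pairs timeouts with thread isolation.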
  • 21. Circuit Breaker

        if (circuitBreaker.allowRequest()) {
            return executeCommand();
        } else {
            // short-circuit and go directly to fallback
            circuitBreaker.markShortCircuited();
            return getFallback();
        }
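
The slides do not show how `allowRequest()` reaches its decision. The sketch below is one simple possibility under stated assumptions: the circuit opens after a fixed number of consecutive failures and lets requests through again once a sleep window has elapsed. The class name, the threshold, the sleep window, and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical breaker: opens after N consecutive failures, then allows
// requests again once sleepWindowMillis has passed since it opened.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clockMillis;

    private int consecutiveFailures = 0;
    private long openedAtMillis = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clockMillis = clockMillis;
    }

    synchronized boolean allowRequest() {
        if (openedAtMillis < 0) {
            return true; // closed: everything passes
        }
        // Open: short-circuit until the sleep window has elapsed, then let
        // requests through so that a success can close the circuit.
        return clockMillis.getAsLong() - openedAtMillis >= sleepWindowMillis;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        openedAtMillis = -1; // close the circuit
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            // Open (or re-open) the circuit and restart the sleep window.
            openedAtMillis = clockMillis.getAsLong();
        }
    }
}
```

A caller would wrap each command as on the slide: check `allowRequest()`, run the command and record success or failure, or go straight to the fallback when the circuit is open. Passing `System::currentTimeMillis` as the clock gives wall-clock behavior.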
  • 24. Netflix uses all 4 in combination
  • 26. Tryable semaphores for “trusted” clients and fallbacks. Separate threads for “untrusted” clients. Aggressive timeouts on threads and network calls to “give up and move on”. Circuit breakers as the “release valve”.
  • 30. Benefits of Separate Threads: Protection from client libraries; Lower risk to accept new/updated clients; Quick recovery from failure; Client misconfiguration; Client service performance characteristic changes; Built-in concurrency
  • 31. Drawbacks of Separate Threads: Some computational overhead; Load on machine can be pushed too far; ... Benefits outweigh drawbacks when clients are “untrusted”.
  • 33. Visualizing Circuits in Realtime (generally sub-second latency). Video available at https://vimeo.com/33576628
  • 34. Rolling 10 second counter – 1 second granularity. Median, Mean, 90th, 99th, 99.5th. Latent, Error, Timeout, Rejected. Error Percentage = (error + timeout + rejected) / (success + latent success + error + timeout + rejected).
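
The rolling counter and the error-percentage formula on slide 34 can be sketched as ten one-second buckets that are reused as time advances. This is an illustration only, not Netflix's implementation; the `RollingErrorStats` class, the `Event` names, and the injected millisecond clock are assumptions.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

// Hypothetical rolling counter: 10 buckets of 1 second each.
final class RollingErrorStats {
    enum Event { SUCCESS, LATENT_SUCCESS, ERROR, TIMEOUT, REJECTED }

    private static final int BUCKETS = 10;
    private final long[][] counts = new long[BUCKETS][Event.values().length];
    private final long[] bucketSecond = new long[BUCKETS]; // which second each bucket holds
    private final LongSupplier clockMillis; // assumed to return non-negative values

    RollingErrorStats(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        Arrays.fill(bucketSecond, -1L); // -1 marks an unused bucket
    }

    synchronized void record(Event event) {
        long second = clockMillis.getAsLong() / 1000;
        int index = (int) (second % BUCKETS);
        if (bucketSecond[index] != second) {
            // The bucket still holds data from an older second: reset and reuse it.
            Arrays.fill(counts[index], 0L);
            bucketSecond[index] = second;
        }
        counts[index][event.ordinal()]++;
    }

    synchronized long sum(Event event) {
        long nowSecond = clockMillis.getAsLong() / 1000;
        long total = 0;
        for (int i = 0; i < BUCKETS; i++) {
            // Only count buckets that fall inside the last 10 seconds.
            if (bucketSecond[i] >= 0 && nowSecond - bucketSecond[i] < BUCKETS) {
                total += counts[i][event.ordinal()];
            }
        }
        return total;
    }

    /** (error + timeout + rejected) / (success + latent success + error + timeout + rejected), as a percentage. */
    synchronized double errorPercentage() {
        long failed = sum(Event.ERROR) + sum(Event.TIMEOUT) + sum(Event.REJECTED);
        long total = failed + sum(Event.SUCCESS) + sum(Event.LATENT_SUCCESS);
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }
}
```

A circuit breaker could consult `errorPercentage()` inside `allowRequest()` to decide when to trip; the deck does not say which threshold is used.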
  • 41. Netflix DependencyCommand Implementation. Fallbacks: Cache; Eventual Consistency; Stubbed Data; Empty Response
  • 44. Rolling Number: Realtime Stats and Decision Making
  • 45. Request Collapsing: Take advantage of resiliency to improve efficiency
  • 49. Questions & More Information. Fault Tolerance in a High Volume, Distributed System (http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html). Making the Netflix API More Resilient (http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html). Ben Christensen, @benjchristensen, http://www.linkedin.com/in/benjchristensen