Actionable Metrics
Enabling Decision-Making in Netflix’s Decentralized
                   Environment


              #lspe June 27, 2012
                 Roy Rapoport
         @royrapoport, rsr@netflix.com

                                                     1
Me

• Been in tech for about 20 years
• Systems engineering, networking, software
  development, QA, release management
• Time at Netflix: 1094 days (3y-2d)
• (Current) job at Netflix: Make things better
  (Security Monkey, Python Platform, Central Alert Gateway, ... )




                                                                    2
Metrics Humor



            % of instances with even public IP addresses




                                                                                                 3

I want to start with a joke. This may be the world’s longest joke, given that I started telling
it about a year ago.
I had just attended a presentation at Velocity 2011 where someone said “collect all the
metrics you possibly can, because you don’t know what will prove useful.”

Exposing the timeline, you can see that I’ve got more than a year’s worth of data here.

And that the numbers are actually constrained within a very small range.

And that I’m showing the percent of instances in our production environment that have
even -- that is, divisible by two -- public IP addresses.
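
For the curious, the joke metric is easy to reproduce. This is a minimal sketch, not the code behind the graph: the function name and the sample addresses (taken from the documentation-only IP ranges) are assumptions made for illustration.

```python
import ipaddress

def percent_even(ips):
    """Return the percentage of addresses whose integer value is divisible by two."""
    if not ips:
        return 0.0
    even = sum(1 for ip in ips if int(ipaddress.ip_address(ip)) % 2 == 0)
    return 100.0 * even / len(ips)

# Hypothetical sample; a real run would use the fleet's public IPs.
sample = ["203.0.113.10", "203.0.113.11", "198.51.100.4", "192.0.2.7"]
result = percent_even(sample)
```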
Technology Overview
          • SOA, REST, mostly Java
          • Simple overall architecture:




                                                                                         4

Going into the cloud, we went from heavy multi-purpose stacks to a highly distributed SOA
environment.

Tons of different services, dynamic binding and communication (no ESB)
Culture Overview
    • Freedom and
           Responsibility
    • Distributed
           Operations
    • Get out of the
           way of
           Developers



                                                                                                                                                          5
Freedom and Responsibility means hiring smart people and assuming they’ll make smart decisions. We want to empower them to do whatever the heck
they think they need to be doing to make the business succeed, which also means giving them all the tools THEY say they need to be successful. We bias
for innovation speed rather than safety, and we try to create highly aligned, but very loosely coupled, teams.

Distributed ops means No NOC. No staring at dashboards. CORE - Cross-functional, tools builders

Bottom line, it’s about getting out of the way of developers. We give them the tools they need, and the infrastructure pieces to make building whatever they
want to build easier, and then stand back and let them create whatever they want -- even and especially when what they want to build is, for example ...

A small moon.
The Metric Lifecycle

• Send
• Look
• Alert

                        6
Systems

               • Flexible
               • Scalable
               • Self-Service


                                                               7
Developers own sending metrics
Developers specify what metrics to send
Smart aggregation (critical given churn potential)
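
The slides don't spell out what "smart aggregation" does. One plausible reading, sketched below purely as an assumption, is rolling per-instance samples up to the cluster level so that a series survives instances being replaced. The tuple layout and instance IDs are hypothetical.

```python
from collections import defaultdict

def aggregate_by_cluster(samples):
    """Sum per-instance counter samples into per-(cluster, metric, timestamp) totals.

    samples: iterable of (cluster, instance_id, metric, timestamp, value).
    """
    totals = defaultdict(float)
    for cluster, _instance, metric, ts, value in samples:
        # The instance ID is deliberately dropped, so churn does not split the series.
        totals[(cluster, metric, ts)] += value
    return dict(totals)

samples = [
    ("core_tools", "i-aaa", "core_cag_api", 0, 5),
    ("core_tools", "i-bbb", "core_cag_api", 0, 7),
    ("core_tools", "i-ccc", "core_cag_api", 60, 9),  # earlier instances were replaced
]
cluster_series = aggregate_by_cluster(samples)
```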
Telemetry
                        Flexible, Scalable, Self-Service
 import netflix.metrics as NM
 [...]
     self.nm = NM.Metrics("core_cag")
 [...]
 def api(self):
     self.nm.counter("api")
     [...]
     app_label = "application_%s" % application
     self.nm.counter(app_label)
 [...]




                                                                                             8

On the fly definition of metrics
Very low barrier to entry
Java, Python, Perl (and if someone wanted to, they could -- pretty easily -- create a bash
interface)
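
To make "on the fly definition" concrete, here is a runnable sketch of the pattern on the slide. The `Metrics` class is a hypothetical in-memory stand-in, not the real `netflix.metrics` module, and joining the namespace and counter name with an underscore is an assumption based on the metric name `core_cag_api` that shows up later in the deck.

```python
from collections import Counter

class Metrics:
    """Hypothetical in-memory stand-in for a counter client (not netflix.metrics)."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = Counter()

    def counter(self, name, amount=1):
        # A metric comes into existence the first time it is incremented;
        # nothing has to be declared or registered up front.
        self.counts["%s_%s" % (self.namespace, name)] += amount


class Gateway:
    def __init__(self):
        self.nm = Metrics("core_cag")

    def api(self, application):
        self.nm.counter("api")                           # total call volume
        self.nm.counter("application_%s" % application)  # per-application volume


gw = Gateway()
for app in ["web", "web", "billing"]:  # hypothetical application names
    gw.api(app)
```

Sending the per-application counter next to the total costs one extra line at the call site and gives a per-application breakdown later.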
Visualization
                       Flexible, Scalable, Self-Service




                                                          9

GUI helper for creating graphs, URL-driven fetching
Three engines: highcharts, dygraphs, RRD
Flexible ‘today vs some other time’ capability
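
A 'today vs some other time' view boils down to pairing each point with the point a fixed offset earlier. The sketch below assumes a series stored as a dict of epoch seconds to values; the function name and data are made up for illustration and say nothing about how the actual graphing engines do it.

```python
def overlay(series, offset_seconds):
    """Pair each point with the point offset_seconds earlier.

    series: dict mapping epoch seconds -> value.
    Returns a time-ordered list of (timestamp, current, earlier) for the
    timestamps where both points exist.
    """
    return [
        (ts, value, series[ts - offset_seconds])
        for ts, value in sorted(series.items())
        if ts - offset_seconds in series
    ]

DAY = 86400
series = {0: 100, 3600: 120, DAY: 90, DAY + 3600: 150}
pairs = overlay(series, DAY)
```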
Alerting
                        Flexible, Scalable, Self-Service




  • Static vs Dynamic
       Thresholds
  • Historical Testing



                                                                                             10

For some patterns -- e.g. traffic -- your thresholds have to be dynamic. We played with
Holt-Winters, but found that it was expensive and took too long to calibrate. Double
Exponential Smoothing has worked really well for us.
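
As a rough illustration of a dynamic threshold, here is a textbook double exponential smoothing (Holt's linear method) forecaster with a percent-deviation check. It is a sketch under assumptions, not the production implementation: the `alpha` and `beta` defaults, the seeding from the first two points, and the `max_percent` parameter are all illustrative choices.

```python
def des_forecasts(values, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts from double exponential smoothing.

    Returns a list f where f[i] is the forecast for values[i] made from
    values[:i]. f[0] and f[1] are None because two points are needed to
    seed the level and the trend.
    """
    if len(values) < 2:
        return [None] * len(values)
    level, trend = values[1], values[1] - values[0]
    forecasts = [None, None]
    for x in values[2:]:
        forecasts.append(level + trend)
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts

def deviates(actual, forecast, max_percent):
    """True when actual is more than max_percent away from the forecast."""
    if forecast is None or forecast == 0:
        return False
    return abs(actual - forecast) / abs(forecast) * 100.0 > max_percent

linear = des_forecasts([10, 12, 14, 16, 18])
```

On a perfectly linear series the forecast tracks the data exactly, so only a real departure from the recent level and trend trips the check.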

But once you start doing more interesting alerting configuration, it’s harder to know whether
or not you’ve set your thresholds correctly -- and the ability to map your alert configuration
to historical metric values to see when your alert WOULD have triggered makes a huge
difference in making your first attempt at rational thresholds more likely to succeed.
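
Historical testing can be pictured as replaying a candidate rule over stored values and listing when it would have fired. The sketch below is a generic illustration: the rule, the window size, and the data are hypothetical, and the real tool's interface is not shown in the slides.

```python
def backtest(history, should_alert):
    """Replay a candidate alert rule over historical points.

    history: list of (timestamp, value) in time order.
    should_alert: callable taking (value, earlier_values) and returning bool.
    Returns the timestamps at which the rule would have fired.
    """
    fired = []
    earlier = []
    for ts, value in history:
        if should_alert(value, earlier):
            fired.append(ts)
        earlier.append(value)
    return fired

def above_twice_recent_mean(value, earlier, window=3):
    """Example rule: fire when value exceeds twice the mean of the last `window` points."""
    recent = earlier[-window:]
    if len(recent) < window:
        return False
    return value > 2 * (sum(recent) / len(recent))

history = [(0, 10), (60, 11), (120, 9), (180, 30), (240, 12), (300, 11)]
would_have_fired = backtest(history, above_twice_recent_mean)
```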
For Example ...
             Last 3 hours’ core_tools.core_cag_api




                                    What the ...




                                                                                            11

core_tools.core_cag_api is the alert volume through our Central Alerting Gateway (CAG). I
went to look for this graph for this presentation when I noticed we had dropped our volume
significantly over the last 20 minutes (the relatively flat last part of this graph). So I
expanded the time range to the last few days ...
For Example ...
                           Visualization (Continued)

              Last 4 days’ core_tools.core_cag_api




                              even more questions!



                                                                                            12

Which just raised more questions -- like what happened with the drop in alerts on Monday,
at 11AM and 11PM?

So let’s expand the range further back -- for the last two weeks ...
For Example ...
                           Visualization (Continued)

             Last 10 days’ core_tools.core_cag_api




                            What caused the spike?


                                                                                                 13

OK, so it looks like we had a spike in alerts starting about 10 days ago, and the drops
on Monday were just a return to normal volume. But what caused the spike
earlier?

The good news is that since we send metrics not just for alerts but for alerts per application
(see the earlier code example), we could see alert volume per application...
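
Once per-application counts exist, finding the culprit is a matter of comparing two time windows. A minimal sketch, with hypothetical application names and counts:

```python
def biggest_movers(before, after, top=3):
    """Rank applications by increase in alert count between two windows.

    before, after: dicts mapping application name -> alert count.
    Returns up to `top` (application, increase) pairs, largest increase first.
    """
    apps = set(before) | set(after)
    deltas = [(after.get(a, 0) - before.get(a, 0), a) for a in apps]
    deltas.sort(reverse=True)
    return [(a, d) for d, a in deltas[:top]]

before = {"web": 120, "billing": 40, "search": 75}
after = {"web": 125, "billing": 900, "search": 70}
movers = biggest_movers(before, after, top=1)
```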
For Example ...
                           Visualization (Continued)

                 Show alert volume per application




                     Someone had a rough few days...


                                                                                                14

The purple line is alerts for one of our applications -- which clearly had a pretty rough
few days.

Now that I have the answers, let’s make sure we alert on alert volume ...
Don’t Like Surprises...
{
    "alerts": [
        {
            "applyTo": "cluster",
            "condition": {
                 "minPercent": 90.0,
                 "noise": 0.2,
                 "maxPercent": 25.0,
                 "type": "DoubleExponential"
            },
            // here’s my number
            "overrides": {
                "api_key": "93528d3baa599b727097d73cfdbd5934"
            },
            "metricName": "core_cag_api",
            // so call me maybe
            "severity": "major"
        }
    ],
    "clusters": [ "core_tools" ]
}


                                                                15
I Didn’t Mention

           • End-to-end testing and alerting
           • External availability and performance
           • Events
           • Open Connect
           • Jobs

                                                                                    16

Things I didn’t talk about:

* end-to-end testing and visibility into transactions through our system;

* making sure our site’s available to the world;

* how we’ve promoted events into first-class monitored objects in our environment;

* monitoring our new CDN, the Open Connect platform;

* the jobs we have open, at http://jobs.netflix.com
I Can Haz Question?




      (Photo credit: My wife)




                                17

Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

LSPE Presentation: Actionable Metrics at Netflix

  • 1. Actionable Metrics: Enabling Decision-Making in Netflix’s Decentralized Environment. #lspe, June 27, 2012. Roy Rapoport, @royrapoport, rsr@netflix.com
  • 2. Me • Been in tech for about 20 years • Systems engineering, networking, software development, QA, release management • Time at Netflix: 1094 days (3y-2d) • (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, ...)
  • 5. Metrics Humor: % of instances with even public IP addresses. Notes: I want to start with a joke. This may be the world’s longest joke, given that I started telling it about a year ago. I had just attended a presentation at Velocity 2011 where someone said “collect all the metrics you possibly can, because you don’t know what will prove useful.” Exposing the timeline, see that I’ve got more than a year’s information here. And that the numbers are actually constrained within a very small range. And that I’m showing the percent of our instances in our production environment that have even -- that is, divisible by two -- public IP addresses.
  • 9. Technology Overview • SoA, REST, Mostly Java • Simple overall architecture (diagram on slide). Notes: Going into the cloud, we went from heavy multi-purpose stacks to a highly distributed SOA environment. Tons of different services, dynamic binding and communication (no ESB).
  • 14. Culture Overview • Freedom and Responsibility • Distributed Operations • Get out of the way of Developers. Notes: Freedom and Responsibility means hiring smart people and assuming they’ll make smart decisions. We want to empower them to do whatever the heck they think they need to be doing to make the business succeed, which also means giving them all the tools THEY say they need to be successful. We bias for innovation speed rather than safety, and we try to create highly aligned, but very loosely coupled, teams. Distributed ops means no NOC. No staring at dashboards. CORE: cross-functional, tools builders. Bottom line, it’s about getting out of the way of developers. We give them the tools they need, and the infrastructure pieces to make building whatever they want to build easier, and then stand back and let them create whatever they want -- even and especially when what they want to build is, for example ... a small moon.
  • 19. Systems • Flexible • Scalable • Self-Service. Notes: Developers own sending metrics. Developers specify what metrics to send. Smart aggregation (critical given churn potential).
  • 21. Telemetry: Flexible, Scalable, Self-Service. Code on slide (fragments; elisions as in the original):
        import netflix.metrics as NM
        [...]
        self.nm = NM.Metrics("core_cag")
        [...]
        def api(self):
            self.nm.counter("api")
            [...]
            app_label = "application_%s" % application
            self.nm.counter(app_label)
            [...]
    Notes: On-the-fly definition of metrics. Very low barrier to entry. Java, Python, Perl (and if someone wanted to, they could -- pretty easily -- create a bash interface).
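
The slide's fragments can be fleshed out into something runnable. The following is a minimal sketch only: `Metrics` here is a hypothetical in-memory stand-in, not the real `netflix.metrics` module, and the `AlertGateway` class, the application names, and the `namespace_name` key format are assumptions made for illustration.

```python
from collections import Counter


class Metrics:
    """Hypothetical in-memory stand-in for a metrics client (not netflix.metrics)."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.counts = Counter()

    def counter(self, name, amount=1):
        # No prior registration: the first use of a name defines the metric.
        self.counts[f"{self.namespace}_{name}"] += amount


class AlertGateway:
    """Assumed wrapper class, mirroring the shape of the slide's fragments."""

    def __init__(self):
        self.nm = Metrics("core_cag")

    def api(self, application):
        self.nm.counter("api")  # total alert volume
        self.nm.counter("application_%s" % application)  # per-application volume


gateway = AlertGateway()
for app in ["billing", "search", "billing"]:  # made-up application names
    gateway.api(app)
print(dict(gateway.nm.counts))
```

The point of the design is that the per-application counter name is built at call time, so a new application starts producing its own metric without anyone defining it in advance.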
  • 22. Visualization: Flexible, Scalable, Self-Service. Notes: GUI helper for creating graphs, URL-driven fetching. Three engines: Highcharts, dygraphs, RRD. Flexible “today vs some other time” capability.
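
A "today vs some other time" view boils down to pairing each point with the point a fixed offset earlier. The sketch below shows one way that could work on an in-memory series; the function name, the dict-of-timestamps representation, and the sample data are all assumptions, and nothing here reflects how the actual tool stores or fetches data.

```python
from datetime import datetime, timedelta


def compare_with_offset(series, start, end, offset):
    """Pair each point in [start, end) with the point `offset` earlier."""
    rows = []
    for ts in sorted(series):
        if start <= ts < end:
            earlier = series.get(ts - offset)  # None if no sample at that time
            rows.append((ts, series[ts], earlier))
    return rows


# Made-up hourly samples covering two days.
base = datetime(2012, 6, 26)
series = {base + timedelta(hours=h): 100 + h for h in range(48)}

today = base + timedelta(days=1)
rows = compare_with_offset(series, today, today + timedelta(days=1), timedelta(days=1))
for ts, now_value, then_value in rows[:3]:
    print(ts, now_value, then_value)
```

Keeping the offset as a parameter is what makes the comparison flexible: yesterday, last week, or any other interval uses the same code path.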
  • 25. Alerting: Flexible, Scalable, Self-Service • Static vs Dynamic Thresholds • Historical Testing. Notes: For some patterns -- e.g. traffic -- you’ve got to have the ability to have your thresholds be dynamic. We played with Holt-Winters, but found that it was expensive, and took too long to calibrate. We’ve found that Double Exponential Smoothing has worked really well for us. But once you start doing more interesting alerting configuration, it’s harder to know whether or not you’ve set your thresholds correctly -- and the ability to map your alert configuration to historical metric values to see when your alert WOULD have triggered makes a huge difference in making your first attempt at rational thresholds most likely to be successful.
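
To make the two ideas above concrete, here is a minimal sketch of double exponential smoothing (Holt's linear method) used as a dynamic threshold, plus a backtest that replays it over historical values. The smoothing constants, the relative `tolerance`, the function names, and the sample history are assumptions chosen for illustration; the talk does not describe the actual alerting implementation.

```python
def des_forecasts(values, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts from double exponential smoothing.

    forecasts[i] is the prediction for values[i] made from values[:i].
    The first two entries are None: level and trend need two points to start.
    """
    if len(values) < 2:
        return [None] * len(values)
    level, trend = values[1], values[1] - values[0]
    forecasts = [None, None]
    for x in values[2:]:
        forecasts.append(level + trend)
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts


def backtest(values, tolerance=0.25, **smoothing):
    """Return the indices where the alert WOULD have fired on historical data."""
    fired = []
    for i, (x, f) in enumerate(zip(values, des_forecasts(values, **smoothing))):
        if f is not None and f > 0 and abs(x - f) / f > tolerance:
            fired.append(i)
    return fired


history = [100, 102, 104, 106, 108, 40, 112, 114]  # made-up alert volumes
print(backtest(history))
```

On this history the dip at index 5 is the first point flagged. Points right after it can be flagged too, because the smoothed level takes a few samples to recover, which is exactly the kind of behavior a historical test surfaces before the alert goes live.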
  • 26. For Example ... Last 3 hours’ core_tools.core_cag_api. What the ... Notes: core_tools.core_cag_api is the alert volume through our Central Alerting Gateway (CAG). I went to look for this graph for this presentation when I noticed we had dropped our volume significantly over the last 20 minutes (the relatively flat last part of this graph). So I expanded the time range to the last few days ...
  • 27. For Example ... Visualization (Continued). Last 4 days’ core_tools.core_cag_api: even more questions! Notes: Which just raised more questions -- like what happened with the drop in alerts on Monday, at 11AM and 11PM? So let’s expand the range further back -- for the last two weeks ...
  • 28. For Example ... Visualization (Continued). Last 10 days’ core_tools.core_cag_api: What caused the spike? Notes: OK, that looks like basically we had a spike in alerts starting as of about 10 days ago or so, so the drops on Monday were just going back to normal volume. But what caused the spike earlier? The good news is that since we send metrics not just for alerts but for alerts per application (see the earlier code example), we could see alert volume per application...
  • 29. For Example ... Visualization (Continued). Show alert volume per application: Someone had a rough few days... Notes: The purple line is alerts for one of our applications -- which clearly had had a pretty rough few days. Now that I had the answers, let’s make sure we alert on alert volume ...
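
The per-application drill-down in this walkthrough was done visually, but the same question can be asked of the numbers. The sketch below ranks applications by how much their alert count grew between a baseline window and the spike window; the function name, application names, and counts are all invented for illustration.

```python
def rank_by_increase(baseline, current):
    """Sort applications by how much their alert count grew versus baseline."""
    apps = set(baseline) | set(current)
    deltas = {app: current.get(app, 0) - baseline.get(app, 0) for app in apps}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Made-up per-application alert counts for two windows.
baseline = {"billing": 40, "search": 35, "playback": 50}
current = {"billing": 42, "search": 900, "playback": 48}

ranked = rank_by_increase(baseline, current)
print(ranked[0])
```

This only works because the per-application counters exist in the first place, which is the argument the slides make for sending the extra metric.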
  • 30. Don’t Like Surprises... Alert configuration on slide:
        {
          "alerts": [
            {
              "applyTo": "cluster",
              "condition": {
                "minPercent": 90.0,
                "noise": 0.2,
                "maxPercent": 25.0,
                "type": "DoubleExponential"
              },
              // here’s my number
              "overrides": { "api_key": "93528d3baa599b727097d73cfdbd5934" },
              "metricName": "core_cag_api",
              // so call me maybe
              "severity": "major"
            }
          ],
          "clusters": [ "core_tools" ]
        }
  • 31. I Didn’t Mention • End-to-end testing and alerting • External availability and performance • Events • Open Connect • Jobs. Notes: Things I didn’t talk about: end-to-end testing and visibility into transactions through our system; making sure our site’s available to the world; how we’ve promoted events into first-class monitored objects in our environment; monitoring our new CDN, the Open Connect platform; the jobs we have open, at http://jobs.netflix.com
  • 32. I Can Haz Question? (Photo credit: My wife)