Splunk All the Things: Our First 3 Months Monitoring Web Service APIs
       Dan Cundiff (@pmotch) and Eric Helgeson (@nulleric)
       Target Corporation



Copyright © 2012 Splunk Inc.
2
Agenda

Context

Problem

Solution

Examples

In progress and future stuff

Lessons and challenges

                               3
Context: Enterprise Services @ Target
Data and transactional APIs for all the domains in our business
–   Products (inventory, price, description, etc)
–   Locations
–   Coupons
–   etc
APIs exposed inside and outside
Mostly RESTful APIs, some pub sub/messaging
Used by mobile devices, applications, partners on the outside, etc.
Constantly evolving, rapidly improving, all the time

                                           4
Problem
First API go-live:
–   Millions of log events per day (grep/cut/sed/awk not cutting it)
–   Logs scattered everywhere
–   Limited access to logs
–   Needed end to end visibility of web services
–   Needed ability to discover information in logs
–   Can we be proactive? Can we react faster?
Looming horizon:
– BILLIONS of log events coming
– Questions changing every day from business, support, execs, developers


                                          5
Solution: Gave Splunk a try
Installed Splunk on a lab server
Hooked up Splunk to the logs
Quickly created 15+ searches and reports
Generated a dashboard for visibility and trending
Total time to do all this in Splunk:


                      ~4 hours
                                       6
Why Splunk
Understanding what’s “normal”
– Identify tolerances
– Identify actionable events vs. anomalies
You don’t know what you don’t know
– …but Splunk can tell you what you don’t know




                                       7
Why Splunk, part 2
Indicators of when things are trending badly
– Proactive monitoring and recovery
– Standard deviations, percentage changes over time, outliers
Full stack visibility
–   API gateway
–   Network (load balancers, firewalls)
–   Web/app
–   OS
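
A minimal Python sketch of the "standard deviations, percentage changes over time, outliers" idea above. The per-minute request counts and the threshold of two standard deviations are illustrative assumptions, not values from the talk, and this is not the authors' Splunk search.

```python
from statistics import mean, pstdev

def find_outliers(counts, threshold=2.0):
    """Return (index, value) pairs more than `threshold` standard deviations from the mean."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [(i, c) for i, c in enumerate(counts) if abs(c - mu) > threshold * sigma]

def percent_change(previous, current):
    """Percentage change from one period to the next."""
    return (current - previous) / previous * 100.0

# Hypothetical requests-per-minute samples; index 5 is the spike.
requests_per_minute = [120, 118, 125, 122, 119, 480, 121, 117]
print(find_outliers(requests_per_minute))
print(percent_change(120, 150))
```

A fixed threshold is the simplest choice; a rolling window would follow daily traffic patterns more closely.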




                                          8
Why Splunk, part 3
Quick and flexible dashboards
Drill down
Community (Splunkbase, blogs, etc)
Google-able™
App store!




                                 9
Locations Service Examples
What is “normal”?
Volume




                 11
What is “normal”?, part 2
API response time SLAs
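
As a rough illustration of checking response times against an SLA, the Python sketch below computes a nearest-rank percentile and compares it to a threshold. The sample timings, the 200 ms SLA, and the choice of the 95th percentile are assumptions made for the example.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def sla_report(response_ms, sla_ms, pct=95):
    """Summarize whether the chosen percentile of response times meets the SLA."""
    observed = percentile(response_ms, pct)
    breaches = sum(1 for ms in response_ms if ms > sla_ms)
    return {
        "percentile": pct,
        "observed_ms": observed,
        "sla_ms": sla_ms,
        "met": observed <= sla_ms,
        "breach_ratio": breaches / len(response_ms),
    }

# Hypothetical response times in milliseconds for one API.
response_ms = [80, 95, 110, 70, 450, 90, 85, 100, 105, 75]
print(sla_report(response_ms, sla_ms=200))
```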




                         12
What is “normal”?, part 3
Errors happen, but what is acceptable?




                                 13
404s
~1700 errors once a day every week
404s for stores that don’t exist
Bot?
– Who are they?
– Malicious? Competitor? Individual?
– Reach out to understand why
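
To answer "who are they?", one approach is to count 404s per client. The Python sketch below assumes a flat log line with a leading timestamp followed by `client=`, `status=` and `path=` pairs. That layout and the field names are hypothetical, not Target's actual log format.

```python
from collections import Counter

def parse_kv(line):
    """Split 'timestamp k=v k=v ...' into a dict; the timestamp is stored under 'ts'."""
    ts, *pairs = line.split()
    event = {"ts": ts}
    for pair in pairs:
        key, _, value = pair.partition("=")
        event[key] = value
    return event

def top_404_clients(lines, limit=3):
    """Return the clients responsible for the most 404 responses."""
    events = (parse_kv(line) for line in lines)
    counts = Counter(e["client"] for e in events if e.get("status") == "404")
    return counts.most_common(limit)

# Hypothetical log lines: one client repeatedly asks for stores that do not exist.
sample_lines = [
    "2012-09-11T02:00:01 client=10.0.0.5 status=404 path=/locations/9999",
    "2012-09-11T02:00:02 client=10.0.0.5 status=404 path=/locations/9998",
    "2012-09-11T02:00:03 client=10.0.0.9 status=200 path=/locations/1375",
    "2012-09-11T02:00:04 client=10.0.0.9 status=404 path=/locations/0",
]
print(top_404_clients(sample_lines))
```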




                                       14
Understanding consumers
Who is using it, and how?
What’s their experience?




                                15
Understanding consumers, part 2
Load testing in production?




                              16
Understanding infrastructure
Expected design vs actual implementation
Not balancing workload as expected
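
A small Python sketch of checking whether requests are spread evenly across servers. The host names, the sample distribution, and the 10-point tolerance are assumptions for illustration.

```python
from collections import Counter

def workload_share(hosts):
    """Fraction of requests handled by each host, given one host name per request."""
    counts = Counter(hosts)
    total = sum(counts.values())
    return {host: count / total for host, count in counts.items()}

def is_balanced(shares, tolerance=0.10):
    """True when every host is within `tolerance` of an even split."""
    expected = 1 / len(shares)
    return all(abs(share - expected) <= tolerance for share in shares.values())

# Hypothetical host field extracted from 100 request log events.
hosts = ["app01"] * 70 + ["app02"] * 20 + ["app03"] * 10
shares = workload_share(hosts)
print(shares, is_balanced(shares))
```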




                                17
Understanding providers
How are providers responding?
Is overhead added to the API response?




                                 18
Requirements feedback loop
Requirement: 200 tps
Actual: ~20 tps
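
Measuring actual transactions per second from event timestamps could look like the Python sketch below. The timestamps are made up, and the input is assumed to be ISO 8601 strings without a timezone suffix.

```python
from collections import Counter
from datetime import datetime

def transactions_per_second(timestamps):
    """Return peak and average TPS for a list of ISO 8601 event timestamps."""
    per_second = Counter(
        datetime.fromisoformat(ts).replace(microsecond=0) for ts in timestamps
    )
    peak = max(per_second.values())
    # Span covers the first through the last second, inclusive.
    span = (max(per_second) - min(per_second)).total_seconds() + 1
    return {"peak_tps": peak, "avg_tps": len(timestamps) / span}

# Hypothetical event times for one API.
timestamps = [
    "2012-09-11T14:00:00.100",
    "2012-09-11T14:00:00.600",
    "2012-09-11T14:00:01.200",
    "2012-09-11T14:00:03.000",
]
print(transactions_per_second(timestamps))
```

Comparing the measured peak with the stated requirement shows how far the requirement is from real usage.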




                       19
Business intelligence from APIs
Where are people searching?
Where should we build our next store?
How far are people traveling?
What time of day?
Mobile vs website?
iOS vs Android?
International?
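
For "How far are people traveling?", one plausible calculation is the great-circle distance between the location a user searched from and the store that was returned. The Python sketch below uses the haversine formula. The coordinates are illustrative, and nothing here is taken from the authors' dashboards.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical search origin and store coordinates (decimal degrees).
search_origin = (44.98, -93.27)
store = (44.95, -93.09)
print(round(haversine_km(*search_origin, *store), 1), "km")
```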



                                20
Metrics for APIs
(source: http://blog.programmableweb.com/2012/08/02/the-api-measurement-secret-know-what-metrics-matter/)
Traffic Metrics
–   Total calls
–   Top methods
–   Call chains
–   Quota faults
Service Metrics
–   Performance
–   Availability
–   Error rates
–   Code defects
Support Metrics
–   Support tickets
–   Response time
–   Community metrics
Developer Metrics
–   Total developer count
–   Number of active developers
–   Top developers
–   Trending apps
–   Retention
Marketing Metrics
–   Developer registrations
–   Developer portal funnel
–   Traffic sources
–   Event metrics
Business Metrics
–   Direct revenue
–   Indirect revenue
–   Market share
–   Costs
                                                   21
In progress and future stuff
Splunk all the things
Consumer apps
Provider systems
OS, firewalls, proxies
External API gateway logs
Anything in between (middleware, integrations, etc)
Correlate with logs from apps degrees away (e.g. .com web logs)


Development (perf test results, git, Jenkins/CI, wiki, etc)
Dashboards
Global dashboard summarizing all APIs
BI dashboards
Executive dashboards




                                24
Dashboards, part 2
Environment dashboards for each API
–   CI
–   Test
–   Stage
–   Prod




                                25
Dashboards, part 3
Alert trending dashboards for each API




                                 26
Splunking Continuous Integration
Drill down into CI results linked straight from Jenkins
– Filtered by date OR transaction GUID




                                         27
Splunking Continuous Integration, part 2
We practice code as documentation
On every commit, Jenkins runs, extracts documentation from the code, and puts it in the respective wiki pages (pretty cool! Automated, no humans)
Splunk monitors wiki changes using the MediaWiki API
Monitor CI + human wiki changes


https://github.com/pmotch/wikislurp
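
A hedged Python sketch of polling wiki changes through the MediaWiki API and turning them into flat log lines. It is not the wikislurp implementation. The `api_base` URL and the output field names are assumptions, while `action=query` with `list=recentchanges` is the standard MediaWiki query for recent edits. No network call is made here; the response is a hand-written sample in the shape that query returns.

```python
import json
from urllib.parse import urlencode

def recent_changes_url(api_base, limit=50):
    """Build a MediaWiki API URL that lists recent changes as JSON."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|user|timestamp|comment",
        "rclimit": limit,
        "format": "json",
    }
    return f"{api_base}?{urlencode(params)}"

def to_log_lines(response_text):
    """Convert a recentchanges JSON response into timestamp-first k=v lines."""
    changes = json.loads(response_text)["query"]["recentchanges"]
    return [
        f'{c["timestamp"]} source=wiki user="{c["user"]}" title="{c["title"]}"'
        for c in changes
    ]

# Hypothetical wiki endpoint and a sample response body.
print(recent_changes_url("https://wiki.example.com/api.php", limit=10))
sample_response = json.dumps({"query": {"recentchanges": [
    {"type": "edit", "title": "Locations API", "user": "jenkins",
     "timestamp": "2012-09-11T14:00:00Z", "comment": "auto-generated docs"},
]}})
print(to_log_lines(sample_response))
```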



                                  28
Common Logging Service
CLS is our strategy for getting logs from all places into Splunk
How
– Use UFs on end points everywhere
– Else, consolidate and mount Splunk
– Else, use CLS RESTful API
Enables end-to-end visibility
– Insert GUIDs across all the hops in the transaction
Use out of the box log formats (e.g. Log4j)
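
The GUID idea above can be sketched in Python: if every hop logs the same transaction GUID, grouping events by that GUID reconstructs the path and the end-to-end time. The `guid`, `tier` and `ts` field names and the sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def trace_transactions(events):
    """Group events by GUID and report the ordered hop path and elapsed milliseconds."""
    by_guid = defaultdict(list)
    for event in events:
        by_guid[event["guid"]].append(event)

    traces = {}
    for guid, hops in by_guid.items():
        # Same-format ISO 8601 strings sort chronologically.
        hops.sort(key=lambda e: e["ts"])
        start = datetime.fromisoformat(hops[0]["ts"])
        end = datetime.fromisoformat(hops[-1]["ts"])
        traces[guid] = {
            "path": [h["tier"] for h in hops],
            "elapsed_ms": (end - start) // timedelta(milliseconds=1),
        }
    return traces

# Hypothetical events from three tiers, arriving out of order.
events = [
    {"guid": "a1", "tier": "provider", "ts": "2012-09-11T14:00:00.120"},
    {"guid": "a1", "tier": "gateway", "ts": "2012-09-11T14:00:00.000"},
    {"guid": "a1", "tier": "app", "ts": "2012-09-11T14:00:00.040"},
]
print(trace_transactions(events))
```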



                                        29
Lessons and challenges
Lessons
RTFM
– Keep logs flat
– Keep timestamp (ISO8601) at the beginning
– k=v
Iterate quickly, push to prod; minimal tweaks to Splunk
Flatten out of box audit events (XML)
– Toggle at runtime
Don’t reinvent the wheel; use what your system provides. Splunk can handle it!
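
The logging lessons above (flat lines, ISO 8601 timestamp first, k=v pairs, flattened XML audit events) might look like this in Python. The field names and the sample XML are invented for the example, and the dotted-path key scheme is one possible convention rather than the authors' own.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def format_event(fields, now=None):
    """Render one flat log line: ISO 8601 timestamp first, then k=v pairs."""
    now = now or datetime.now(timezone.utc)
    pairs = " ".join(
        f'{key}="{value}"' if " " in str(value) else f"{key}={value}"
        for key, value in fields.items()
    )
    return f"{now.isoformat()} {pairs}"

def flatten_xml(xml_text):
    """Flatten the leaf elements of an XML event into a dict keyed by dotted path."""
    flat = {}

    def walk(node, prefix):
        path = f"{prefix}.{node.tag}" if prefix else node.tag
        children = list(node)
        if not children:
            flat[path] = (node.text or "").strip()
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return flat

# Hypothetical audit event as emitted by some gateway, flattened to one line.
audit_xml = "<audit><user>svc</user><op><name>GET</name></op></audit>"
print(format_event(flatten_xml(audit_xml)))
```

Values containing spaces are quoted so that each pair stays a single token.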


                                    31
Lessons, part 2
Don’t pre-optimize up front
–   Governance
–   Standards
–   Alerting
–   Access controls
Optimize as needed




                              32
Lessons, part 3
Create a community




                            33
Lessons, part 4
Create best practices, standards, etc in a wiki




                                    34
Challenges: Organizational
“Stop. We already have tools that do this. Use those.”
– tgtMAKE saves the day
– tgtMAKE = R&D
– R&D = $, servers, flak shelter, people network



“Make it real” strategy
– Demo to as many key players as possible
– Drum up interest
– Show actual value


                                       35
Challenges: Organizational, part 2




    http://knowyourmeme.com/photos/361379-shut-up-and-take-my-money

                                  36
Challenges: Organizational, part 3
The data can’t be trusted?




                             37
Challenges: OS
RHEL 6
SELinux
iptables
Install notes: http://nulleric.tumblr.com/post/13855621770/splunk-on-redhat-6-install-notes




                                 38
Challenges: Infrastructure
VM requirement
Adhering to MDHA requirements
Universal Forwarder skepticism




                                 39
Challenges: Logs on the outside
Universal Forwarders on servers that we don’t manage
Firewalls
Multi-layered DMZs




                                40
Challenges: Splunk
…




            41
Challenges: Splunk (err, improvements)
Index improvements
–   Cheap servers, can fail, can expand
–   Replication, N=3
–   Replicas on N-1 subsequent nodes
–   Data is always available; it rebalances across servers when they go down or the cluster expands
–   Multi-tenant
–   Think OpenStack Swift “Ring” concept or Cassandra
–   There’s that CAP Theorem thing; they say it’s a big deal.
GUI for deployment client configurations (lazy and for n00bs, we know)
Ability to extend charts with other libraries (like D3 or something)
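
The index wish list above (replication with N=3, replicas on the N-1 subsequent nodes, a "ring" as in OpenStack Swift or Cassandra) can be illustrated with a toy consistent-hash ring in Python. This is a sketch of the general technique, not how Splunk, Swift or Cassandra implement it. The node names and the bucket identifier are hypothetical.

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy hash ring: a key's primary node plus the next N-1 nodes hold its replicas."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Each node sits at one position on the ring, ordered by hash value.
        self.ring = sorted((self._hash(node), node) for node in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def nodes_for(self, bucket_id):
        """Return the primary node and the subsequent replica nodes for a bucket."""
        positions = [position for position, _ in self.ring]
        start = bisect_right(positions, self._hash(bucket_id)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

# Hypothetical indexer names; losing one node leaves two copies of each bucket.
ring = Ring(["idx1", "idx2", "idx3", "idx4", "idx5"])
print(ring.nodes_for("bucket-42"))
```

Because placement depends only on hash positions, adding or removing a node moves only the buckets adjacent to it on the ring.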

                                        42
Recap


Be bold. Tooling matters. Sell it.
Splunk all the things!
Iterate, adapt, change quickly.



                        43
We’re hiring
  (come talk to us)



          44
Questions?

    45

Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splunk .conf2012

  • 1. Splunk All the Things: Our First 3 Months Monitoring Web Service APIs Dan Cundiff (@pmotch) and Eric Helgeson (@nulleric) Target Corporation Copyright © 2012 Splunk Inc.
  • 2. 2
  • 3. Agenda Context Problem Solution Examples In progress and future stuff Lessons and challenges 3
  • 4. Context: Enterprise Services @ Target Data and transactional APIs for all the domains in our business – Products (inventory, price, description, etc) – Locations – Coupons – etc APIs exposed inside and outside Mostly RESTful APIs, some pub sub/messaging Used by mobile devices, applications, partners on the outside, etc. Constantly evolving, rapidly improving, all the time 4
  • 5. Problem First API go-live: – Millions of log events per day (grep/cut/sed/awk not cutting it) – Logs scattered everywhere – Limited access to logs – Needed end to end visibility of web services – Needed ability to discover information in logs – Can we be pro-active? Faster reactive? Looming horizon: – BILLIONS of log events coming – Questions changing everyday from business, support, execs, developers 5
  • 6. Solution: Gave Splunk a try Installed Splunk on a lab server Hooked up Splunk to the logs Quickly created 15+ searches and reports Generated a dashboard for visibility and trending Total time to do all this in Splunk: ~4 hours 6
  • 7. Why Splunk Understanding what’s “normal” – Identify tolerances – Identify actionable events vs. anomalies You don’t know what you don’t know – …but Splunk can tell you what you don’t know 7
  • 8. Why Splunk, part 2 Indicators when are things trending badly – Proactive monitoring and recovery – Standard deviations, percentage changes over time, outliers Full stack visibility – API gateway – Network (load balancers, firewalls) – Web/app – OS 8
  • 9. Why Splunk, part 3 Quick and flexible dashboards Drill down Community (Splunkbase, blogs, etc) Google-able™ App store! 9
  • 12. What is “normal”?, part 2 API response time SLAs 12
  • 13. What is “normal”?, part 3 Errors happen, but what is acceptable? 13
  • 14. 404s ~1700 errors once a day every week 404s for stores that don’t exist Bot? – Who are they? – Malicious? Competitor? Individual? – Reach out to understand why 14
  • 15. Understanding consumers Who and how is it being used? What’s their experience? 15
  • 16. Understanding consumers, part 2 Load testing in production? 16
  • 17. Understanding infrastructure Expected design vs actual implementation Not balancing workload as expected 17
  • 18. Understanding providers How are providers responding? Is overhead added to the API response? 18
  • 19. Requirements feedback loop Requirement: 200 tps Actual: ~20 tps 19
  • 20. Business intelligence from APIs Where are people searching? Where should we build our next store? How far are people traveling? What time of day? Mobile vs website? iOS vs Android? International? 20
  • 21. Metrics for APIs (source: http://blog.programmableweb.com/2012/08/02/the-api-measurement-secret-know-what-metrics-matter/) Traffic Metrics Service Metrics Support Metrics – Total calls – Performance – Support tickets – Top methods – Availability – Response time – Call chains – Error rates – Community metrics – Quota faults – Code defects Business Metrics Developer Metrics Marketing Metrics – Direct revenue – Total developer count – Developer registrations – Indirect revenue – Number – Developer portal funnel – Market share of active developers – Traffic sources – Costs – Top developers – Event metrics – Trending apps – Retention 21
  • 22. In progress and future stuff
  • 23. Splunk all the things Consumer apps Provider systems OS, firewalls, proxies External API gateway logs Anything in between (middleware, integrations, etc) Correlate with logs from apps degrees away (e.g. .com web logs) Development (perf test results, git, Jenkins/CI, wiki, etc)
  • 24. Dashboards Global dashboard summarizing all APIs BI dashboards Executive dashboards
  • 25. Dashboards, part 2 Environment dashboards for each API – CI – Test – Stage – Prod
  • 26. Dashboards, part 3 Alert trending dashboards for each API
  • 27. Splunking Continuous Integration Drill down into CI results linked straight from Jenkins – Filtered by date OR transaction GUID
  • 28. Splunking Continuous Integration, part 2 We practice code as documentation On every commit, Jenkins runs, extracts documentation from the code, and puts it in the respective wiki pages (pretty cool! – automated / no humans) Splunk monitors wiki changes using the MediaWiki API Monitor CI + human wiki changes https://github.com/pmotch/wikislurp
  • 29. Common Logging Service CLS is our strategy for getting logs from all places into Splunk How – Use UFs on end points everywhere – Else, consolidate and mount Splunk – Else, use CLS RESTful API Enables end-to-end visibility – Insert GUIDs across all the hops in the transaction Use out-of-the-box log formats (e.g. Log4j)
  • 31. Lessons RTFM – Keep logs flat – Keep timestamp (ISO8601) at the beginning – k=v Iterate quickly, push to prod; minimal tweaks to Splunk Flatten out-of-the-box audit events (XML) – Toggle at runtime Don’t re-invent the wheel, use what your system provides, Splunk can handle it!
  • 32. Lessons, part 2 Don’t pre-optimize up front – Governance – Standards – Alerting – Access controls Optimize as needed
  • 33. Lessons, part 3 Create a community
  • 34. Lessons, part 4 Create best practices, standards, etc in a wiki
  • 35. Challenges: Organizational “Stop. We already have tools that do this. Use those.” – tgtMAKE saves the day – tgtMAKE = R&D – R&D = $, servers, flak shelter, people network Make it real strategy – Demo to as many key players as possible – Drum up interest – Show actual value
  • 36. Challenges: Organizational, part 2 http://knowyourmeme.com/photos/361379-shut-up-and-take-my-money
  • 37. Challenges: Organizational, part 3 The data can’t be trusted?
  • 38. Challenges: OS RHEL 6 SELinux Ipfw Install notes: http://nulleric.tumblr.com/post/13855621770/splunk-on-redhat-6-install-notes
  • 39. Challenges: Infrastructure VM requirement Adhering to MDHA requirements Universal Forwarder skepticism
  • 40. Challenges: Logs on the outside Universal Forwarders on servers that we don’t manage Firewalls Multi-layered DMZs
  • 42. Challenges: Splunk (err, improvements) Index improvements – Cheap servers, can fail, can expand – Replication, N=3 – Replicas on N-1 subsequent nodes – Data is always available, smooth out across servers if they go down or expand – Multi-tenant – Think OpenStack Swift “Ring” concept or Cassandra – There’s that CAP Theorem thing; they say it’s a big deal. GUI for deployment client configurations (lazy and for n00bs, we know) Ability to extend charts with other libraries (like D3 or something)
  • 43. Recap Be bold. Tooling matters. Sell it. Splunk all the things! Iterate, adapt, change quickly.
  • 44. We’re hiring (come talk to us)
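
The logging lessons from slides 29 and 31 (flat lines, ISO 8601 timestamp first, k=v pairs, one GUID carried across every hop) can be illustrated with a minimal Python sketch. This is not the Common Logging Service itself: the function names and the field names (`txn_id`, `hop`, `status`, `duration_ms`, `msg`) are hypothetical, chosen only to show the shape of such a line.

```python
import re
import uuid
from datetime import datetime, timezone


def format_event(txn_id, **fields):
    """Render one flat log line: ISO 8601 timestamp first, then k=v pairs.

    The field names passed in are illustrative assumptions, not a schema
    taken from the slides.
    """
    timestamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    pairs = [f"txn_id={txn_id}"]
    for key, value in fields.items():
        text = str(value)
        # Quote empty values and values containing whitespace so that
        # every pair stays a single, easily extracted token.
        if text == "" or re.search(r"\s", text):
            text = '"' + text.replace('"', "'") + '"'
        pairs.append(f"{key}={text}")
    return timestamp + " " + " ".join(pairs)


# key=value, where the value is either a quoted string or a bare token.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_event(line):
    """Split a flat line back into (timestamp, {key: value})."""
    timestamp, _, rest = line.partition(" ")
    fields = {key: value.strip('"') for key, value in PAIR.findall(rest)}
    return timestamp, fields


# One GUID is generated at the edge and reused by every hop, so the
# events from different layers can later be joined into one transaction.
txn_id = uuid.uuid4().hex
gateway_line = format_event(txn_id, hop="gateway", status=200, duration_ms=42)
provider_line = format_event(txn_id, hop="provider", status=200, msg="store lookup ok")
print(gateway_line)
print(provider_line)
```

Because each line is flat and self-describing, grouping events by the shared `txn_id` value is enough to reconstruct a request end to end, with no custom parsing rules per source.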

Editor's notes

  1. A big story to draw you in! Anonymized lat/long data of guests searching for stores in the last 15 minutes. If a store wasn’t near those 61 people in Idaho, did they go somewhere else to buy Tide, diapers, or socks? Conceptually, maybe we should build a store there (we don’t actually plan our stores with a sole data point like that, but it gives you an idea)?
  2. Here’s the context for all the material that follows. “Enterprise Services” program is all about…
  3. Logs scattered everywhere = complex ecosystem. Looming horizon = data explosion. Story: going live, millions of hits start coming in, try to figure out what is actually happening.
  4. 4 hours. No joke. We were drawn to innovate; just try something new and see what happens.
  5. “You don’t know what you don’t know, but Splunk knows what you don’t know.” – that is, Splunk can help by telling you and helping discover what you don’t know.
  6. Drill down: filter to the essential places across logs to troubleshoot or discover business intel. Community and Google-able: Splunkbase, documentation rules!, lots of Google results = good.
  7. HTTP errors in a 24-hour period; what is normal? 500s are bad. Many 500s early on, but corrected, and much lower now.
  8. A list of consumers of the Locations service over a 24-hour period. Story: identified a bad API key before the developer knew what was wrong.
  9. We’re taking a look at our infrastructure design because of this.
  10. Able to report on non-functional requirements. Going forward we can do a better job of not over-estimating infrastructure needs, thus saving a lot more money, not wasting idle inventory on the shelf, and opening the door to putting the right money in the right places.
  11. You saw the original map at the beginning of our presentation; as we expose more APIs, what can we learn from them?
  12. How are we adhering to this advice? We have accomplished many of these metrics already. Most of these are achievable with Splunk.
  13. The more you have in Splunk, the more complete the monitoring picture can be.
  14. Great for perf/load testing; see all the errors in one place. You can even put the Jenkins logs in Splunk and show the results across all APIs being developed.
  15. Allow apps to have multiple ways to get logs into Splunk. No UF on consumer devices. Build transactions across multiple layers of the infra. Use UFs on end points everywhere = FASTEST. Else, consolidate and mount Splunk = FAST. Else, use CLS RESTful API = SLOW.
  16. A flattering meme, but at that point, after the demos and the successful research, Splunk sells itself, and honestly at that point everyone is happy to move on, buy what’s needed, and get down to Splunking.
  17. “Nothing is wrong. Your data is wrong.” Getting people to trust what Splunk is telling us. Story about one of the nodes being down; initially people didn’t believe the data was right.
  18. If developers bring Splunk in, take time and educate ops people on how it all works so they understand how the infrastructure is different and how it should be built. We suspect, normally it’s the other way around.
  19. Get those indexers behaving like Swift or Cassandra: multi-tenant, N=3 replication of the data so cheap servers can fail, scale, etc.
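
The wish-list item above (N=3 replication, with replicas on the N-1 subsequent nodes, in the spirit of the Swift ring or Cassandra) can be sketched as a small hash ring in Python. This is only an illustration of the concept the speakers asked for, not a description of how Splunk indexers, Swift, or Cassandra actually place data; the `Ring` class, the node names, and the bucket key are all hypothetical.

```python
import hashlib
from bisect import bisect_right


def _position(key):
    """Map a string to a stable position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each key lives on N consecutive nodes (default N=3)."""

    def __init__(self, nodes, replicas=3):
        if replicas > len(nodes):
            raise ValueError("need at least as many nodes as replicas")
        self.replicas = replicas
        self._ring = sorted((_position(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def owners(self, key):
        """Return the primary node followed by the N-1 subsequent nodes."""
        size = len(self._ring)
        start = bisect_right(self._positions, _position(key)) % size
        return [self._ring[(start + i) % size][1] for i in range(self.replicas)]

    def without(self, node):
        """Return a new ring that models the given node having failed."""
        remaining = [name for _, name in self._ring if name != node]
        return Ring(remaining, self.replicas)


ring = Ring(["idx1", "idx2", "idx3", "idx4", "idx5"])
before = ring.owners("bucket-42")

# Simulate losing the primary: the two surviving replicas still hold the
# data, and the next node on the ring becomes the third copy.
after = ring.without(before[0]).owners("bucket-42")
print(before, after)
```

With this placement, the loss of any single cheap server leaves two copies of every key, and only the keys that server held need a new replica, which is the "smooth out across servers" behaviour the slide describes.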