6. Yeah, but … what are you achieving?
I’m gonna need you to come in Sunday.
7. Some DevOps Metrics that Might Matter
Culture
e.g.
• Retention
• Satisfaction
• Callouts
Process
e.g.
• Idea-to-cash
• MTTR
• Delivery time
Quality
e.g.
• Tests passed
• Tests failed
• Best/worst
Systems
e.g.
• Throughput
• Uptime
• Build times
Activity
e.g.
• Commits
• Tests run
• Releases
Impact
e.g.
• Signups
• Checkouts
• Revenue
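As a rough illustration of how one of the Process metrics above could be derived from raw timestamps, the sketch below computes MTTR (mean time to recovery) over a list of incidents. The incident records and the field names `opened` and `resolved` are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta

def mean_time_to_recovery(incidents):
    """Average (resolved - opened) over resolved incidents; None if none are resolved."""
    durations = [i["resolved"] - i["opened"] for i in incidents if i.get("resolved")]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)

# Hypothetical sample data: two resolved incidents and one still open.
incidents = [
    {"opened": datetime(2016, 1, 4, 9, 0), "resolved": datetime(2016, 1, 4, 9, 30)},
    {"opened": datetime(2016, 1, 5, 14, 0), "resolved": datetime(2016, 1, 5, 15, 30)},
    {"opened": datetime(2016, 1, 6, 8, 0), "resolved": None},  # open incidents are ignored
]
mttr = mean_time_to_recovery(incidents)
```

Open incidents are skipped here; whether to count them is a design choice that changes what the metric means.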
8. Machine Data Is A Critical Source Of DevOps Metrics
From every tool, every process, every component, on-prem or off
9. Industry Leading Platform for Machine Data
Any Machine Data: Online Services, Web Services, Servers, Security, GPS Location, Storage, Desktops, Networks, Packaged Applications, Custom Applications, Messaging, Telecoms, Online Shopping Cart, Web Clickstreams, Databases, Energy Meters, Call Detail Records, Smartphones and Devices, RFID
Datacenter, Private Cloud, Public Cloud
Enterprise Scalability, Search and Investigation, Proactive Monitoring, Operational Visibility, Real-time Business Insights
Operational Intelligence
10. Visibility Across the Ops Environment
API, SDKs, UI
Server, Storage, Network
Server Virtualization
Operating Systems
Infrastructure Applications
Custom Applications
Mobile Applications
API Services
Cloud Services
Ticketing/Help Desk
Other Tools
No rigid schemas – add in data from any other source.
11. Visibility Across the Dev Lifecycle
API, SDKs, UI
Plan, Code, Build, Test/QA, Stage, Release, Config, Monitor
Escalation/Collaboration
Other Tools
No rigid schemas – add in data from any other source.
21. What Data Can You Splunk?
Code Review
Which code has already been reviewed for this release/sprint? Who has completed the most code reviews? What code has NOT been reviewed?
Version Control
Who is changing files? What kinds of files are being changed? What branches are most active? What types of activities are occurring for a branch?
CI / Build Server
How many builds completed today/this week/this month? Which check-in kicked off this build? Which tests ran against this failed build?
Task Tracking
Which tasks are assigned to which developers? What progress is being made to complete assigned tasks? What tasks remain for this release/sprint?
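To make the CI questions above concrete, here is a minimal sketch of the kind of aggregation they imply, written in plain Python rather than as a Splunk search. The event shape and the field names `build_id`, `commit`, `status` and `finished` are assumptions for illustration; real build servers emit different fields.

```python
from collections import Counter

# Hypothetical build events, as they might look after being parsed from CI logs.
build_events = [
    {"build_id": 101, "commit": "a1b2c3", "status": "SUCCESS", "finished": "2016-03-01T10:15:00"},
    {"build_id": 102, "commit": "d4e5f6", "status": "FAILURE", "finished": "2016-03-01T16:40:00"},
    {"build_id": 103, "commit": "0a9b8c", "status": "SUCCESS", "finished": "2016-03-02T09:05:00"},
]

def builds_completed_per_day(events):
    """Count finished builds per calendar day (first 10 chars of the ISO timestamp)."""
    return Counter(e["finished"][:10] for e in events)

def commit_for_build(events, build_id):
    """Return the check-in that kicked off a given build, or None if unknown."""
    for e in events:
        if e["build_id"] == build_id:
            return e["commit"]
    return None

per_day = builds_completed_per_day(build_events)
failed_commits = [e["commit"] for e in build_events if e["status"] == "FAILURE"]
```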
31. HTTP Event Collector
Agentless, direct data onboarding via a standard API
Applications, IoT Devices
curl -k https://<host>:8088/services/collector -H 'Authorization: Splunk <token>' -d '{"event":"Hello Event Collector"}'
Scales to Millions of Events/Second
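The same request as the curl command above can be sketched from application code. The example below only builds the request with the Python standard library and does not send it; the host name and token value are placeholders, and the optional `sourcetype` argument is an addition for illustration.

```python
import json
import urllib.request

def build_hec_request(host, token, event, sourcetype=None, port=8088):
    """Build (but do not send) a POST to the HTTP Event Collector endpoint."""
    payload = {"event": event}
    if sourcetype is not None:
        payload["sourcetype"] = sourcetype
    return urllib.request.Request(
        url=f"https://{host}:{port}/services/collector",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",  # token-based auth, no user credentials
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder host and token, for illustration only.
req = build_hec_request(
    "splunk.example.com",
    "00000000-0000-0000-0000-000000000000",
    "Hello Event Collector",
)
# urllib.request.urlopen(req) would send it, given a reachable host with a trusted certificate.
```

Unlike `curl -k`, this sketch leaves certificate verification on, so a self-signed certificate would need to be trusted explicitly.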
33. Splunk App for Stream
Enhance Operational Intelligence With Wire Data Capture
1. Enables real-time insights into private, public and hybrid cloud infrastructures
2. Delivers rapid deployment, easy scale out and efficient wire data capture
3. Capture and analyze critical events not found in logs or with other collection methods
34. Splunk MINT for Mobile Data
• Deliver Better Performing, More Reliable Apps
• Deliver Real-Time Analytics
• Achieve End-to-End Visibility
37. Machine Data To Enable Continuous Improvement
Plan, Code, Build, Test/QA, Stage, Release, Config, Monitor
Application Development, Test and Acceptance, Production
• Defect Information
• Capacity Planning
• Quality Standards
• Enhancement Requests
• Integration Requirements
• Acceptance Metrics
• Service Levels and KPIs
• Infrastructure Dependencies
38. Increase Delivery Velocity
DevOps Teams Iterate with Continuous Insights
• Product Managers identify new opportunities
• Code continuously delivered to market
• Auditors have visibility
• Customers are happy
39. Improve Code Quality
[Diagram: code promotion pipeline]
• Developers check in code
• White Box: code quality scans, static security scans
• Black Box: automated acceptance tests, dynamic security scans, “Chaos Monkey” tests
• Production
• At each gate: Test Pass: Promote (finally, Promote to Production); Test Fail: Return
• QA Pattern Library, QA Prod Pattern: pattern library used for test and QA
42. Find and Fix Issues Faster
• Real-time dashboards show error rate in production and impact of pushing new builds
• Developers can search and visualize web logs, Java logs, event logs, etc.; trace transactions without complex instrumentation
• Alerts notify developers as soon as a problem arises
43. Push Better Code Using Analytics
• Gain end-to-end visibility to make informed decisions
• Analytics insights without the need for additional analytics tools
• Ask questions while exploring and collecting data
44. Powerful Platform for Enterprise Developers
Build Splunk Apps: Simple XML, JavaScript/CSS Extensions, Data Models, Search Extensibility, Modular Inputs, KV Store
Extend and Integrate Splunk: REST API, SDKs (C#, JavaScript, Python, Ruby, Java, PHP)
45. Splunk Developer Guidance
Splunk Reference Apps
Complete, working real-world Splunk solutions built together with partners (Conducive; Auth0)
– 2 (pseudo-) production releases
– entire code & test repos on GitHub
– under Apache 2.0
Associated Guidance
I. Start-to-Finish Journey Documentary
II. Essentials
dev.splunk.com/goto/devguide
48. Improved DevOps Agility
Key Customer Benefits
• Increased success rate of deployments
• Detect issues before they affect broad production
• Monitoring deployment process several times per day
“It’s like we were working without peripheral vision before and now we have it.”
- Robert Gonsalves, Web Operations
49. Deliver Better Code Quality
Key Customer Benefits
• Provide full visibility into QA sanity and load testing before production
• Exceed SLA thresholds with full visibility and benchmark key infrastructure metrics and errors
• Easier troubleshooting if tests do not contain the expected results
“Developers are now able to look for errors and troubleshoot issues five to ten times faster by having all their event data centralized in Splunk.”
- Principal Engineer, Apollo Group
50. Enable Data-driven Continuous Delivery
Key Customer Benefits
• Quickly validate and troubleshoot code pushes to production
• Ensure that new code does not negatively impact performance or user experience
• Reduced one application’s error rate by 2 orders of magnitude in a matter of weeks
“Dump all the logs into Splunk, and it starts looking like one big system, instead of a bazillion teeny ones that hate each other.”
- Alison Perkins, Senior Systems Engineer
53. Where to go for more Info
• DevOps Videos, Customer Stories, Whitepapers
– http://splunk.com/DevOps
• Developer Tutorials, Code Samples, Downloads
– http://dev.splunk.com
• Splunk Apps and Plugins
– https://splunkbase.splunk.com
• Blogs for Dev, Ops, and DevOps
– http://blogs.splunk.com
Splunk Enterprise is a fully featured platform for collecting, searching, monitoring and analyzing machine data and getting operational intelligence. You can monitor both real-time (as the data is streaming) and historical data. Splunk collects machine data securely and reliably from wherever it’s generated, in any format. It stores and indexes the data in real time in a centralized location and protects it with role-based access controls. You can troubleshoot your network problems and investigate security incidents in minutes (not hours or days). Monitor your end-to-end infrastructure to avoid service degradation or outages. Gain real-time visibility and critical insights into customer experience, transactions and behavior.
Splunk can provide insight across the entire application delivery lifecycle. Developers can search and visualize data from the entire build pipeline and production environments without needing to access production machines.
When using Splunk to mine machine data, our customers and prospects can:
1) INCREASE APP DELIVERY VELOCITY
2) IMPROVE CODE QUALITY
3) INCREASE BUSINESS IMPACT OF APPLICATION DELIVERY
Pro-actively identify business issues
Visualize the source of the issue
Collect & Analyze diagnostics data automatically
Take pro-active actions to mitigate the problem
Engage the right people immediately
Proactively Identify:
Stuck deals are identified before the call is made to support
Visualize: Business Activity Monitoring tracks deals; Yellow indicates where the deal is stuck. Red would indicate a catastrophic system failure.
The new app collects and analyzes performance data from Puppet Enterprise. Customers get visibility into critical services, such as PuppetDB, the Puppet Server, and console services. The app also helps to reduce troubleshooting times and proactively fix health issues in the Puppet environment, and includes the following insights:
Console services response times to benchmark and actively plan console resources
The number of request errors by Puppet clients to help recognize potential code or infrastructure issues
Role-based access control dashboards to monitor user activity, including authentication errors to help with potential security issues
Requests from PuppetDB to identify commonly executed or failed queries—isolating potential infrastructure bottlenecks
PuppetDB node deactivation activity for isolation of security or automation issues
Commonly submitted PuppetDB commands from client IPs to assist in pinpointing potential security issues
Puppet Server compilation metrics that help teams evaluate the health of their automation environment and appropriately assign resources
The Chef Analytics App for Splunk is available for free on Splunkbase, the Splunk app marketplace, and provides Chef users with visibility into metrics such as success/failure rates, most active users and most active organizations. It also helps you understand the frequency and details of errors across infrastructure so that you can catch and troubleshoot high-impact issues, like a major bug in a cookbook or an infrastructure issue like network connectivity, in real time.
More details: https://www.chef.io/blog/2015/04/17/integrating-chef-analytics-with-splunk/
Why Splunk for AWS?
Security Intelligence (CloudTrail, Config, CloudWatch, VPC)
Operational Intelligence (CloudWatch, Config, ELB, CloudFront)
DevOps Intelligence (CloudWatch, Lambda)
Big Data Insights (Kinesis, EMR, IoT, S3)
Data sources in Splunk App for AWS
AWS CloudTrail
Service that delivers logs of admin activity on AWS infrastructure
Examples:
Start/Stop/Create instance
Change of User roles/rights
Modification of Network Configuration
Delivers log files to customers; no UI, display, analysis, search
AWS Config
Provides resource inventory
Provides configuration history & change information
Enables security & governance
Amazon CloudWatch Metrics
Amazon CloudWatch VPC Flow Logs
IP traffic information to/from VPC network interfaces
Data stored and accessible from AWS CloudWatch Logs
AWS Access Logs
Elastic Load Balancing (ELB)
CloudFront CDN
S3
AWS Billing
Current Month via CloudWatch metrics
Monthly Detailed Billing
Secure - Supports TLS / SSL
Support for configurable data collection to simplify data classification and access control
index
sourcetype
source
Support for collection of container labels and env keys which can further help with data classification
Simple
Easy to set up, e.g. no need to deploy and scale a Splunk Universal Forwarder (UF)
Much easier to collect data in Splunk Cloud deployments
Scalable: built on top of HTTP Event Collector (HEC)
Syslog supports encryption via SSL as long as Splunk has a TCP SSL endpoint open. The difference is that with syslog over SSL there is no auth model other than the certificate itself, and you can only have one SSL certificate for all of Splunk. With HEC you have fine-grained control over which servers can send data and which indexes they can access, because that is configured per token.
Simplified classification by source and labels is also possible with the syslog driver. See labels and env in https://docs.docker.com/engine/admin/logging/overview/
Additional configuration is needed on the Splunk logging driver to see the “process” field; see https://docs.docker.com/engine/admin/logging/log_tags/. It is available by default in syslog.
Splunk Cloud: HEC is fully supported in Splunk Cloud. You cannot, however, arbitrarily open a TCP/UDP port in Splunk Cloud, which would be required if you needed to forward syslog data to a UF running in Splunk Cloud. Need to confirm how to scale HEC in Splunk Cloud.
https://docs.docker.com/engine/admin/logging/splunk/
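As a sketch of the configurable data collection described above, the helper below assembles `docker run` arguments for the Docker Splunk logging driver. The `--log-opt` keys (`splunk-url`, `splunk-token`, `splunk-index`, `splunk-sourcetype`, `splunk-source`, `labels`, `env`) are the driver's documented options. The helper name, URL, token, index, label and image values are hypothetical.

```python
def splunk_logging_args(url, token, index=None, sourcetype=None, source=None,
                        labels=(), env=()):
    """Return docker CLI arguments that route a container's logs to HEC."""
    opts = {"splunk-url": url, "splunk-token": token}
    if index:
        opts["splunk-index"] = index
    if sourcetype:
        opts["splunk-sourcetype"] = sourcetype
    if source:
        opts["splunk-source"] = source
    if labels:
        opts["labels"] = ",".join(labels)  # container labels to attach to events
    if env:
        opts["env"] = ",".join(env)        # env keys to attach to events
    args = ["--log-driver=splunk"]
    for key, value in opts.items():
        args += ["--log-opt", f"{key}={value}"]
    return args

# Hypothetical values; "nginx" stands in for any image.
cmd = ["docker", "run"] + splunk_logging_args(
    "https://splunk.example.com:8088",
    "00000000-0000-0000-0000-000000000000",
    index="docker",
    labels=["team"],
    env=["APP_ENV"],
) + ["nginx"]
```

Because index, sourcetype and source are set per container, access control can follow the same boundaries as the HEC tokens.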
HTTP Event Collector (EC) is an easy way to send data to Splunk Enterprise. Notably, the EC enables you to send data over HTTP/HTTPS directly to Splunk Enterprise from your application. The EC was developed with application developers in mind, so that all it takes is a few lines of code added to an app for the app to send data. Also, the EC is token-based, so you never need to hard-code your Splunk Enterprise credentials in your app or supporting files. HTTP Event Collector provides a new way for developers to send application logging and metrics directly to Splunk Enterprise via HTTP in a highly efficient, scalable and secure manner.
AWS Lambda is an Amazon Web Services compute service that runs your back-end code in response to events, and manages compute resources for you. In cooperation with Amazon, Splunk is pleased to provide a built-in AWS Lambda Node.js blueprint for HTTP Event Collector. The blueprint makes it easy to get started quickly, sending events from AWS Lambda to HTTP Event Collector running on Splunk Enterprise or Splunk Cloud. You can also write a Lambda function from scratch, either in JavaScript using Node.js or in Java. AWS Lambda can receive event data from Amazon Kinesis, Amazon DynamoDB, Amazon S3, and other Amazon services, and then send it on to HTTP Event Collector. You can collect the data using HTTP Event Collector in Splunk Cloud, which also runs on AWS, or in Splunk Enterprise on-premises.
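The transformation step such a function performs can be sketched as follows. This is not the blueprint itself: Python is used here only for illustration, the function name and the sourcetype value are hypothetical, and the sketch stops at building the request body. It assumes a Kinesis-triggered invocation, where each record carries base64-encoded data under `kinesis.data`, and relies on HEC accepting several event objects concatenated in one request body.

```python
import base64
import json

def kinesis_to_hec_body(lambda_event, sourcetype="aws:kinesis:demo"):
    """Turn a Kinesis-triggered Lambda event into one batched HEC request body."""
    parts = []
    for record in lambda_event.get("Records", []):
        # Kinesis delivers each record's payload base64-encoded.
        raw = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        parts.append(json.dumps({"event": raw, "sourcetype": sourcetype}))
    # HEC batching: event objects are concatenated, not wrapped in a JSON array.
    return "".join(parts)

# Hypothetical invocation payload with two records.
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(b"checkout ok").decode("ascii")}},
        {"kinesis": {"data": base64.b64encode(b"checkout failed").decode("ascii")}},
    ]
}
body = kinesis_to_hec_body(sample_event)
```

The resulting body would then be POSTed to the collector endpoint with the `Authorization: Splunk <token>` header.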
Splunk App for Stream is a free app that enables you to capture, visualize and analyze data in a much more granular way than ever before. You can see everything: all user and application behavior, response times from every layer, DNS information, storage traffic, network traffic, your website's content, connections. Once this data is in Splunk, you can correlate it with other data for much more comprehensive visibility. First, Splunk App for Stream is a way to get wire data into Splunk Enterprise. By adding this comprehensive source of machine data, it enables you to extend Operational Intelligence use cases across IT, security and the business. It is a software-only solution that can be installed on a VM or on any host, and it enables real-time insights into multi-cloud environments. As such, it is easy to install on most standard machines, and it is a passive, very efficient way to capture data.
To address the needs of developers, operations and product management, you need Operational Intelligence for your mobile apps. This is what we call mobile intelligence. Mobile intelligence provides real-time insight on how your mobile apps are performing, and can correlate with and enhance Operational Intelligence.
Splunk software enables organizations to search, monitor, analyze and visualize machine-generated data from websites, applications, servers, networks, sensors and mobile devices. Splunk MINT helps organizations monitor mobile app usage and performance, gain deep visibility into mobile app transactions and accelerate development
Deliver better performing, more reliable apps
When a user has a problem with a mobile app, the issue could be isolated or spread across all app versions, handsets and OS types. With Splunk MINT, you can see issues with app performance or availability in real time. Bugs can be addressed quickly, and app developers can gain a head start in creating and delivering valuable app updates.
Achieve End-to-End visibility
When mobile apps fail, there are many potential sources of failure. With Splunk MINT, you can analyze overall transaction performance. And using Splunk MINT, you can correlate this data with information from back-end apps to gain detailed insight on transaction problems. As a result, operations can reduce MTTR and better anticipate future mobile app back-end requirements.
Deliver real-time analytics
Mobile apps give enterprises new ways of conducting digital business. With mobile app information in Splunk Enterprise, you can correlate usage and performance information— some call this omni-channel analytics—to better understand how users are engaging all aspects of your organization.
Fast feedback obtained with analytics improves app delivery velocity. From code definition to production, giving all teams insight leads to fewer bugs, faster testing, faster releases, fewer production issues and accelerated innovation.
Deliver end-to-end visibility across every DevOps toolchain component
Iterate faster with correlated insight across the application delivery lifecycle
Improve DevOps team efficiency by measuring and benchmarking release contributions
Pinpoint and resolve code issues before they impact customers
Find and fix production issues faster
Use objective metrics to ensure code is operational and meets quality SLAs
Order flow, message queues, garbage collection, Java heap
Identify errors by Java class, thread
Alert actions: Jira ticket, ServiceNow ticket, webhook
PMs love to look at feature usage: are new features being used?
How do we allocate developer time to create/enhance features?
Need to check all of them.
NBC Universal’s Web Operations team is using the Splunk platform to track the performance of their releases in pre-production, QA and production. Splunk software frees up their DevOps teams and enables them to focus on innovation while pushing code to production daily.
The way they test these deployments is to release to half of their servers first, while those servers are out of production rotation but still accessible through other non-public DNS paths for testing. Their “Gamma” environment is a subset of production servers that have received the release for testing but aren’t taking public traffic. They use Splunk to monitor new errors on the servers after releasing changes to them, which is easy to do because the servers already had an established error volume baseline from being in production just before the release. If any new errors appear, or the volume of 400/500 errors suddenly increases after the release, they immediately see the spike in their monitoring. Alerting on thresholds for event log errors lets them roll back problematic deployments or diagnose a build in a subset of the production environment.
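The baseline comparison described above can be sketched in a few lines. This is an illustrative stand-in, not the team's actual monitoring: the function name, the error signatures, the counts and the `spike_factor` threshold are all assumptions.

```python
def new_or_spiking_errors(baseline, post_release, spike_factor=2.0):
    """Compare error counts from equal-length windows before and after a release.

    Returns {signature: "new" | "spike"} for errors that did not exist in the
    baseline or whose volume grew by at least spike_factor.
    """
    findings = {}
    for signature, count in post_release.items():
        before = baseline.get(signature, 0)
        if before == 0 and count > 0:
            findings[signature] = "new"
        elif before > 0 and count >= spike_factor * before:
            findings[signature] = "spike"
    return findings

# Hypothetical counts for a "Gamma" subset of servers.
baseline = {"HTTP 404": 40, "HTTP 500": 5}
after_release = {"HTTP 404": 42, "HTTP 500": 30, "NullPointerException": 3}
findings = new_or_spiking_errors(baseline, after_release)
```

A non-empty result is the signal to roll back or to keep the build confined to the subset while it is diagnosed.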
Splunk is Used Enterprise-wide at Apollo Group
Provide full visibility into sanity and load testing in QA environments before deploying to production
Using the Splunk SDK for Java to gather key machine metrics (CPU, memory, disk utilization) and log file errors; metrics are compared to benchmarks and SLA thresholds
Helps with troubleshooting if tests do not contain the expected results
Product Lifecycle Monitoring
Custom Application Monitoring
Mobile App Monitoring/Analytics
Customer Analytics
Security and Forensics/Analytics
Network Monitoring/Analytics
AWS Billing Monitoring/Analytics
Cloud and Physical Server Monitoring/Analytics
Issue resolution from days to minutes in both QA/staging and production environments – significant productivity/quality impact
ROI in the millions - Engineering/IT teams are significantly more productive & Fewer outages
Challenges:
No single place to access and visualize machine data
Manual diagnosing and searching through data generated by servers and applications
To retrieve information, sysadmins have to ssh into production machines before sending off to developers to grep through the logs
With Splunk:
Quickly validate and troubleshoot code pushes to production
Ensure that new code does not negatively impact performance or user experience
Reduced one application’s error rate by 2 orders of magnitude in a matter of weeks