Join well-known industry thought leaders and experts from local New York companies for a half-day event focused on the latest DevOps practices.
2. JASON HAND |
DevOps Evangelist
• Over 17 years of experience as a
developer, system administrator,
and support specialist
• Fully immersed in the world of
agile development and the
DevOps movement with Colorado
tech startups
#DevOpsRoadTrip
@jasonhand
4. A little about VictorOps…
Incident Management platform combining
real-time data, rich context, automation, and
group collaboration enabling teams to handle
incidents as they occur as well as continuously
improve people, processes, and tools over
time.
23. Words are how we think – stories are how we link.
- Christina Baldwin
Oral narrative is and for a long time has been the
chief basis of culture itself.
- John D. Niles
Stories from the road
28. Jason Yee
Datadog
Technical Writer/Evangelist
Jason is a community builder,
developer, traveler and chef. He
currently works as a technical writer &
evangelist at Datadog. When he’s not
working you can often find him on a
plane trying to accumulate more miles
or in a restaurant cooking.
30. about me:
@GITBISECT
EVANGELIST / DOCUMENTARIAN
DEVOPSDAYS PDX / WEB
TRAVEL HACKER
about Datadog:
@DATADOGHQ
SAAS BASED MONITORING
TRILLION DATA POINTS PER DAY
WE’RE HIRING! BIT.LY/DATADOG-JOBS
31. DAMON EDWARDS & JOHN WILLIS - DEVOPSDAYS LOS ANGELES
WHAT IS DEVOPS?
▸ Culture
▸ Automation
▸ Metrics
▸ Sharing
TW: @gitbisect @datadoghq
46. METRICS 101
HOW GRANULAR?
▸ AWS: “You can send custom metrics to
Amazon CloudWatch as frequently as you like,
but statistics will only be available at one
minute granularity.”
▸ MS Azure: 1m up to 24h; 1h up to 7d; 1d up
to 30d
▸ Google Stackdriver: 1m
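To make the one-minute granularity above concrete, here is a minimal sketch of what a rollup does to samples sent more often than once a minute. The function name `rollup_to_minutes` and the sample values are hypothetical, chosen for illustration; this is not any vendor's actual implementation.

```python
from collections import defaultdict
from statistics import mean


def rollup_to_minutes(points):
    """Group (unix_seconds, value) samples into one-minute buckets.

    Returns {minute_start: {"min", "max", "avg", "count"}}.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Truncate each timestamp to the start of its minute.
        buckets[ts - ts % 60].append(value)
    return {
        start: {
            "min": min(vals),
            "max": max(vals),
            "avg": mean(vals),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }


# Hypothetical samples sent every 15 seconds, with one spike at t=30.
samples = [(0, 10), (15, 20), (30, 90), (45, 20), (60, 30), (75, 50)]
stats = rollup_to_minutes(samples)
```

In this sketch the spike to 90 survives only in the `max` statistic of the first minute; the average for that minute is 35, which is why sub-minute detail can disappear once data is only available at one-minute granularity.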
50. Query Based Monitoring
“What’s the average throughput of
application:nginx per version?”
“Alert me when one of my pods from replication
controller:foo is not behaving like the others”
“Show me rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2….”
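The questions above all have the same shape: filter data points by tags, then aggregate per group. A rough sketch of that idea follows, using an in-memory list of tagged points. The data, the tag names, and the `query_avg` helper are assumptions made for illustration, not Datadog's query API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data points: a metric name, a value, and free-form tags.
points = [
    {"metric": "throughput", "value": 120,
     "tags": {"application": "nginx", "version": "1", "dc": "east"}},
    {"metric": "throughput", "value": 80,
     "tags": {"application": "nginx", "version": "1", "dc": "west"}},
    {"metric": "throughput", "value": 200,
     "tags": {"application": "nginx", "version": "2", "dc": "east"}},
    {"metric": "throughput", "value": 220,
     "tags": {"application": "nginx", "version": "2", "dc": "west"}},
    {"metric": "throughput", "value": 999,
     "tags": {"application": "redis", "version": "1", "dc": "east"}},
]


def query_avg(points, metric, where, group_by):
    """Average `metric` over points matching every tag in `where`, per `group_by` tag."""
    groups = defaultdict(list)
    for p in points:
        if p["metric"] != metric:
            continue
        if any(p["tags"].get(k) != v for k, v in where.items()):
            continue
        groups[p["tags"].get(group_by)].append(p["value"])
    return {key: mean(vals) for key, vals in groups.items()}


# "What's the average throughput of application:nginx per version?"
per_version = query_avg(points, "throughput", {"application": "nginx"}, "version")

# "... running my app version 2", broken out by data center.
v2_per_dc = query_avg(
    points, "throughput", {"application": "nginx", "version": "2"}, "dc"
)
```

Narrowing a question ("… across all data centers", "… running my app version 2") only changes the `where` and `group_by` arguments; nothing is keyed to individual hosts.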
52. METRICS 101
HOW LONG?
▸ AWS: “Metrics data are available for 2 weeks.
For longer term (>2 weeks) offline storage,
please retrieve your metrics using our
GetMetricStatistics API…”
▸ MS Azure: Minute: 48h; Hour: 30d; Day: 90d
▸ Google Stackdriver: “Time series data points
are kept for 6 weeks and then deleted.”
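As a small illustration of tiered retention, the sketch below encodes the Azure figures quoted above (minute data for 48h, hourly for 30d, daily for 90d) and reports the finest granularity still available for data of a given age. The names `AZURE_TIERS` and `finest_granularity` are hypothetical; this models the slide's numbers only and does not call any cloud API.

```python
from datetime import timedelta

# Retention tiers copied from the Azure bullet above: (granularity, kept for).
AZURE_TIERS = [
    ("minute", timedelta(hours=48)),
    ("hour", timedelta(days=30)),
    ("day", timedelta(days=90)),
]


def finest_granularity(age, tiers=AZURE_TIERS):
    """Return the finest granularity still retained for data of this age, or None."""
    # Tiers are ordered finest first, so the first match is the best available.
    for name, kept_for in tiers:
        if age <= kept_for:
            return name
    return None  # Older than every tier: the data has been dropped.


last_hour = finest_granularity(timedelta(hours=1))
last_week = finest_granularity(timedelta(days=7))
last_quarter = finest_granularity(timedelta(days=60))
last_year = finest_granularity(timedelta(days=365))
```

Under these assumptions, an incident from last week can only be reviewed at hourly resolution, and anything older than 90 days is gone unless it was exported beforehand.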
57. RECURSE UNTIL YOU FIND THE TECHNICAL CAUSES
58. A PARTICLE AT REST
OR IN MOTION
CONTINUES IN THAT
STATE UNLESS
COMPELLED BY
FORCES TO CHANGE IT.
Newton’s First Law of
Motion
62. Aaron Aldrich
Cage Data
Director of Support Services
Aaron Aldrich is the Director of Support Services
and DevOps Evangelist at Cage Data. A career
that’s been user-facing and
operationally focussed has given him an
empathetic view of technology’s role in our lives.
Aaron is a founding organizer of the DevOps CT
meetup. His writing can occasionally be found on
cagedata.com/blog if it’s longer than 140
characters or on Twitter @crayzeigh if shorter.
75. “Never let a serious crisis go to waste. What I mean
by that is it’s an opportunity to do things you think
you could not do before.”
-Rahm Emanuel
@CrayZeigh
95. Breakout Sessions
◻ Burnout – Aaron Aldrich
◻ Why + How ChatOps? – Jason Hand
◻ Learning Reviews – Dave Zwieback
◻ Monitoring – Jason Yee
97. Dave Zwieback
x.ai
VP of Engineering and Data Science
Dave Zwieback has been working with complex,
mission-critical I.T. services and teams for two
decades. His career spans small high-tech
startups, non-profits, and behemoth engineering,
financial services, and pharmaceutical firms. He
is currently the VP of Engineering and Data
Science at x.ai. Dave is the author of “Beyond
Blame: Learning From Failure and Success”
(http://bit.ly/beyondblame). You can follow
Dave @mindweather.
125. Time To Repair (TTR)
Continuous Improvement Efforts
Reactive (0 – 4): chaotic
Tactical (5 – 9): obvious
Integrated (10 – 14): complicated
Strategic (15 – 18): complex
✓ No automation
✓ No operational stack
awareness
✓ Poor collaboration between
teams (Dev & Ops)
✓ Documentation not available
✓ No standardized
communication
✓ High focus on consistent
continuous learning
✓ Uses a NOC
✓ Some monitoring & alerting
instrumentation
✓ Collaboration in crisis
✓ "Mission critical" processes are
available
✓ Understood crisis
communication protocols
✓ Remediation data available to
IT Operations
✓ Team rotations, paging
policies, role hunting
✓ Continuous improvement of
key health indicators
✓ Technical collaboration across
all incidents
✓ Docs up to date and easily
accessible
✓ Consistent real-time
communication practices
✓ Automated docs and remediation
✓ Actionable Alerts with full context
✓ High collaboration among all teams
✓ Documentation part of remediation
✓ Targeted, proactive crisis comms
✓ High focus on continuous learning
Incident Management Maturity
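The score bands on this slide can be read as a simple lookup from a total score to a maturity level. The sketch below assumes an integer score from 0 to 18; how that score is calculated is not stated on the slide, so `maturity_level` is a hypothetical helper that covers only the final mapping.

```python
# Score bands as shown on the slide: (inclusive range, level, descriptor).
LEVELS = [
    (range(0, 5), "Reactive", "chaotic"),
    (range(5, 10), "Tactical", "obvious"),
    (range(10, 15), "Integrated", "complicated"),
    (range(15, 19), "Strategic", "complex"),
]


def maturity_level(score):
    """Map an integer score (0 to 18) to its (level, descriptor) pair."""
    for band, level, descriptor in LEVELS:
        if score in band:
            return level, descriptor
    raise ValueError(f"score must be an integer from 0 to 18, got {score!r}")


team_a = maturity_level(3)
team_b = maturity_level(12)
```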