21.
Software Life Cycle: Contrived Lifecycle
[Chart: Readiness vs. Time. Stages: 1) Idea!; R&D; 2) Production Ready; 2.9) "It’ll be time to wind this service down when ___ happens and ___ comes online."; 3) End of Life]
22.
Software Life Cycle: Dose of Reality
[Chart: Production readiness vs. Time. Stages: 1) Idea!; R&D; 2) Production Ready; "Production Supported"; 3) "Oops"; 4) End of Life]
23.
Software Life Cycle: Do NOT Pass Go, No $200
[Chart: Production readiness vs. Time. Stages: 1) Idea!; R&D; "Production Supported"; forced to fix code or docs; N) End of Life]
24.
Software Life Cycle: Why the fails?
[Chart: Production readiness vs. Time. Stages: 1) Idea!; R&D; 2) Production Ready; "Production Supported"; [3,M) "Oops"; "Drug feet to produce docs."; N-1) "That’s it, we’ve had enough…"; N) End of Life]
25.
Software Life Cycle
[Chart: Production readiness vs. Time. Stages: 1) Idea!; R&D; 2) Production Ready; "Production Supported"; [3,M) "Oops"; N-2) "That’s it, we’ve had enough…"; N-1) "Just support it until the next version is out"; N) End of Life]
26.
Software Life Cycle: Detecting Problems Early
[Chart: Production readiness vs. Time. Stages: 1) Idea!; R&D; 2) Production Ready; "Production Supported"; 3) "Oops"; 4) End of Life. Annotation at the "Oops": "WTB Alerting Here"]
43. HASHICORP
CPU Scheduler
[Diagram: the CPU scheduler maps work (input) onto resources. Work: Web Server - Thread 1, Web Server - Thread 2, Redis - Thread 1, Kernel - Thread 1. Resources: CPU - Core 1, CPU - Core 2.]
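The diagram's idea (a scheduler mapping work onto resources) can be sketched as a toy assignment loop. The thread and core names are illustrative, and this is neither the kernel's nor Nomad's actual algorithm:

```python
from collections import deque

def schedule(threads, cores):
    """Assign each runnable thread to the next free core, FIFO order.

    Returns the core->thread assignment plus the work still queued,
    mirroring the diagram: work (input) on one side, resources on the other.
    """
    runnable = deque(threads)
    assignment = {}
    for core in cores:
        assignment[core] = runnable.popleft() if runnable else None
    return assignment, list(runnable)

# Work (input) and resources from the slide.
threads = ["web-server-1", "web-server-2", "redis-1", "kernel-1"]
cores = ["core-1", "core-2"]
running, queued = schedule(threads, cores)
# Two threads run; the rest wait for the scheduler's next pass.
```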
64.
Metrics: Gauge
•Counter - Monotonic Number
•Bytes transmitted
•Number of 2XX requests
•Gauge - Non-monotonic number
•Load average
•Number of services in a critical state
66.
Metrics: Histogram
•Counter - Monotonic Number
•Bytes transmitted
•Number of 2XX requests
•Gauge - Non-monotonic number
•Load average
•Number of services in a critical state
•Histograms - Distribution of Streams of Values
•Latency of an individual request
•Disk IO latency
•Bytes per response
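The three metric types above can be sketched in a few lines; the class names are made up for illustration and are not any particular client library's API:

```python
import bisect

class Counter:
    """Monotonic number: only ever goes up (bytes transmitted, 2XX count)."""
    def __init__(self):
        self.value = 0
    def incr(self, n=1):
        self.value += n

class Gauge:
    """Non-monotonic number: a point-in-time reading (load average)."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Histogram:
    """Distribution of a stream of values (per-request latency)."""
    def __init__(self):
        self.samples = []  # kept sorted for percentile lookups
    def observe(self, v):
        bisect.insort(self.samples, v)
    def percentile(self, p):
        # nearest-rank percentile over the sorted samples
        idx = min(len(self.samples) - 1, int(p / 100.0 * len(self.samples)))
        return self.samples[idx]
```

The payoff of the histogram is that it keeps the whole distribution: a 95th-percentile latency falls out of `percentile(95)` instead of being lost inside an average.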
73.
Data Sizes to Problem Specificity
[Chart: amount of data necessary to answer the question vs. scope or specificity of the question. As the question sharpens from "Is there a problem?" to "Where is the problem?" to "What is the problem?", the data required grows.]
84. $ nomad status atlas-4119-b246fd8fa2
ID = atlas-4119-b246fd8fa2
Name = atlas-4119
Type = service
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
console 0 0 1 0 0 0
frontend 0 0 2 0 0 0
worker 0 0 1 0 0 0
Allocations
ID Eval ID Node ID Task Group Desired Status Created At
24e12544 9fedfef9 b7d7483e console run running 01/25/17 23:14:28 UTC
87f46c82 9fedfef9 d6b60eb1 worker run running 01/25/17 23:14:28 UTC
d5ea84f2 9fedfef9 70ba3d96 frontend run running 01/25/17 23:14:28 UTC
eff8882a 9fedfef9 bbb7b28f frontend run running 01/25/17 23:14:28 UTC
WTF?
86. $ nomad alloc-status 87f46c82
ID = 87f46c82
Eval ID = 9fedfef9
Name = atlas-4119.worker[0]
Node ID = d6b60eb1
Job ID = atlas-4119-b246fd8fa2
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 01/25/17 23:14:28 UTC
Task "worker" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
47/256 MHz 218 MiB/2.0 GiB 0 B 0
Recent Events:
Time Type Description
01/25/17 23:19:36 UTC Started Task started by client
01/25/17 23:14:28 UTC Downloading Artifacts Client is downloading artifacts
01/25/17 23:14:28 UTC Received Task received by client
87. $ nomad alloc-status d5ea84f2
ID = d5ea84f2
Eval ID = 9fedfef9
Name = atlas-4119.frontend[1]
Node ID = 70ba3d96
Job ID = atlas-4119-b246fd8fa2
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 01/25/17 23:14:28 UTC
Task "frontend" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
370/1024 MHz 673 MiB/2.0 GiB 0 B 0 atlasfrontend: 10.151.2.227:80
Recent Events:
Time Type Description
01/25/17 23:19:18 UTC Started Task started by client
01/25/17 23:14:28 UTC Downloading Artifacts Client is downloading artifacts
01/25/17 23:14:28 UTC Received Task received by client
NOT STATIC
92. % cat ../modules/nomad-job/interface.tf
# *-descriptions taken from https://www.nomadproject.io/docs/agent/telemetry.html
variable "cpu-kernel-description" {
type = "string"
default = "Total CPU resources consumed by the task in the system space"
}
variable "cpu-throttled-periods-description" {
type = "string"
default = "Number of periods when the container hit its throttling limit (`nr_throttled`)"
}
variable "cpu-throttled-time-description" {
type = "string"
default = "Total time that the task was throttled (`throttled_time`)"
}
variable "cpu-total-percentage-description" {
type = "string"
default = "Total CPU resources consumed by the task across all cores"
}
93. variable "cpu-total-ticks-description" {
type = "string"
default = "CPU ticks consumed by the process in the last collection interval"
}
variable "cpu-user-description" {
type = "string"
default = "An aggregation of all userland CPU usage for this Nomad job."
}
variable "environment" {
type = "string"
}
variable "human_name" {
description = "The human-friendly name for this job"
type = "string"
}
variable "job_name" {
type = "string"
description = "The Nomad Job Name (or its prefix)"
}
94. variable "job_tags" {
type = "list"
description = "Tags that should be added to this job's resources"
}
variable "memory-cache-description" {
type = "string"
default = "Amount of memory cached by the task"
}
variable "memory-kernel-usage-description" {
type = "string"
default = "Amount of memory used by the kernel for this task"
}
variable "memory-max-usage-description" {
type = "string"
default = "Maximum amount of memory ever used by the task"
}
variable "memory-kernel-max-usage-description" {
type = "string"
default = "Maximum amount of kernel memory ever used by the tasks in this job."
}
95. variable "memory-rss-description" {
type = "string"
default = "An aggregation of all resident memory for this Nomad job."
}
variable "memory-swap-description" {
type = "string"
default = "Amount of memory swapped by the task"
}
variable "nomad-tags" {
type = "list"
default = [ "source:nomad" ]
}
variable "task_group" {
type = "string"
description = "The name of the task group"
}
96. % cat ../modules/nomad-job/stream-groups.tf
resource "circonus_stream_group" "cpu-kern" {
name = "${var.human_name} CPU Kernel"
description = "${var.cpu-kernel-description}"
group {
query = "*`${var.job_name}-${var.task_group}`cpu`system"
type = "average"
}
tags = [ "${var.nomad-tags}", "${var.job_tags}", "resource:cpu", "use:utilization" ]
# unit = "%"
}
resource "circonus_stream_group" "memory-rss" {
name = "${var.human_name} Memory RSS"
description = "${var.memory-rss-description}"
group {
query = "*`${var.job_name}-${var.task_group}`memory`rss"
type = "average"
}
tags = [ "${var.nomad-tags}", "${var.job_tags}", "resource:memory", "use:utilization" ]
}
97. resource "circonus_trigger" "rss-alarm" {
check = "${circonus_check.usage.checks[0]}"
stream_name = "${var.used_metric_name}"
if {
value {
absent = "3600s"
}
then {
notify = [
"${circonus_contact_group.circonus-owners-slack.id}",
"${circonus_contact_group.circonus-owners-slack-escalation.id}",
]
severity = 1
}
}
if {
value {
# SEV1 if we're over 4GB
more = "${4 * 1024 * 1024 * 1024}"
}
...
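The trigger above pages at severity 1 either when the RSS stream has been absent for an hour or (in the truncated second rule) when it exceeds 4 GB. A sketch of that evaluation logic, using a hypothetical `evaluate` helper rather than Circonus's implementation:

```python
SEV1_ABSENCE_SECS = 3600                 # absent = "3600s"
SEV1_RSS_BYTES = 4 * 1024 * 1024 * 1024  # more = 4 GB

def evaluate(last_seen_ts, last_value, now):
    """Return severity 1 when either rule fires, else None."""
    if now - last_seen_ts > SEV1_ABSENCE_SECS:
        return 1  # stream went silent: the job may be gone, not healthy
    if last_value > SEV1_RSS_BYTES:
        return 1  # over the 4 GB RSS ceiling
    return None
```

The absence rule matters as much as the threshold: a job that stops reporting is an incident, not a quiet success.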
106.
Parting Thoughts
•Be an engineer. Put rigid constraints around your app.
•Don't confuse static with rigid.
•Work top to bottom.
•Develop an error budget and prioritize.
•Be consistent in your observability regimen.
107.
Parting Thoughts
•Expose HTTP Endpoints for stats (both monotonic counters and gauges)
•Trap metrics to a broker frequently enough to build a histogram (e.g. every 100ms)
•Expose or export JSON Histograms
•Valuable metrics tend to record the behavior of edges, not vertices
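The first bullet can be sketched with the standard library alone; the `/stats` path and metric names here are invented for illustration:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process registry: counters are monotonic, gauges are not.
STATS = {
    "counters": {"bytes_tx": 0, "requests_2xx": 0},
    "gauges": {"load_avg": 0.0},
}

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/stats":
            self.send_error(404)
            return
        body = json.dumps(STATS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve for real:
# HTTPServer(("127.0.0.1", 8080), StatsHandler).serve_forever()
```

A broker polling this endpoint on a short interval can then build the histograms server-side, per the second bullet.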