© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. S U M M I T
Deep Dive in Cloud Monitoring
with Amazon EKS and Prometheus
Pahud Hsieh
Specialist SA, Serverless
Amazon Web Services
Kakashi Liu
Infra Lead
UmboCV
Amazon EKS in the Past Year
● Started in us-east-1 and us-west-2
● Released VPC CNI 1.0
● HIPAA support
● Released AMI build scripts on GitHub
● Released VPC CNI 1.1
● Enabled GPU Support
● Support API Aggregation
● Support HPA
● Support eu-west-1
● CLI support for writing the kubeconfig
● Support for Admission Controllers
Amazon EKS in the Past Year
● Released VPC CNI 1.2
● Allow for additional VPC CIDR ranges
● Support for us-east-2
● Official support for ALB Ingress
● Container Marketplace
● CloudMap Integration
● Support for AWS App Mesh
● Support for eu-central-1, ap-southeast-1, ap-southeast-2, ap-northeast-1
● Support for ap-northeast-2
● Added the SLA
Immediately after that
● Achieved ISO and PCI compliance
● Support for ap-south-1, eu-west-2, eu-west-3
● Released VPC CNI 1.3
● Added a new Quick Start
● Allowed private API Endpoints
● Launched an App Mesh controller at GA
● Public Preview for Windows nodes
● Deep Learning container launch
● Added 1.2 with a new cluster update API
● Released CSI Drivers for FSx and EFS
● Control plane logs
● Public Preview of A1 instances
● Released a Machine Learning Benchmark tool
CloudWatch Container Insights (preview)
Pod Metrics
• pod_cpu_reserved_capacity
• pod_cpu_utilization
• pod_cpu_utilization_over_pod_limit
• pod_memory_reserved_capacity
• pod_memory_utilization
• pod_memory_utilization_over_pod_limit
• pod_network_rx_bytes
• pod_network_tx_bytes
Other Metrics
• cluster_failed_node_count
• cluster_node_count
• namespace_number_of_running_pods
• node_cpu_limit
• node_cpu_reserved_capacity
• node_cpu_usage_total
• node_cpu_utilization
• node_filesystem_utilization
• node_memory_limit
• node_memory_reserved_capacity
• node_memory_utilization
• node_memory_working_set
• node_network_total_bytes
• node_number_of_running_containers
• node_number_of_running_pods
• service_number_of_running_pods
Reference - https://amzn.to/2HFtHDt
Amazon EKS and Prometheus
Why Prometheus?
● Community
● Number of integrations
● Ease of use
Why not Prometheus?
● You manage it yourself
● Complexity in large setups
Possibility: hybrid approach
● Use Prometheus to collect metrics that are exposed on /metrics endpoints
● Send a subset of critical metrics to Amazon CloudWatch or a third-party solution
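The hybrid approach can be pictured as a filter between the scraped /metrics text and a second backend. The following is a minimal sketch, not the speakers' implementation: `CRITICAL_METRICS`, `parse_line` and `select_critical` are hypothetical names, the parser assumes simple lines without timestamps, and the actual upload to Amazon CloudWatch or a third-party service is left out.

```python
# Sketch: pick the subset of scraped samples that should be forwarded elsewhere.
# All names here are hypothetical and exist only for this example.

CRITICAL_METRICS = {"http_requests_total"}  # assumption: what counts as "critical"


def parse_line(line):
    """Split one exposition line into (name, labels_text, value).

    Assumes the simple form `name{labels} value` with no trailing timestamp.
    """
    series, value = line.rsplit(" ", 1)
    name, _, labels = series.partition("{")
    return name, labels.rstrip("}"), float(value)


def select_critical(exposition_text):
    """Return the samples whose metric name is listed in CRITICAL_METRICS."""
    selected = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        name, labels, value = parse_line(line)
        if name in CRITICAL_METRICS:
            selected.append((name, labels, value))
    return selected
```

Everything else stays in Prometheus only; the selected tuples are what a forwarder would turn into calls to the second backend.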
Hello!
I am kakashi
- Infra Lead @Umbo CV
- Co-organizer @Golang Taipei Gathering
Traditional Solutions
Umbo Light
Agenda
Why monitoring
Umbo CV Monitoring pipeline
Prometheus: Why and What
Prometheus with EKS
Use cases
Why monitoring
● Alerting
● Long-term trends
Umbo CV Monitoring pipeline
Monitoring types
Infrastructure
Application
Application monitoring
[Diagram: containers and exporters on EC2 expose metrics at /metrics; a metrics store collects them and drives alerting.]
Prometheus: Why and What
● A graduated CNCF project.
● Can handle multi-dimensional metrics.
● Performance: can ingest millions of samples per second.
● Powerful query language: PromQL.
● Built-in alerting tool and service discovery mechanism.
Prometheus metrics
[Diagram: a user request reaches EC2 instances that each expose /metrics]
http_requests_total{code="200", path="/api/user"} 10
metric name: http_requests_total; labels: code, path; value: 10
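To make the three parts of a sample concrete, here is a small sketch that renders a metric name, a label set and a value into one line of the text exposition format. `format_sample` is a hypothetical helper written for this example; it assumes label values need no escaping.

```python
def format_sample(name, labels, value):
    """Render one sample as `name{label="value",...} value`.

    Assumption: label values contain no quotes, backslashes or newlines,
    so no escaping is performed.
    """
    label_text = ",".join(f'{key}="{val}"' for key, val in labels.items())
    return f"{name}{{{label_text}}} {value}"


line = format_sample("http_requests_total", {"code": "200", "path": "/api/user"}, 10)
print(line)
```

Each distinct combination of label values is its own time series, which is what makes the metrics multi-dimensional.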
PromQL example
Total requests / second
sum(rate(http_requests_total[5m]))
Total 5xx requests / second
sum(rate(http_requests_total{code=~"5.*"}[5m]))
Current percentage of errors across all instances
100 * sum(rate(http_requests_total{code=~"5.*"}[5m])) /
sum(rate(http_requests_total[5m]))
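As a rough illustration of what these queries compute, the sketch below derives a per-second rate from the first and last counter samples in a window and then expresses the error share as a percentage. It is a simplification: the real rate() also handles counter resets and extrapolates to the window edges. `simple_rate` and `error_percentage` are hypothetical names.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplified: ignores counter resets and window-edge extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def error_percentage(total_samples, error_samples):
    """Share of error requests among all requests, as a percentage."""
    total = simple_rate(total_samples)
    errors = simple_rate(error_samples)
    return 100.0 * errors / total if total else 0.0


# A 5-minute window (300 s): 3000 requests in total, 150 of them 5xx.
total_window = [(0, 1000), (300, 4000)]
error_window = [(0, 10), (300, 160)]
print(simple_rate(total_window))                     # requests per second
print(error_percentage(total_window, error_window))  # percent of 5xx responses
```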
Alerting rule
alert: Percentage_Of_Errors_Is_High
# 100 * turns the ratio into a percentage, so "> 5" means more than 5% errors
expr: 100 * sum(rate(http_requests_total{code=~"5.*"}[5m])) / sum(rate(http_requests_total[5m])) > 5
for: 5m
labels:
  severity: critical
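The `for: 5m` clause means the expression has to stay true for five minutes before the alert fires; until then the alert is only pending. A minimal sketch of that state logic, assuming a list of past rule evaluations (the function name and input shape are hypothetical):

```python
def alert_state(evaluations, for_seconds):
    """Return 'inactive', 'pending' or 'firing' after the last evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true), oldest first.
    """
    active_since = None
    for timestamp, is_true in evaluations:
        if is_true:
            if active_since is None:
                active_since = timestamp  # condition just became true
        else:
            active_since = None  # any false evaluation resets the timer
    if active_since is None:
        return "inactive"
    last_timestamp = evaluations[-1][0]
    return "firing" if last_timestamp - active_since >= for_seconds else "pending"
```

A short error spike that clears within the window therefore never pages anyone, which is the point of the clause.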
Prometheus with EKS
Prometheus ❤ EKS
● Monitoring system is critical.
● Running Prometheus on Kubernetes makes HA easy to achieve.
● The Prometheus Operator makes it even easier
○ Automated management and upgrades of Prometheus.
○ Native k8s configuration.
Install Prometheus on EKS with Helm
1. Install the Prometheus Operator chart
$ helm install --name prom --namespace monitoring stable/prometheus-operator
2. Verify
$ kubectl --namespace monitoring get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
alertmanager-prom-op-alertmanager-0                    2/2     Running   0          1m
prometheus-prom-op-prometheus-0                        3/3     Running   1          1m
prom-op-grafana-5c59ddfb9d-zqfqt                       2/2     Running   0          2m
prom-op-kube-state-metrics-76786cc9b4-8q4bj            1/1     Running   0          2m
prom-op-prometheus-node-exporter-6jclc                 1/1     Running   0          2m
prom-op-prometheus-node-exporter-bxr49                 1/1     Running   0          2m
prom-op-prometheus-operato-operator-6cbf5d5cfd-z6fz4   1/1     Running   0          2m
Prometheus Operator CRDs
● Prometheus & Alertmanager
○ Define the Prometheus and Alertmanager deployments.
● ServiceMonitor
○ Specifies how metrics from k8s services are scraped.
● PrometheusRule
○ Contains Prometheus alerting and recording rules that a Prometheus instance can load.
EKS cluster monitoring
EKS application monitoring through ServiceMonitor
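apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-servicemonitor
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
    - port: metrics  # assumption: name of the Service port that serves /metrics
[Diagram: two Services, one labeled app: api-server and one labeled app: api-server2; the selector above matches only the first.]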
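The selection rule behind `matchLabels` is plain equality: a Service is picked up only if it carries every listed label with the same value. A small sketch of that check, using the two label sets from the slide (`selector_matches` is a hypothetical name, not part of any Kubernetes client):

```python
def selector_matches(match_labels, service_labels):
    """True when every key/value in match_labels is present on the Service."""
    return all(service_labels.get(key) == value for key, value in match_labels.items())


selector = {"app": "api-server"}
print(selector_matches(selector, {"app": "api-server"}))   # Service labeled api-server
print(selector_matches(selector, {"app": "api-server2"}))  # Service labeled api-server2
```

Extra labels on the Service do not prevent a match; only the labels named in the selector are compared.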
Alerting by PrometheusRule
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-rules  # hypothetical name
spec:
  groups:
    - name: api.rules
      rules:
        - alert: Percentage_Of_Errors_Is_High
          # 100 * turns the ratio into a percentage, so "> 5" means more than 5% errors
          expr: 100 * sum(rate(http_requests_total{code=~"5.*"}[5m])) / sum(rate(http_requests_total[5m])) > 5
          for: 5m
          labels:
            severity: critical
Dashboard for EKS cluster
Monitoring camera detection pipeline
[Diagram: Media Server, CV Detection, API Server]
Monitoring camera detection pipeline
[Diagram: Media Server, CV Detection, API Server]
Metrics: # of frames, # of cv requests, # of events
Service discovery
[Diagram: Media Server, CV Detection, API Server]
Scraping through EC2 service discovery
Service discovery
[Diagram: Media Server, CV Detection and API Server being scraped]
global:
  scrape_interval: 1s
  evaluation_interval: 1s
scrape_configs:
  - job_name: 'node'
    ec2_sd_configs:
      - region: eu-east-1  # set this to the region of your instances; eu-east-1 is not an existing AWS region
        access_key: <ACCESS_KEY_HERE>
        secret_key: <SECRET_KEY_HERE>
        port: 9273
Application metrics
[Diagram: Media Server, CV Detection, API Server]
# of frames:
ms_frames_total{env="production", service="ms", cameraId="ID-123456"} 1000
# of cv requests:
cvreqest_total{env="production", service="cv", cameraId="ID-123456"} 300
# of events:
event_total{env="production", service="cv", cameraId="ID-123456"} 5
Dashboard
Alerting
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: camera-rules  # hypothetical name
spec:
  groups:
    - name: camera.rules
      rules:
        - alert: FpsLow
          annotations:
            message: "{{ $labels.cameraId }} fps is lower than 2fps"
          # sum by (cameraId) keeps the label, so each camera is evaluated on its own
          expr: sum by (cameraId) (rate(ms_frames_total{env="production", cameraId=~".+"}[10m])) < 2
          for: 30m
          labels:
            severity: critical
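Read per camera, the rule asks which cameras delivered fewer than 2 frames per second over the last 10 minutes. The sketch below mirrors that idea under stated assumptions: the frame rate is grouped by cameraId (an assumption about the intended grouping), each series is reduced to its first and last counter value in the window, and `low_fps_cameras` is a hypothetical name.

```python
from collections import defaultdict


def low_fps_cameras(series, window_seconds, threshold):
    """Return the camera IDs whose summed frame rate is below the threshold.

    series: list of (camera_id, counter_at_window_start, counter_at_window_end).
    Several series may share a camera_id; their rates are added together.
    """
    fps = defaultdict(float)
    for camera_id, first_value, last_value in series:
        fps[camera_id] += (last_value - first_value) / window_seconds
    return sorted(camera for camera, rate in fps.items() if rate < threshold)


# 10-minute window (600 s); camera IDs are made up for the example.
window = [("ID-1", 0, 600), ("ID-1", 0, 900), ("ID-2", 100, 700)]
print(low_fps_cameras(window, 600, 2.0))
```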
Thank you!