SlideShare una empresa de Scribd logo
1 de 107
Descargar para leer sin conexión
1
AWS CloudWatch
Observability and Monitoring
2
Rick Hwang
rick_kyhwang@hotmail.com
2017/12/28
3
http://www.cwb.gov.tw/V7/observe/satellite/Sat_T.htm?type=1
4
https://env.healthinfo.tw/air/
5
CloudWatch
Overview, Event-Driven, Automation
AI / ML
6
Agenda
● CloudWatch Metric
● CloudWatch Dashboard
● CloudWatch Alarm
● CloudWatch Event / Rules
● CloudWatch Logs
7
● SNS: Simple Notification Service
● SES: Simple Email Service
● SQS: Simple Queue Service
● Lambda: Serverless
● Auto Scaling
● CloudTrail
Related AWS Services
8
Questions
● 怎麼知道系統的狀況?
● 系統的指標是怎麼來的?
● 系統有哪一些層級要知道?哪些人要知道?怎麼知道?
● 知道之後做什麼?怎麼做?主動、被動?
● 什麼是監、控?
9
How Amazon CloudWatch Works
CloudWatch Basic Concepts
10
11
EC2 Instances
Log Shipper
Logs
Log Groups
Log Stream A
Log Stream B
Log Stream C
Log Stream N
Alarms
Filters
[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]
/var/log/app/*.log
2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0
2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0
2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0
2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
Dashboard
Metrics
S3
Amazon ESLambda
SNS Topics
Export
Streaming
Push
Lambda
12
出處:AWS Summit 2016: Big Data Architectural Patterns and Best Practices
Key Points
13
● 產生結構化、有意義的 Log
○ 結構化: csv, json
○ 有意義: 可統計的資料 → sum, max, min, average, count …
○ 可以下 SQL
● 想想系統上線後需要知道什麼?這些東西哪裡來?
● 盡可能不要動用到 ETL (Extract, Transform, Load)
○ 成本很高、浪費
○ 維護成本
○ 溝通成本
14
CloudWatch Metrics
每個指標背後都有不同的故事
15
16Source: http://booklook.morningstar.com.tw/pdf/0139022.pdf
健檢報告的指標,都是經過無數 臨床經驗 (測試)
與科學實驗 (量測、觀察) 得來的。
Metric - CPU Utilization
17
UTC
CloudWatch Metric
18
● Period: 每次取樣的時間週期
○ EC2 預設為 5m (Free), 可以調整為 1m (另外計費)
○ ELB 預設為 1m
○ Custom metirc supports high resolution: 1s
● Statistics: 統計方式,不同指標有預設的方式
○ Sum
○ Average
○ Max
○ Min
○ Sample Count
● Unit: 單位
○ Percent
○ Count
○ Bytes
Wikipedia: 長尾
Statistics - Long Tail
19Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets
Metric Types
● Metrics Provided by AWS
● Custom Metric
○ 透過 AWS CLI / SDK 上傳取樣資料 (json) → 不好做,容易出錯
○ 透過 awslogs or CloudWatch Agent (New) 上傳到 CloudWatch Logs,自訂 Filter 產生 Metric
■ 流程長,但是不難做
■ 推薦這個做法
20
EC2 Metrics
每個指標背後代表不同的現象
21
Amazon EC2 Metrics and Dimensions
22
Metric Description
CPUUtilization The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing
power required to run an application upon a selected instance.
To use the percentiles statistic, you must enable detailed monitoring.
Depending on the instance type, tools in your operating system can show a lower percentage than CloudWatch when the
instance is not allocated a full processor core.
Units: Percent
DiskReadOps Completed read operations from all instance store volumes available to the instance in a specified period of time.
To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number
of seconds in that period.
Units: Count
DiskWriteOps Completed write operations to all instance store volumes available to the instance in a specified period of time.
To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number
of seconds in that period.
Units: Count
23
Metric Description
DiskReadBytes Bytes read from all instance store volumes available to the instance.
This metric is used to determine the volume of the data the application reads from the hard disk of the instance. This can be used to
determine the speed of the application.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
DiskWriteBytes Bytes written to all instance store volumes available to the instance.
This metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to
determine the speed of the application.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
24
Metric Description
NetworkIn The number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to a
single instance.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide
this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
NetworkOut The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from
a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this
number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Units: Bytes
NetworkPacketsIn The number of packets received on all network interfaces by the instance. This metric identifies the volume of incoming traffic in terms of
the number of packets on a single instance. This metric is available for basic monitoring only.
Units: Count
Statistics: Minimum, Maximum, Average
NetworkPacketsOut The number of packets sent out on all network interfaces by the instance. This metric identifies the volume of outgoing traffic in terms of
the number of packets on a single instance. This metric is available for basic monitoring only.
Units: Count
Statistics: Minimum, Maximum, Average
EC2 Metrics
● 預設 Period = 5min (Free)
○ Detail Monitoring: period = 1min ($$)
● memory, disk 不支援,需要透過其他方式
○ CloudWatch Agent (201712 release)
○ telegraf, collectd, cacti, nagios ….
25
ELB Metrics
負載平衡
26
Elastic Load Balancing Metrics and Dimensions
27
Metric Description
Latency [HTTP listener] The total time elapsed, in seconds, from the time the load balancer sent the request to a registered instance until the
instance started to send the response headers.
[TCP listener] The total time elapsed, in seconds, for the load balancer to successfully establish a connection to a registered
instance.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Average. Use Maximum to determine whether some requests are taking substantially longer
than the average. Note that Minimum is typically not useful.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1
instance in us-west-2a have a higher latency. The average for us-west-2a has a higher value than the average for us-west-2b.
RequestCount The number of requests completed or connections made during the specified interval (1 or 5 minutes).
[HTTP listener] The number of requests received and routed, including HTTP error responses from the registered instances.
[TCP listener] The number of connections made to the registered instances.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average all return 1.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that 100 requests are
sent to the load balancer. There are 60 requests sent to us-west-2a, with each instance receiving 30 requests, and 40 requests sent
to us-west-2b, with each instance receiving 20 requests. With the AvailabilityZone dimension, there is a sum of 60 requests in
us-west-2a and 40 requests in us-west-2b. With the LoadBalancerName dimension, there is a sum of 100 requests.
28
Metric Description
HealthyHostCount The number of healthy instances registered with your load balancer. A newly registered instance is considered healthy after it passes
the first health check. If cross-zone load balancing is enabled, the number of healthy instances for the LoadBalancerName dimension
is calculated across all Availability Zones. Otherwise, it is calculated per Availability Zone.
Reporting criteria: There are registered instances
Statistics: The most useful statistics are Average and Maximum. These statistics are determined by the load balancer nodes. Note
that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is
healthy.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, us-west-2a has 1 unhealthy
instance, and us-west-2b has no unhealthy instances. With the AvailabilityZone dimension, there is an average of 1 healthy and 1
unhealthy instance in us-west-2a, and an average of 2 healthy and 0 unhealthy instances in us-west-2b.
UnHealthyHostCount The number of unhealthy instances registered with your load balancer. An instance is considered unhealthy after it exceeds the
unhealthy threshold configured for health checks. An unhealthy instance is considered healthy again after it meets the healthy
threshold configured for health checks.
Reporting criteria: There are registered instances
Statistics: The most useful statistics are Average and Minimum. These statistics are determined by the load balancer nodes. Note that
some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is
healthy.
Example: See HealthyHostCount.
29
Metric Description
HTTPCode_Backend_2XX,
HTTPCode_Backend_3XX,
HTTPCode_Backend_4XX,
HTTPCode_Backend_5XX
[HTTP listener] The number of HTTP response codes generated by registered instances. This count does not include any response
codes generated by the load balancer.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1
instance in us-west-2a result in HTTP 500 responses. The sum for us-west-2a includes these error responses, while the sum for
us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
HTTPCode_ELB_4XX [HTTP listener] The number of HTTP 4XX client error codes generated by the load balancer. Client errors are generated when a
request is malformed or incomplete.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that client requests include a malformed
request URL. As a result, client errors would likely increase in all Availability Zones. The sum for the load balancer is the sum of the
values for the Availability Zones.
HTTPCode_ELB_5XX [HTTP listener] The number of HTTP 5XX server error codes generated by the load balancer. This count does not include any
response codes generated by the registered instances. The metric is reported if there are no healthy instances registered to the load
balancer, or if the request rate exceeds the capacity of the instances (spillover) or the load balancer.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a
fills and clients receive a 503 error. If us-west-2b continues to respond normally, the sum for the load balancer equals the sum for
us-west-2a.
30
Metric Description
BackendConnectionErrors The number of connections that were not successfully established between the load balancer and the registered instances. Because
the load balancer retries the connection when there are errors, this count can exceed the request rate. Note that this count also
includes any connection errors related to health checks.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are
not typically useful. However, the difference between the minimum and maximum (or peak to average or average to trough) might be
useful to determine whether a load balancer node is an outlier.
Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that attempts to connect
to 1 instance in us-west-2a result in back-end connection errors. The sum for us-west-2a includes these connection errors, while the
sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
31
Metric Description
SpilloverCount The total number of requests that were rejected because the surge queue is full.
[HTTP listener] The load balancer returns an HTTP 503 error code.
[TCP listener] The load balancer closes the connection.
Reporting criteria: There is a nonzero value
Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are
not typically useful.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer node in us-west-2a
fills, resulting in spillover. If us-west-2b continues to respond normally, the sum for the load balancer will be the same as the sum for
us-west-2a.
SurgeQueueLength The total number of requests that are pending routing. The load balancer queues a request if it is unable to establish a connection
with a healthy instance in order to route the request. The maximum size of the queue is 1,024. Additional requests are rejected when
the queue is full. For more information, see SpilloverCount.
Reporting criteria: There is a nonzero value.
Statistics: The most useful statistic is Maximum, because it represents the peak of queued requests. The Average statistic can be
useful in combination with Minimum and Maximum to determine the range of queued requests. Note that Sum is not useful.
Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are
experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a
fills, with clients likely experiencing increased response times. If this continues, the load balancer will likely have spillovers (see the
SpilloverCount metric). If us-west-2b continues to respond normally, the max for the load balancer will be the same as the max for
us-west-2a.
請參閱:Amazon CloudWatch Metrics
and Dimensions Reference
族繁不及備載 ...
32
● EC2
● EBS
● ELB: CLB, ALB, NLB
○ Classic Load Balancing
○ Application Load Balancing
○ Network Load Balancing
需要了解的 Metrics
33
每個指標背後
都有故事可以說。
34
Question and Think:
EC2 / ELB 的指標是怎麼來的?
35
36
CloudWatch Dashboard
拉高視野,看見全局
37
38StarTrek (星艦企業號)
39
Passenger (星艦過客)
40
Passenger (星艦過客)
41
CloudWatch Dashboard
● widget: line, stacked, number, text (markdown)
● auto refresh
● local timezone
○ EC2 metric is UTC
● time range
● Horizontal annotation
● Right / Left Y axis
● full screen (dark / light mode)
● Dashboard 可以 import / export 成 json
● 可以透過 API 自動更新
● $3.00 per dashboard per month (ap-northeast-1)
● Time zone
42
Tips
43
LetSSL - System Level
44
LetSSL - Application Level
Demo: CloudWatch Dashboard
Widgets, X/Y Axis, Annotation
45
46
CloudWatch Alarm
Event-driven, Feedback
47
CloudWatch Alarm
48
● 達到門檻值 (Threshold) 之後觸發的動作
○ 五分鐘之內
○ CPU >= 80%
○ 五次
● 動作類型
○ EC2 actions: reboot, stop, terminate. 通常結合 EC2 System Status 使用。
○ SNS to:
■ SES
■ SQS
■ Lambda
■ HTTP Request
CloudWatch Alarm - Status
49
● ALARM: over threshold
● INSUFFICIENT: INSUFFICIENT DATA
● OK
Demo: CloudWatch Alarm
50
Event-driven → Feedback → Automation
51來源:『自動化XXX』的陷阱
CW Alarm
52
CloudWatch Events
Rules, Cron, Scheduler
53
54
CloudWatch Event
● Event Source
○ Event Pattern
○ Schedule
● Targets
○ Multiple 5 targets (fixed)
○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS …..
55
CloudWatch Events
● Event Source
○ Event Pattern: DynamoDB, EC2, AutoScaling, RDS …. 太多了
○ Schedule
● Targets
○ Multiple 5 targets (fixed)
○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS ….. 太多了
56
常用情境
● EC2 預防性自動化:
○ 不該關機的機器被關機,自動重 啟
○ 機器硬體故障,自動重 啟
○ 狀態改變的行為
● S3 Action 之後
○ Action: PutObject
○ Trigger: Lambda, Put Message to SQS
Demo: CloudWatch Events
57
58
CloudWatch Logs
Filter, Custom Metric, Log Shipper
59
60
EC2 Instances
Log Shipper
Logs
Log Groups
Log Stream A
Log Stream B
Log Stream C
Log Stream N
Alarms
Filters
[ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...]
/var/log/app/*.log
2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0
2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0
2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0
2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0
2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0
2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0
2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0
2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0
Dashboard
Metrics
S3
Amazon ESLambda
SNS Topics
Export
Streaming
Push
Lambda
● 前提:EC2 要安裝 awslogs driver or CloudWatch agent
○ ECS Instance 用選的就可以
● 即時把 Log 傳到 CWL
○ 可以在 CWL 直接 Query Log (堪用)
○ 不用擔心 Storage 會爆炸 or 維護
○ 可以設定 Log Rotation
● 透過 Filter 建立 Custom Metric
○ 可以建立 Dashboard
○ 可以建立 Alarm → Event-driven
■ To Lambda, Slack
■ ETL
■ Automation … 無限可能
CloudWatch Logs (CWL)
61
● 透過取樣 (Sampling) 待測目標得來的資料
○ 單位時間的資料,例如每毫秒、每秒、每分
● 取樣頻率越高,數據越精準
● 聲音的音質 (sample rate per second)
○ CD Quality: 44.1kHz
○ 錄音室錄音:192kHz
● 攝影的解析度 (Resolution)
○ HD
○ Full-HD
○ 4k
指標 (Metric)
62
Demo: CloudWatch Logs
63
64
上述講的東西,都可以 `as Code`
65
Questions
● 怎麼知道系統的狀況?
○ 觀測 (Observe)、量測 (Measure)
● 系統的指標是怎麼來的?
○ 指標是經過系統性測試 (System Test) 後,分析 Log 找出來的
● 系統有哪一些層級要知道?哪些人要知道?
○ Business、Application、OS/Hardware、Network
● 知道之後做什麼?怎麼做?主動、被動?
● 什麼是監、控?
○ 監: Watch
○ 控: Control
66
67
本質性問題
68
什麼是監控?
What is Monitoring?
69
監
70
監 控
71
監 控
72
監 控
Watch
Monitor
Observe
Measure
73
監 控
Watch
Monitor
Observe
Measure
Control
Command
Handle
Manage
74
監 控
Watch
Monitor
Observe
Measure
Control
Command
Handle
Manage
75
Dashboard
(儀表板)
監 控
Watch
Monitor
Observe
Measure
Control
Command
Handle
Manage
76
Dashboard
(儀表板)
Console
(主控台)
Dashboard (儀表板)
77
StarTrek (星艦企業號)
Console (主控台)
78
演唱會 Mixer
79
80
Target Services /
Systems
81
Target Services /
Systems
Watchers
82
Target Services /
Systems
Watchers Controllers
Dashboard => Show Something
● Health Status
● Sum of Biz TX
● Sys Resources
● …
83
Target Services /
Systems
Watchers Controllers
Push or Pull Data
(Observability, Measure)
Dashboard => Show Something
● Health Status
● Sum of Biz TX
● Sys Resources
● …
Push or Pull Data
(Observability, Measure)
84
Target Services /
Systems
Watchers Controllers
Events
(Conditions / Thresholds)
Console => Do Something
● Reset or Clean Cache
● On / Off Functions
● Notification
● ...
Commands
Dashboard => Show Something
● Health Status
● Sum of Biz TX
● Sys Resources
● …
85
Target Services /
Systems
Watchers Controllers
Events
(Conditions / Thresholds)
Console => Do Something
● Reset or Clean Cache
● On / Off Functions
● Notification
● ...
Push or Pull Data
(Observability, Measure)
Commands
Dashboard => Show Something
● Health Status
● Sum of Biz TX
● Sys Resources
● …
86
Target Services /
Systems
Watchers Controllers
Events
(Conditions / Thresholds)
Console => Do Something
● Reset or Clean Cache
● On / Off Functions
● Notification
● ...
Feedback
(Adjust Conditions / Thresholds by ML)
Push or Pull Data
(Observability, Measure)
Commands
Dashboard => Show Something
● Health Status
● Sum of Biz TX
● Sys Resources
● …
87
Target Services /
Systems
Watchers Controllers
Events
(Conditions / Thresholds)
Console => Do Something
● Reset or Clean Cache
● On / Off Functions
● Notification
● ...
Feedback
(Adjust Conditions / Thresholds by ML)
Push or Pull Data
(Observability, Measure)
監
Commands
Dashboard => Show Something
● Health Status
● Sum of Biz TX
● Sys Resources
● …
88
Target Services /
Systems
Watchers Controllers
Events
(Conditions / Thresholds)
Console => Do Something
● Reset or Clean Cache
● On / Off Functions
● Notification
● ...
Feedback
(Adjust Conditions / Thresholds by ML)
Push or Pull Data
(Observability, Measure)
監 控
89
Observability vs Monitoring
● 量測:Measure
● 觀測:Observe
● 氣象局
○ Observability 觀測
○ Measurement 量測
● 政府
○ Monitoring
○ Alert
○ Action
○ Feedback
90
http://www.cwb.gov.tw/V7/observe/satellite/Sat_T.htm?type=1
91
量測 (Measure) → Sample from Log
觀測 (Observe) → Metric
回饋 (Feedback) → Analyze, Condition, Alarm
控制 (Control) → Automation, 躺著幹
無法量測,就無法觀測
無法觀測,則沒有回饋
沒有回饋,就不能控制
92
Log 很重要
沒有結構化的 Log or Data
會付出很多 ETL 的成本與時間
93
Event-driven → Feedback → Automation
94來源:『自動化XXX』的陷阱
CW Alarm
95
Why CloudWatch
● Serverless Monitoring System
● Event-driven
● Programmable and Automation
● Realtime and Backup
● Monitoring Monitoring System at Netflix - 2017/05/22
● CloudWatch 滿足 “Basic Montioring” 的需求
96
97
Source: Microservice Prerequisites
為什麼不選其它監控工具?
● 不想自己蓋機器、養機器
● 監控系統做得再好,都只是成本
● 監控系統不是 Big Data
● 有些 Solution 的架構沒有考慮 HA, ex: Prometheus
98
99
100
Alarm System using Serverless
EC2
CloudWatch Alarms
Operators
CloudWatch Event
(time-based)
SNS-Adapter
Slack-Notifier
SNS Topic
Info, Warning
Info
Developers
Health-Checker
Auto Scaling
SNS Topic
Urgent SMS
Warning
系統架構: CloudWatch + SNS + Lambda + Slack
Testers
● Urgent: SMS, Slack
● Warning: Slack w/ tag
● Info: Slack w/o tag
102
CloudWatch Reporter - System Architecture
CloudWatch
Reporter / Alamer
CloudWatch Event
(time-based)
Info / Alert
Channels
Operators
(值班)
Operators
Developers
(On Call)
Metric Configs
(Namespace, Stats)
Target Services
Loading
maintain
PR
Read
CW Metrics
Schedule
maintain
Developers
development
Feature Request
103
Best Practice
● 盡量活用 Cloud SaaS,像是 AWS CloudWatch, GCP Stackdriver
● 把部署設定過程設計成 Configurable
● 把 Log 設計成結構化格式 (csv or json)
● 利用 Big Data Solution 處理 Log Query 需求,像是 AWS Athena or GCP
BigQuery
● Log 透過 Shipper (awslogs, statsd, collectd, fluentd, telegraf ... ) 同時傳到
○ S3 備份,以符合稽核需求
○ CloudWatch 作為 Debug / 監控需求
● 巨量 Log Streaming 資料需要有 Queue 協助
○ AWS Kinesis
○ GCP Pub/Sub
104
● CloudWatch User Guide
● CloudWatch Events User Guide
● CloudWatch Log User Guide
Reference - User Guide
105
● AWS re:Invent 2015: Log, Monitor and Analyze your IT with Amazon
CloudWatch (DVO315)
● Amazon CloudWatch Update – Percentile Statistics and New Dashboard
Widgets
● New – High-Resolution Custom Metrics and Alarms for Amazon CloudWatch
● 淺談系統監控與 CloudWatch 的應用 - AWS User Group Taiwan
● Study Notes - CloudWatch
● SRE CH6 Monitoring Distributed Systems (監控分散式系統)
● 高品質微服務 - CH6 監控
Reference - Youtube, Blog
106
107
/* End of Slide */

Más contenido relacionado

La actualidad más candente

AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020 AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020 AWSKRUG - AWS한국사용자모임
 
AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저
AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저
AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저Amazon Web Services Korea
 
Full Stack Monitoring with Azure Monitor
Full Stack Monitoring with Azure MonitorFull Stack Monitoring with Azure Monitor
Full Stack Monitoring with Azure MonitorKnoldus Inc.
 
Azure Monitoring Overview
Azure Monitoring OverviewAzure Monitoring Overview
Azure Monitoring Overviewgjuljo
 
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingCloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingAmazon Web Services Korea
 
Microservices Architectures on Amazon Web Services
Microservices Architectures on Amazon Web ServicesMicroservices Architectures on Amazon Web Services
Microservices Architectures on Amazon Web ServicesAmazon Web Services
 
보안 사고 예방을 위한 주요 AWS 모범 사례 – 신은수, AWS 보안 담당 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
보안 사고 예방을 위한 주요 AWS 모범 사례 – 신은수, AWS 보안 담당 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집보안 사고 예방을 위한 주요 AWS 모범 사례 – 신은수, AWS 보안 담당 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
보안 사고 예방을 위한 주요 AWS 모범 사례 – 신은수, AWS 보안 담당 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집Amazon Web Services Korea
 
AWS 빅데이터 아키텍처 패턴 및 모범 사례- AWS Summit Seoul 2017
AWS 빅데이터 아키텍처 패턴 및 모범 사례- AWS Summit Seoul 2017AWS 빅데이터 아키텍처 패턴 및 모범 사례- AWS Summit Seoul 2017
AWS 빅데이터 아키텍처 패턴 및 모범 사례- AWS Summit Seoul 2017Amazon Web Services Korea
 
Deep Dive on Amazon EC2 Systems Manager
Deep Dive on Amazon EC2 Systems ManagerDeep Dive on Amazon EC2 Systems Manager
Deep Dive on Amazon EC2 Systems ManagerAmazon Web Services
 
AWS Security Week: Incident Response
AWS Security Week: Incident ResponseAWS Security Week: Incident Response
AWS Security Week: Incident ResponseAmazon Web Services
 
Amazon CloudWatch Tutorial | AWS Certification | Cloud Monitoring Tools | AWS...
Amazon CloudWatch Tutorial | AWS Certification | Cloud Monitoring Tools | AWS...Amazon CloudWatch Tutorial | AWS Certification | Cloud Monitoring Tools | AWS...
Amazon CloudWatch Tutorial | AWS Certification | Cloud Monitoring Tools | AWS...Edureka!
 

La actualidad más candente (20)

AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020 AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
AWS SAM으로 서버리스 아키텍쳐 운영하기 - 이재면(마이뮤직테이스트) :: AWS Community Day 2020
 
AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저
AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저
AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저
 
Full Stack Monitoring with Azure Monitor
Full Stack Monitoring with Azure MonitorFull Stack Monitoring with Azure Monitor
Full Stack Monitoring with Azure Monitor
 
Azure Monitoring Overview
Azure Monitoring OverviewAzure Monitoring Overview
Azure Monitoring Overview
 
Amazon ECS
Amazon ECSAmazon ECS
Amazon ECS
 
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingCloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
CloudWatch 성능 모니터링과 신속한 대응을 위한 노하우 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
Intro to Amazon ECS
Intro to Amazon ECSIntro to Amazon ECS
Intro to Amazon ECS
 
Microservices Architectures on Amazon Web Services
Microservices Architectures on Amazon Web ServicesMicroservices Architectures on Amazon Web Services
Microservices Architectures on Amazon Web Services
 
보안 사고 예방을 위한 주요 AWS 모범 사례 – 신은수, AWS 보안 담당 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
보안 사고 예방을 위한 주요 AWS 모범 사례 – 신은수, AWS 보안 담당 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집보안 사고 예방을 위한 주요 AWS 모범 사례 – 신은수, AWS 보안 담당 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
보안 사고 예방을 위한 주요 AWS 모범 사례 – 신은수, AWS 보안 담당 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
 
Deep Dive - CI/CD on AWS
Deep Dive - CI/CD on AWSDeep Dive - CI/CD on AWS
Deep Dive - CI/CD on AWS
 
AWS 빅데이터 아키텍처 패턴 및 모범 사례- AWS Summit Seoul 2017
AWS 빅데이터 아키텍처 패턴 및 모범 사례- AWS Summit Seoul 2017AWS 빅데이터 아키텍처 패턴 및 모범 사례- AWS Summit Seoul 2017
AWS 빅데이터 아키텍처 패턴 및 모범 사례- AWS Summit Seoul 2017
 
Deep Dive on Amazon EC2 Systems Manager
Deep Dive on Amazon EC2 Systems ManagerDeep Dive on Amazon EC2 Systems Manager
Deep Dive on Amazon EC2 Systems Manager
 
Databases on AWS Workshop.pdf
Databases on AWS Workshop.pdfDatabases on AWS Workshop.pdf
Databases on AWS Workshop.pdf
 
Introduction to Amazon EKS
Introduction to Amazon EKSIntroduction to Amazon EKS
Introduction to Amazon EKS
 
AWS Cloud trail
AWS Cloud trailAWS Cloud trail
AWS Cloud trail
 
AWS Security Week: Incident Response
AWS Security Week: Incident ResponseAWS Security Week: Incident Response
AWS Security Week: Incident Response
 
Auto Scaling Groups
Auto Scaling GroupsAuto Scaling Groups
Auto Scaling Groups
 
IaC on AWS Cloud
IaC on AWS CloudIaC on AWS Cloud
IaC on AWS Cloud
 
Amazon CloudWatch Tutorial | AWS Certification | Cloud Monitoring Tools | AWS...
Amazon CloudWatch Tutorial | AWS Certification | Cloud Monitoring Tools | AWS...Amazon CloudWatch Tutorial | AWS Certification | Cloud Monitoring Tools | AWS...
Amazon CloudWatch Tutorial | AWS Certification | Cloud Monitoring Tools | AWS...
 
AWS Cloud Watch
AWS Cloud WatchAWS Cloud Watch
AWS Cloud Watch
 

Similar a Amazon CloudWatch - Observability and Monitoring

Sql server lesson12
Sql server lesson12Sql server lesson12
Sql server lesson12Ala Qunaibi
 
Sql server lesson12
Sql server lesson12Sql server lesson12
Sql server lesson12Ala Qunaibi
 
Monitoring - deeper dive
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper diveRobert Kubiś
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeDataWorks Summit
 
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...Amazon Web Services
 
Salesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaSalesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaThomas Alex
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLSeveralnines
 
Kks sre book_ch10
Kks sre book_ch10Kks sre book_ch10
Kks sre book_ch10Chris Huang
 
ENT203 Monitoring and Autoscaling, a Match Made in Heaven
ENT203 Monitoring and Autoscaling, a Match Made in HeavenENT203 Monitoring and Autoscaling, a Match Made in Heaven
ENT203 Monitoring and Autoscaling, a Match Made in HeavenAmazon Web Services
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent MonitoringIntelie
 
Cloudwatch - The In's and Out's
Cloudwatch - The In's and Out'sCloudwatch - The In's and Out's
Cloudwatch - The In's and Out'sbeaknit
 
Performance monitoring for Docker - Lucerne meetup
Performance monitoring for Docker - Lucerne meetupPerformance monitoring for Docker - Lucerne meetup
Performance monitoring for Docker - Lucerne meetupStijn Polfliet
 
A Framework for Performance Analysis of Computing Clouds
A Framework for Performance Analysis of Computing CloudsA Framework for Performance Analysis of Computing Clouds
A Framework for Performance Analysis of Computing Cloudsijsrd.com
 
SplunkLive! Zurich 2018: Splunk for Security at Swisscom CSIRT
SplunkLive! Zurich 2018: Splunk for Security at Swisscom CSIRTSplunkLive! Zurich 2018: Splunk for Security at Swisscom CSIRT
SplunkLive! Zurich 2018: Splunk for Security at Swisscom CSIRTSplunk
 
The present and future of serverless observability
The present and future of serverless observabilityThe present and future of serverless observability
The present and future of serverless observabilityYan Cui
 
Iwsm2014 performance measurement for cloud computing applications using iso...
Iwsm2014   performance measurement for cloud computing applications using iso...Iwsm2014   performance measurement for cloud computing applications using iso...
Iwsm2014 performance measurement for cloud computing applications using iso...Nesma
 
Webinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computingWebinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computingCREATE-NET
 
#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitterTwitter Developers
 
Part 1: DRS and DPM Implementation in Virtualized Environment, Part 2: Large ...
Part 1: DRS and DPM Implementation in Virtualized Environment, Part 2: Large ...Part 1: DRS and DPM Implementation in Virtualized Environment, Part 2: Large ...
Part 1: DRS and DPM Implementation in Virtualized Environment, Part 2: Large ...Akshay Wattal
 

Similar a Amazon CloudWatch - Observability and Monitoring (20)

Sql server lesson12
Sql server lesson12Sql server lesson12
Sql server lesson12
 
Sql server lesson12
Sql server lesson12Sql server lesson12
Sql server lesson12
 
Monitoring - deeper dive
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper dive
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
 
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
 
Salesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaSalesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafka
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
 
Kks sre book_ch10
Kks sre book_ch10Kks sre book_ch10
Kks sre book_ch10
 
ENT203 Monitoring and Autoscaling, a Match Made in Heaven
ENT203 Monitoring and Autoscaling, a Match Made in HeavenENT203 Monitoring and Autoscaling, a Match Made in Heaven
ENT203 Monitoring and Autoscaling, a Match Made in Heaven
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
Cloudwatch - The In's and Out's
Cloudwatch - The In's and Out'sCloudwatch - The In's and Out's
Cloudwatch - The In's and Out's
 
Stream Processing Overview
Stream Processing OverviewStream Processing Overview
Stream Processing Overview
 
Performance monitoring for Docker - Lucerne meetup
Performance monitoring for Docker - Lucerne meetupPerformance monitoring for Docker - Lucerne meetup
Performance monitoring for Docker - Lucerne meetup
 
A Framework for Performance Analysis of Computing Clouds
A Framework for Performance Analysis of Computing CloudsA Framework for Performance Analysis of Computing Clouds
A Framework for Performance Analysis of Computing Clouds
 
SplunkLive! Zurich 2018: Splunk for Security at Swisscom CSIRT
SplunkLive! Zurich 2018: Splunk for Security at Swisscom CSIRTSplunkLive! Zurich 2018: Splunk for Security at Swisscom CSIRT
SplunkLive! Zurich 2018: Splunk for Security at Swisscom CSIRT
 
The present and future of serverless observability
The present and future of serverless observabilityThe present and future of serverless observability
The present and future of serverless observability
 
Iwsm2014 performance measurement for cloud computing applications using iso...
Iwsm2014   performance measurement for cloud computing applications using iso...Iwsm2014   performance measurement for cloud computing applications using iso...
Iwsm2014 performance measurement for cloud computing applications using iso...
 
Webinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computingWebinar Monitoring in era of cloud computing
Webinar Monitoring in era of cloud computing
 
#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter#TwitterRealTime - Real time processing @twitter
#TwitterRealTime - Real time processing @twitter
 
Part 1: DRS and DPM Implementation in Virtualized Environment, Part 2: Large ...
Part 1: DRS and DPM Implementation in Virtualized Environment, Part 2: Large ...Part 1: DRS and DPM Implementation in Virtualized Environment, Part 2: Large ...
Part 1: DRS and DPM Implementation in Virtualized Environment, Part 2: Large ...
 

Más de Rick Hwang

在生命轉彎的地方 - 從軟體開發職涯,探索人生
在生命轉彎的地方 - 從軟體開發職涯,探索人生在生命轉彎的地方 - 從軟體開發職涯,探索人生
在生命轉彎的地方 - 從軟體開發職涯,探索人生Rick Hwang
 
20230829 - 探索職涯,複利人生
20230829 - 探索職涯,複利人生20230829 - 探索職涯,複利人生
20230829 - 探索職涯,複利人生Rick Hwang
 
2023 08 - SRE 實踐與開發平台指南 - 書友見面會
2023 08 - SRE 實踐與開發平台指南 - 書友見面會2023 08 - SRE 實踐與開發平台指南 - 書友見面會
2023 08 - SRE 實踐與開發平台指南 - 書友見面會Rick Hwang
 
20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)
20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)
20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)Rick Hwang
 
軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大
軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大
軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大Rick Hwang
 
CH02 API Governance
CH02 API Governance CH02 API Governance
CH02 API Governance Rick Hwang
 
Chapter 8. Partial updates and retrievals.pdf
Chapter 8. Partial updates and retrievals.pdfChapter 8. Partial updates and retrievals.pdf
Chapter 8. Partial updates and retrievals.pdfRick Hwang
 
Ch09 Custom Methods
Ch09 Custom MethodsCh09 Custom Methods
Ch09 Custom MethodsRick Hwang
 
AWS Career Exploration Day
AWS Career Exploration DayAWS Career Exploration Day
AWS Career Exploration DayRick Hwang
 
從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)
從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)
從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)Rick Hwang
 
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路Rick Hwang
 
導讀持續交付 2.0 - CH02 價值探索環
導讀持續交付 2.0 - CH02 價值探索環 導讀持續交付 2.0 - CH02 價值探索環
導讀持續交付 2.0 - CH02 價值探索環 Rick Hwang
 
2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構
2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構
2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構Rick Hwang
 
災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)
災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)
災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)Rick Hwang
 
Software Development Process v1.5 - 20121214
Software Development Process v1.5 - 20121214Software Development Process v1.5 - 20121214
Software Development Process v1.5 - 20121214Rick Hwang
 
第三章 建立良好的人際關係網路
第三章 建立良好的人際關係網路第三章 建立良好的人際關係網路
第三章 建立良好的人際關係網路Rick Hwang
 
Wiki in Teamroom - Connected Mind
Wiki in Teamroom - Connected MindWiki in Teamroom - Connected Mind
Wiki in Teamroom - Connected MindRick Hwang
 
導讀持續交付 2.0 - 談當代軟體交付之虛實融合
導讀持續交付 2.0 - 談當代軟體交付之虛實融合導讀持續交付 2.0 - 談當代軟體交付之虛實融合
導讀持續交付 2.0 - 談當代軟體交付之虛實融合Rick Hwang
 
Study Notes - Event-Driven Data Management for Microservices
Study Notes - Event-Driven Data Management for MicroservicesStudy Notes - Event-Driven Data Management for Microservices
Study Notes - Event-Driven Data Management for MicroservicesRick Hwang
 
Study Notes - Using an API Gateway
Study Notes - Using an API GatewayStudy Notes - Using an API Gateway
Study Notes - Using an API GatewayRick Hwang
 

Más de Rick Hwang (20)

在生命轉彎的地方 - 從軟體開發職涯,探索人生
在生命轉彎的地方 - 從軟體開發職涯,探索人生在生命轉彎的地方 - 從軟體開發職涯,探索人生
在生命轉彎的地方 - 從軟體開發職涯,探索人生
 
20230829 - 探索職涯,複利人生
20230829 - 探索職涯,複利人生20230829 - 探索職涯,複利人生
20230829 - 探索職涯,複利人生
 
2023 08 - SRE 實踐與開發平台指南 - 書友見面會
2023 08 - SRE 實踐與開發平台指南 - 書友見面會2023 08 - SRE 實踐與開發平台指南 - 書友見面會
2023 08 - SRE 實踐與開發平台指南 - 書友見面會
 
20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)
20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)
20230215 - 凝聚團隊共識的溝通方法 (Effective Team Communication)
 
軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大
軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大
軟體測試實務新書發表會 - 從品質與測試,讓軟體再次偉大
 
CH02 API Governance
CH02 API Governance CH02 API Governance
CH02 API Governance
 
Chapter 8. Partial updates and retrievals.pdf
Chapter 8. Partial updates and retrievals.pdfChapter 8. Partial updates and retrievals.pdf
Chapter 8. Partial updates and retrievals.pdf
 
Ch09 Custom Methods
Ch09 Custom MethodsCh09 Custom Methods
Ch09 Custom Methods
 
AWS Career Exploration Day
AWS Career Exploration DayAWS Career Exploration Day
AWS Career Exploration Day
 
從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)
從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)
從理想、到現實的距離,開啟品味軟體測試之路 - 台灣軟體工程協會 (20220813)
 
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
SRE Conf 2022 - 91APP 在 AWS 上的 SRE 實踐之路
 
導讀持續交付 2.0 - CH02 價值探索環
導讀持續交付 2.0 - CH02 價值探索環 導讀持續交付 2.0 - CH02 價值探索環
導讀持續交付 2.0 - CH02 價值探索環
 
2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構
2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構
2020 AWS Summit - 如何有效管理 AWS 的成本結構與系統架構
 
災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)
災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)
災難演練 @ AWS 實戰分享 (Using AWS for Disaster Recovery)
 
Software Development Process v1.5 - 20121214
Software Development Process v1.5 - 20121214Software Development Process v1.5 - 20121214
Software Development Process v1.5 - 20121214
 
第三章 建立良好的人際關係網路
第三章 建立良好的人際關係網路第三章 建立良好的人際關係網路
第三章 建立良好的人際關係網路
 
Wiki in Teamroom - Connected Mind
Wiki in Teamroom - Connected MindWiki in Teamroom - Connected Mind
Wiki in Teamroom - Connected Mind
 
導讀持續交付 2.0 - 談當代軟體交付之虛實融合
導讀持續交付 2.0 - 談當代軟體交付之虛實融合導讀持續交付 2.0 - 談當代軟體交付之虛實融合
導讀持續交付 2.0 - 談當代軟體交付之虛實融合
 
Study Notes - Event-Driven Data Management for Microservices
Study Notes - Event-Driven Data Management for MicroservicesStudy Notes - Event-Driven Data Management for Microservices
Study Notes - Event-Driven Data Management for Microservices
 
Study Notes - Using an API Gateway
Study Notes - Using an API GatewayStudy Notes - Using an API Gateway
Study Notes - Using an API Gateway
 

Último

Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksMagic Marks
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...Health
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdfKamal Acharya
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxmaisarahman1
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...HenryBriggs2
 
Bridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxBridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxnuruddin69
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadhamedmustafa094
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesMayuraD1
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersMairaAshraf6
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network DevicesChandrakantDivate1
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
 

Último (20)

Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
Bridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxBridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptx
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 

Amazon CloudWatch - Observability and Monitoring

  • 1. 1
  • 2. AWS CloudWatch Observability and Monitoring 2 Rick Hwang rick_kyhwang@hotmail.com 2017/12/28
  • 5. 5
  • 7. Agenda ● CloudWatch Metric ● CloudWatch Dashboard ● CloudWatch Alarm ● CloudWatch Event / Rules ● CloudWatch Logs 7
  • 8. ● SNS: Simple Notification Service ● SES: Simple Email Service ● SQS: Simple Queue Service ● Lambda: Serverless ● Auto Scaling ● CloudTrail Related AWS Services 8
  • 9. Questions ● 怎麼知道系統的狀況? ● 系統的指標是怎麼來的? ● 系統有哪一些層級要知道?哪些人要知道?怎麼知道? ● 知道之後做什麼?怎麼做?主動、被動? ● 什麼是監、控? 9
  • 10. How Amazon CloudWatch Works CloudWatch Basic Concepts 10
  • 11. 11 EC2 Instances Log Shipper Logs Log Groups Log Stream A Log Stream B Log Stream C Log Stream N Alarms Filters [ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...] /var/log/app/*.log 2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0 2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0 2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0 2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0 2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0 2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0 2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0 2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0 2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0 Dashboard Metrics S3 Amazon ESLambda SNS Topics Export Streaming Push Lambda
  • 12. 12 出處:AWS Summit 2016: Big Data Architectural Patterns and Best Practices
  • 13. Key Points 13 ● 產生結構化、有意義的 Log ○ 結構化: csv, json ○ 有意義: 可統計的資料 → sum, max, min, average, count … ○ 可以下 SQL ● 想想系統上線後需要知道什麼?這些東西哪裡來? ● 盡可能不要動用到 ETL (Extract, Transform, Load) ○ 成本很高、浪費 ○ 維護成本 ○ 溝通成本
  • 14. 14
  • 17. Metric - CPU Utilization 17 UTC
  • 18. CloudWatch Metric 18 ● Period: 每次取樣的時間週期 ○ EC2 預設為 5m (Free), 可以調整為 1m (另外計費) ○ ELB 預設為 1m ○ Custom metirc supports high resolution: 1s ● Statistics: 統計方式,不同指標有預設的方式 ○ Sum ○ Average ○ Max ○ Min ○ Sample Count ● Unit: 單位 ○ Percent ○ Count ○ Bytes
  • 19. Wikipedia: 長尾 Statistics - Long Tail 19Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets
  • 20. Metric Types ● Metrics Provided by AWS ● Custom Metric ○ 透過 AWS CLI / SDK 上傳取樣資料 (json) → 不好做,容易出錯 ○ 透過 awslogs or CloudWatch Agent (New) 上傳到 CloudWatch Logs,自訂 Filter 產生 Metric ■ 流程長,但是不難做 ■ 推薦這個做法 20
  • 22. 22 Metric Description CPUUtilization The percentage of allocated EC2 compute units that are currently in use on the instance. This metric identifies the processing power required to run an application upon a selected instance. To use the percentiles statistic, you must enable detailed monitoring. Depending on the instance type, tools in your operating system can show a lower percentage than CloudWatch when the instance is not allocated a full processor core. Units: Percent DiskReadOps Completed read operations from all instance store volumes available to the instance in a specified period of time. To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period. Units: Count DiskWriteOps Completed write operations to all instance store volumes available to the instance in a specified period of time. To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period. Units: Count
  • 23. 23 Metric Description DiskReadBytes Bytes read from all instance store volumes available to the instance. This metric is used to determine the volume of the data the application reads from the hard disk of the instance. This can be used to determine the speed of the application. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes DiskWriteBytes Bytes written to all instance store volumes available to the instance. This metric is used to determine the volume of the data the application writes onto the hard disk of the instance. This can be used to determine the speed of the application. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes
  • 24. 24 Metric Description NetworkIn The number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to a single instance. The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes NetworkOut The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance. The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60. Units: Bytes NetworkPacketsIn The number of packets received on all network interfaces by the instance. This metric identifies the volume of incoming traffic in terms of the number of packets on a single instance. This metric is available for basic monitoring only. Units: Count Statistics: Minimum, Maximum, Average NetworkPacketsOut The number of packets sent out on all network interfaces by the instance. This metric identifies the volume of outgoing traffic in terms of the number of packets on a single instance. This metric is available for basic monitoring only. Units: Count Statistics: Minimum, Maximum, Average
  • 25. EC2 Metrics ● 預設 Period = 5min (Free) ○ Detail Monitoring: period = 1min ($$) ● memory, disk 不支援,需要透過其他方式 ○ CloudWatch Agent (201712 release) ○ telegraf, collectd, cacti, nagios …. 25
  • 26. ELB Metrics 負載平衡 26 Elastic Load Balancing Metrics and Dimensions
  • 27. 27 Metric Description Latency [HTTP listener] The total time elapsed, in seconds, from the time the load balancer sent the request to a registered instance until the instance started to send the response headers. [TCP listener] The total time elapsed, in seconds, for the load balancer to successfully establish a connection to a registered instance. Reporting criteria: There is a nonzero value Statistics: The most useful statistic is Average. Use Maximum to determine whether some requests are taking substantially longer than the average. Note that Minimum is typically not useful. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1 instance in us-west-2a have a higher latency. The average for us-west-2a has a higher value than the average for us-west-2b. RequestCount The number of requests completed or connections made during the specified interval (1 or 5 minutes). [HTTP listener] The number of requests received and routed, including HTTP error responses from the registered instances. [TCP listener] The number of connections made to the registered instances. Reporting criteria: There is a nonzero value Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average all return 1. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that 100 requests are sent to the load balancer. There are 60 requests sent to us-west-2a, with each instance receiving 30 requests, and 40 requests sent to us-west-2b, with each instance receiving 20 requests. With the AvailabilityZone dimension, there is a sum of 60 requests in us-west-2a and 40 requests in us-west-2b. With the LoadBalancerName dimension, there is a sum of 100 requests.
  • 28. 28 Metric Description HealthyHostCount The number of healthy instances registered with your load balancer. A newly registered instance is considered healthy after it passes the first health check. If cross-zone load balancing is enabled, the number of healthy instances for the LoadBalancerName dimension is calculated across all Availability Zones. Otherwise, it is calculated per Availability Zone. Reporting criteria: There are registered instances Statistics: The most useful statistics are Average and Maximum. These statistics are determined by the load balancer nodes. Note that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is healthy. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, us-west-2a has 1 unhealthy instance, and us-west-2b has no unhealthy instances. With the AvailabilityZone dimension, there is an average of 1 healthy and 1 unhealthy instance in us-west-2a, and an average of 2 healthy and 0 unhealthy instances in us-west-2b. UnHealthyHostCount The number of unhealthy instances registered with your load balancer. An instance is considered unhealthy after it exceeds the unhealthy threshold configured for health checks. An unhealthy instance is considered healthy again after it meets the healthy threshold configured for health checks. Reporting criteria: There are registered instances Statistics: The most useful statistics are Average and Minimum. These statistics are determined by the load balancer nodes. Note that some load balancer nodes might determine that an instance is unhealthy for a brief period while other nodes determine that it is healthy. Example: See HealthyHostCount.
  • 29. 29 Metric Description HTTPCode_Backend_2XX, HTTPCode_Backend_3XX, HTTPCode_Backend_4XX, HTTPCode_Backend_5XX [HTTP listener] The number of HTTP response codes generated by registered instances. This count does not include any response codes generated by the load balancer. Reporting criteria: There is a nonzero value Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that requests sent to 1 instance in us-west-2a result in HTTP 500 responses. The sum for us-west-2a includes these error responses, while the sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a. HTTPCode_ELB_4XX [HTTP listener] The number of HTTP 4XX client error codes generated by the load balancer. Client errors are generated when a request is malformed or incomplete. Reporting criteria: There is a nonzero value Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that client requests include a malformed request URL. As a result, client errors would likely increase in all Availability Zones. The sum for the load balancer is the sum of the values for the Availability Zones. HTTPCode_ELB_5XX [HTTP listener] The number of HTTP 5XX server error codes generated by the load balancer. This count does not include any response codes generated by the registered instances. The metric is reported if there are no healthy instances registered to the load balancer, or if the request rate exceeds the capacity of the instances (spillover) or the load balancer. Reporting criteria: There is a nonzero value Statistics: The most useful statistic is Sum. Note that Minimum, Maximum, and Average are all 1. Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a fills and clients receive a 503 error. If us-west-2b continues to respond normally, the sum for the load balancer equals the sum for us-west-2a.
  • 30. 30 Metric Description BackendConnectionErrors The number of connections that were not successfully established between the load balancer and the registered instances. Because the load balancer retries the connection when there are errors, this count can exceed the request rate. Note that this count also includes any connection errors related to health checks. Reporting criteria: There is a nonzero value Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are not typically useful. However, the difference between the minimum and maximum (or peak to average or average to trough) might be useful to determine whether a load balancer node is an outlier. Example: Suppose that your load balancer has 2 instances in us-west-2a and 2 instances in us-west-2b, and that attempts to connect to 1 instance in us-west-2a result in back-end connection errors. The sum for us-west-2a includes these connection errors, while the sum for us-west-2b does not include them. Therefore, the sum for the load balancer equals the sum for us-west-2a.
  • 31. 31 Metric Description SpilloverCount The total number of requests that were rejected because the surge queue is full. [HTTP listener] The load balancer returns an HTTP 503 error code. [TCP listener] The load balancer closes the connection. Reporting criteria: There is a nonzero value Statistics: The most useful statistic is Sum. Note that Average, Minimum, and Maximum are reported per load balancer node and are not typically useful. Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer node in us-west-2a fills, resulting in spillover. If us-west-2b continues to respond normally, the sum for the load balancer will be the same as the sum for us-west-2a. SurgeQueueLength The total number of requests that are pending routing. The load balancer queues a request if it is unable to establish a connection with a healthy instance in order to route the request. The maximum size of the queue is 1,024. Additional requests are rejected when the queue is full. For more information, see SpilloverCount. Reporting criteria: There is a nonzero value. Statistics: The most useful statistic is Maximum, because it represents the peak of queued requests. The Average statistic can be useful in combination with Minimum and Maximum to determine the range of queued requests. Note that Sum is not useful. Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled, and that instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the surge queue for the load balancer nodes in us-west-2a fills, with clients likely experiencing increased response times. If this continues, the load balancer will likely have spillovers (see the SpilloverCount metric). If us-west-2b continues to respond normally, the max for the load balancer will be the same as the max for us-west-2a.
  • 32. 請參閱:Amazon CloudWatch Metrics and Dimensions Reference 族繁不及備載 ... 32
  • 33. ● EC2 ● EBS ● ELB: CLB, ALB, NLB ○ Classic Load Balancing ○ Application Load Balancing ○ Network Load Balancing 需要了解的 Metrics 33
  • 35. Question and Think: EC2 / ELB 的指標是怎麼來的? 35
  • 36. 36
  • 41. 41 CloudWatch Dashboard ● widget: line, stacked, number, text (markdown) ● auto refresh ● local timezone ○ EC2 metric is UTC ● time range ● Horizontal annotation ● Right / Left Y axis ● full screen (dark / light mode)
  • 42. ● Dashboard 可以 import / export 成 json ● 可以透過 API 自動更新 ● $3.00 per dashboard per month (ap-northeast-1) ● Time zone 42 Tips
  • 45. Demo: CloudWatch Dashboard Widgets, X/Y Axis, Annotation 45
  • 46. 46
  • 48. CloudWatch Alarm 48 ● 達到門檻值 (Threshold) 之後觸發的動作 ○ 五分鐘之內 ○ CPU >= 80% ○ 五次 ● 動作類型 ○ EC2 actions: reboot, stop, terminate. 通常結合 EC2 System Status 使用。 ○ SNS to: ■ SES ■ SQS ■ Lambda ■ HTTP Request
  • 49. CloudWatch Alarm - Status 49 ● ALARM: over threshold ● INSUFFICIENT: INSUFFICIENT DATA ● OK
  • 51. Event-driven → Feedback → Automation 51來源:『自動化XXX』的陷阱 CW Alarm
  • 52. 52
  • 54. 54 CloudWatch Event ● Event Source ○ Event Pattern ○ Schedule ● Targets ○ Multiple 5 targets (fixed) ○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS …..
  • 55. 55 CloudWatch Events ● Event Source ○ Event Pattern: DynamoDB, EC2, AutoScaling, RDS …. 太多了 ○ Schedule ● Targets ○ Multiple 5 targets (fixed) ○ Type: Lambda, EC2, Stream, ECS, SSM, Step Function, Pipeline, SNS, SQS ….. 太多了
  • 56. 56 常用情境 ● EC2 預防性自動化: ○ 不該關機的機器被關機,自動重 啟 ○ 機器硬體故障,自動重 啟 ○ 狀態改變的行為 ● S3 Action 之後 ○ Action: PutObject ○ Trigger: Lambda, Put Message to SQS
  • 58. 58
  • 59. CloudWatch Logs Filter, Custom Metric, Log Shipper 59
  • 60. 60 EC2 Instances Log Shipper Logs Log Groups Log Stream A Log Stream B Log Stream C Log Stream N Alarms Filters [ts, hostname, scope=NGX, tcp_all, tcp_time_wait, tcp_established, ...] /var/log/app/*.log 2017-06-11T08:45:01 app1 NGX 47 0 47 0 0 0 2017-06-11T08:45:01 app2 NGX 52 0 52 0 0 0 2017-06-11T08:46:01 app1 NGX 53 0 52 0 0 0 2017-06-11T08:46:01 app2 NGX 52 0 51 0 0 0 2017-06-11T08:47:01 app1 NGX 53 0 53 0 0 0 2017-06-11T08:47:01 app2 NGX 53 0 53 0 0 0 2017-06-11T08:48:01 app1 NGX 59 0 59 0 0 0 2017-06-11T08:48:01 app2 NGX 52 0 51 0 0 0 2017-06-11T08:49:01 app1 NGX 48 0 48 0 0 0 Dashboard Metrics S3 Amazon ESLambda SNS Topics Export Streaming Push Lambda
  • 61. ● 前提:EC2 要安裝 awslogs driver or CloudWatch agent ○ ECS Instance 用選的就可以 ● 即時把 Log 傳到 CWL ○ 可以在 CWL 直接 Query Log (堪用) ○ 不用擔心 Storage 會爆炸 or 維護 ○ 可以設定 Log Rotation ● 透過 Filter 建立 Custom Metric ○ 可以建立 Dashboard ○ 可以建立 Alarm → Event-driven ■ To Lambda, Slack ■ ETL ■ Automation … 無限可能 CloudWatch Logs (CWL) 61
  • 62. ● 透過取樣 (Sampling) 待測目標得來的資料 ○ 單位時間的資料,例如每毫秒、每秒、每分 ● 取樣頻率越高,數據越精準 ● 聲音的音質 (sample rate per second) ○ CD Quality: 44.1kHz ○ 錄音室錄音:192kHz ● 攝影的解析度 (Resolution) ○ HD ○ Full-HD ○ 4k 指標 (Metric) 62
  • 64. 64
  • 66. Questions ● 怎麼知道系統的狀況? ○ 觀測 (Observe)、量測 (Measure) ● 系統的指標是怎麼來的? ○ 指標是經過系統性測試 (System Test) 後,分析 Log 找出來的 ● 系統有哪一些層級要知道?哪些人要知道? ○ Business、Application、OS/Hardware、Network ● 知道之後做什麼?怎麼做?主動、被動? ● 什麼是監、控? ○ 監: Watch ○ 控: Control 66
  • 67. 67
  • 79. 79
  • 83. Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … 83 Target Services / Systems Watchers Controllers Push or Pull Data (Observability, Measure)
  • 84. Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … Push or Pull Data (Observability, Measure) 84 Target Services / Systems Watchers Controllers Events (Conditions / Thresholds) Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ...
  • 85. Commands Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … 85 Target Services / Systems Watchers Controllers Events (Conditions / Thresholds) Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ... Push or Pull Data (Observability, Measure)
  • 86. Commands Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … 86 Target Services / Systems Watchers Controllers Events (Conditions / Thresholds) Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ... Feedback (Adjust Conditions / Thresholds by ML) Push or Pull Data (Observability, Measure)
  • 87. Commands Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … 87 Target Services / Systems Watchers Controllers Events (Conditions / Thresholds) Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ... Feedback (Adjust Conditions / Thresholds by ML) Push or Pull Data (Observability, Measure) 監
  • 88. Commands Dashboard => Show Something ● Health Status ● Sum of Biz TX ● Sys Resources ● … 88 Target Services / Systems Watchers Controllers Events (Conditions / Thresholds) Console => Do Something ● Reset or Clean Cache ● On / Off Functions ● Notification ● ... Feedback (Adjust Conditions / Thresholds by ML) Push or Pull Data (Observability, Measure) 監 控
  • 89. 89 Observability vs Monitoring ● 量測:Measure ● 觀測:Observe ● 氣象局 ○ Observability 觀測 ○ Measurement 量測 ● 政府 ○ Monitoring ○ Alert ○ Action ○ Feedback
  • 91. 91 量測 (Measure) → Sample from Log 觀測 (Observe) → Metric 回饋 (Feedback) → Analyze, Condition, Alarm 控制 (Control) → Automation, 躺著幹
  • 93. Log 很重要 沒有結構化的 Log or Data 會付出很多 ETL 的成本與時間 93
  • 94. Event-driven → Feedback → Automation 94來源:『自動化XXX』的陷阱 CW Alarm
  • 95. 95
  • 96. Why CloudWatch ● Serverless Monitoring System ● Event-driven ● Programmable and Automation ● Realtime and Backup ● Monitoring Monitoring System at Netflix - 2017/05/22 ● CloudWatch 滿足 “Basic Montioring” 的需求 96
  • 98. 為什麼不選其它監控工具? ● 不想自己蓋機器、養機器 ● 監控系統做得再好,都只是成本 ● 監控系統不是 Big Data ● 有些 Solution 的架構沒有考慮 HA, ex: Prometheus 98
  • 99. 99
  • 100. 100 Alarm System using Serverless
  • 101. EC2 CloudWatch Alarms Operators CloudWatch Event (time-based) SNS-Adapter Slack-Notifier SNS Topic Info, Warning Info Developers Health-Checker Auto Scaling SNS Topic Urgent SMS Warning 系統架構: CloudWatch + SNS + Lambda + Slack Testers ● Urgent: SMS, Slack ● Warning: Slack w/ tag ● Info: Slack w/o tag
  • 102. 102 CloudWatch Reporter - System Architecture CloudWatch Reporter / Alamer CloudWatch Event (time-based) Info / Alert Channels Operators (值班) Operators Developers (On Call) Metric Configs (Namespace, Stats) Target Services Loading maintain PR Read CW Metrics Schedule maintain Developers development Feature Request
  • 103. 103
  • 104. Best Practice ● 盡量活用 Cloud SaaS,像是 AWS CloudWatch, GCP Stackdriver ● 把部署設定過程設計成 Configurable ● 把 Log 設計成結構化格式 (csv or json) ● 利用 Big Data Solution 處理 Log Query 需求,像是 AWS Athena or GCP BigQuery ● Log 透過 Shipper (awslogs, statsd, collectd, fluentd, telegraf ... ) 同時傳到 ○ S3 備份,以符合稽核需求 ○ CloudWatch 作為 Debug / 監控需求 ● 巨量 Log Streaming 資料需要有 Queue 協助 ○ AWS Kinesis ○ GCP Pub/Sub 104
  • 105. ● CloudWatch User Guide ● CloudWatch Events User Guide ● CloudWatch Log User Guide Reference - User Guide 105
  • 106. ● AWS re:Invent 2015: Log, Monitor and Analyze your IT with Amazon CloudWatch (DVO315) ● Amazon CloudWatch Update – Percentile Statistics and New Dashboard Widgets ● New – High-Resolution Custom Metrics and Alarms for Amazon CloudWatch ● 淺談系統監控與 CloudWatch 的應用 - AWS User Group Taiwan ● Study Notes - CloudWatch ● SRE CH6 Monitoring Distributed Systems (監控分散式系統) ● 高品質微服務 - CH6 監控 Reference - Youtube, Blog 106
  • 107. 107 /* End of Slide */