9. • EVERY engineer MUST care about the status of their
applications
• EVERY engineer MUST take part in the on-call rotation
• NO "application engineers" who only write code
• We have a dedicated team that gives them stable tools
to take the best possible care of their applications' status
CULTURE
15. • Metrics
• The simplest form is a triple
• (name, value, timestamp)
• Can be represented as a graph
METRICS
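As a minimal sketch of the triple described on this slide, the Python below models one data point and turns a set of points into the time-ordered series a graph would plot. The names `Metric` and `as_series`, and the use of Unix seconds for the timestamp, are assumptions made for illustration; the slide does not specify them.

```python
import time
from typing import NamedTuple


class Metric(NamedTuple):
    """One data point: the (name, value, timestamp) triple."""
    name: str
    value: float
    timestamp: float  # Unix seconds (an assumption; the slide names no unit)


def as_series(points, name):
    """Return the (timestamp, value) pairs for one metric name, ordered by time.

    This ordered list is what a graph of the metric would plot.
    """
    return sorted((p.timestamp, p.value) for p in points if p.name == name)


now = time.time()
points = [
    Metric("queue.size", 12.0, now),
    Metric("queue.size", 7.0, now - 60),
    Metric("cpu.user", 0.42, now),
]
series = as_series(points, "queue.size")
```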
16. • System Metrics
• CPU/Disk IO/Network/DiskUsage...
• MUST: have alerts for critical metrics by default (users
don't know what to monitor, or what a good threshold
is)
• Application Metrics
• Internal queue size, endpoint latency percentiles (p50,
p95, p99), request size, request count
METRICS
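To make the p50/p95/p99 notation concrete, here is a small sketch that computes percentiles with the nearest-rank method. Nearest-rank is one common definition chosen for this example; the slides do not say which definition or library is used in practice, and the sample data is made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # made-up sample: 1..100 ms
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

The high percentiles are the ones that expose tail latency: a handful of slow requests barely moves p50 but shows up clearly in p99.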
17. • At LINE we care A LOT about Application Metrics
• We try to instrument every newly added piece of logic
• Some of our heaviest servers export over 10,000
metrics per server
METRICS
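One possible way to instrument newly added logic cheaply is a decorator that records call count, error count, and latency. This is only a sketch: the `instrumented` decorator, the in-process `counters` and `timings_ms` registries, and the `parse_message` function are all hypothetical and are not LINE's actual instrumentation API.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process registry; a real agent would scrape or flush it.
counters = defaultdict(int)
timings_ms = defaultdict(list)


def instrumented(name):
    """Count calls and errors, and record latency, for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            counters[f"{name}.count"] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                counters[f"{name}.errors"] += 1
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                timings_ms[f"{name}.latency_ms"].append(elapsed_ms)
        return wrapper
    return decorator


@instrumented("parse_message")
def parse_message(raw):
    """Example application logic: split 'key: value' into its two parts."""
    return raw.strip().split(":", 1)


parse_message("greeting: hello")
```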
20. • At LINE, all error / warning logs MUST be
• Stored permanently (for troubleshooting later)
• Used for alerting
• Easy to query (you should not have to go to each
host and grep the access log)
LOGGING
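A sketch of how an application could emit only warning and error logs in a structured form that a central store can index and query. It uses Python's standard `logging` module; the JSON field names, the `payment-api` logger name, and the in-memory stream standing in for a log shipper are assumptions for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "ts": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


stream = io.StringIO()  # stand-in for a shipper to central storage
handler = logging.StreamHandler(stream)
handler.setLevel(logging.WARNING)  # only warnings and errors are shipped
handler.setFormatter(JsonFormatter())

log = logging.getLogger("payment-api")  # hypothetical service name
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info("request served")  # below WARNING: not shipped
log.warning("retrying upstream call")
log.error("upstream call failed")

shipped = [json.loads(line) for line in stream.getvalue().splitlines()]
```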
24. • Not a common concept in ordinary services
• Very helpful in microservice or fully async
systems, where a response can come from
multiple services or multiple async threads.
TRACING
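The core idea of tracing can be sketched as one ID that follows a request through every async step it fans out to. The example below uses Python's `contextvars` and `asyncio`; the service names and the `spans` list (standing in for a trace exporter) are hypothetical, and real tracing systems also record timing and parent/child relationships, which this sketch omits.

```python
import asyncio
import contextvars
import uuid

trace_id = contextvars.ContextVar("trace_id", default="-")
spans = []  # collected (trace_id, step) pairs; a real tracer would export these


async def call_service(step):
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    spans.append((trace_id.get(), step))


async def handle_request():
    trace_id.set(uuid.uuid4().hex)  # one ID per incoming request
    # Tasks created by gather copy the current context, so they inherit the ID.
    await asyncio.gather(call_service("auth"), call_service("profile"))
    return trace_id.get()


async def main():
    # Each handle_request runs in its own task, so IDs do not leak between requests.
    return await asyncio.gather(handle_request(), handle_request())


request_ids = asyncio.run(main())
```

Grouping `spans` by trace ID then reconstructs which downstream calls belonged to which request, even though the calls interleave.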
27. • We call it IMON
• IMON can
• Aggregate metrics from tens of thousands of hosts and
alert on them
• Aggregate warn/error logs from applications and alert on them
• (ongoing) Trace requests across services
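The aggregate-then-alert flow can be illustrated with a short sketch. This is not IMON's implementation: the rule table, the thresholds, the metric names, and the `evaluate` function are assumptions chosen to show default rules being applied to values aggregated across hosts.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical default rules: metric name -> (aggregation, threshold)
DEFAULT_RULES = {
    "disk.used_ratio": (max, 0.90),  # alert if any host is nearly full
    "cpu.user": (mean, 0.80),        # alert if the fleet average is high
}


def evaluate(samples, rules=DEFAULT_RULES):
    """samples: iterable of (host, metric_name, value). Returns fired alerts."""
    by_metric = defaultdict(list)
    for _host, name, value in samples:
        by_metric[name].append(value)
    alerts = []
    for name, (aggregate, threshold) in rules.items():
        values = by_metric.get(name)
        if values and aggregate(values) > threshold:
            alerts.append(
                f"{name} {aggregate.__name__}={aggregate(values):.2f} > {threshold}"
            )
    return alerts


samples = [
    ("host-1", "disk.used_ratio", 0.95),
    ("host-2", "disk.used_ratio", 0.40),
    ("host-1", "cpu.user", 0.30),
    ("host-2", "cpu.user", 0.50),
]
alerts = evaluate(samples)
```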
33. • Sharded MySQL cluster (~50 servers)
• Partitioned by "customer"
• Batched writes for better throughput
METRICS DATABASE
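A rough sketch of partitioning by customer combined with write batching. The CRC32 hash, the batch size, and the `BatchingWriter` class are assumptions for illustration; the slides do not describe the actual partitioning function or batching policy.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 50  # roughly the cluster size mentioned on the slide


def shard_for(customer):
    """Stable shard index for a customer (CRC32 is an assumption, not the slide's scheme)."""
    return zlib.crc32(customer.encode("utf-8")) % NUM_SHARDS


class BatchingWriter:
    """Collect rows per shard and flush them in groups instead of one by one."""

    def __init__(self, flush, batch_size=3):
        self.flush = flush  # callable(shard, rows), e.g. one multi-row INSERT
        self.batch_size = batch_size
        self.pending = defaultdict(list)

    def write(self, customer, row):
        shard = shard_for(customer)
        self.pending[shard].append(row)
        if len(self.pending[shard]) >= self.batch_size:
            self.flush(shard, self.pending.pop(shard))

    def close(self):
        """Flush whatever is left, including partially filled batches."""
        for shard, rows in self.pending.items():
            self.flush(shard, rows)
        self.pending.clear()


flushed = []
writer = BatchingWriter(lambda shard, rows: flushed.append((shard, list(rows))))
for i in range(7):
    writer.write("customer-a", ("queue.size", float(i), 1700000000 + i))
writer.close()
```

Because the shard depends only on the customer, all of one customer's rows land on the same server, and each flush becomes a single round trip.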
34. • MySQL is not a good fit as a time series database
• What makes a "good TSDB"?
• Compression
• Optimized for writes, but reads MUST be fast enough
• Flexible queries (topK, rate, delta)
• Fast aggregation
• We're moving to OpenTSDB
METRICS DATABASE
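To show what the query types named on this slide compute, here is a plain-Python sketch of delta, rate, and topK over small in-memory series. It illustrates the semantics only, under the assumption that a series is a time-ordered list of (timestamp, value) pairs; it is not OpenTSDB's query API, and it ignores counter resets.

```python
import heapq


def delta(series):
    """Change in value between the first and last sample of [(timestamp, value), ...]."""
    return series[-1][1] - series[0][1]


def rate(series):
    """Average per-second change over the window."""
    elapsed = series[-1][0] - series[0][0]
    if elapsed <= 0:
        raise ValueError("need at least two samples at different times")
    return delta(series) / elapsed


def top_k(series_by_name, k, key=rate):
    """Names of the k series with the largest key(series), largest first."""
    return heapq.nlargest(k, series_by_name, key=lambda name: key(series_by_name[name]))


# Made-up request counters sampled 60 seconds apart
request_count = {
    "api-1": [(0, 100), (60, 700)],
    "api-2": [(0, 50), (60, 170)],
    "api-3": [(0, 10), (60, 1210)],
}
busiest = top_k(request_count, 2)
```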
35. • Elasticsearch stores the warn/error logs
• Elasticsearch handles writes very well (helped by
write batching at the application layer)
• However, a bad read query can kill the server
LOGGING DATABASE
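Application-layer write batching for Elasticsearch usually means grouping documents into bulk requests. The sketch below only builds request bodies in the newline-delimited JSON shape the Elasticsearch bulk API expects (an action line followed by the document, with a trailing newline); it sends nothing. The index name `app-logs`, the batch size, and the helper names are assumptions for the example.

```python
import json


def bulk_body(index, docs):
    """Build one newline-delimited bulk body: an action line, then the document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


logs = [{"level": "ERROR", "message": f"failure {i}"} for i in range(5)]
bodies = [bulk_body("app-logs", batch) for batch in batches(logs, 2)]
```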
36. • We wrote our own in Go
• Similar architecture to Telegraf (but with a buffer)
• Fully managed
• Monitor every agent's CPU / memory usage
• Monitor every agent's errors
• Automatic roll-out
TELEMETRY AGENT
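One way to read "with a buffer" is that the agent keeps collected points in a bounded queue and retries when the collector is unreachable. The sketch below shows that idea in Python for brevity, although the slide says the real agent is written in Go; the `BufferedAgent` class, its capacity, and the drop-oldest policy are assumptions, not the agent's actual design.

```python
from collections import deque


class BufferedAgent:
    """Hold points in a bounded buffer and send them in batches."""

    def __init__(self, send, capacity=1000):
        self.send = send  # callable(batch); raises ConnectionError on failure
        self.buffer = deque(maxlen=capacity)  # oldest points are dropped when full

    def collect(self, point):
        self.buffer.append(point)

    def flush(self):
        """Try to send everything; keep the points if the collector is unreachable."""
        if not self.buffer:
            return 0
        batch = list(self.buffer)
        try:
            self.send(batch)
        except ConnectionError:
            return 0  # keep the buffer and retry on the next flush
        self.buffer.clear()
        return len(batch)


received = []
attempts = {"n": 0}


def flaky_send(batch):
    """Fake collector client that fails on the first attempt only."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConnectionError("collector down")
    received.extend(batch)


agent = BufferedAgent(flaky_send, capacity=3)
for i in range(4):
    agent.collect(("cpu.user", i / 10, i))
first = agent.flush()   # fails, points stay buffered
second = agent.flush()  # succeeds
```

The bounded capacity is what keeps the agent's own memory usage predictable during a long outage, at the cost of losing the oldest points.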
37. • Flexible routing rules
• Dedicated collectors for big customers
• Drop requests via dynamic configuration
• Built with Armeria and Central Dogma
ROUTING GATEWAY
https://github.com/line/armeria
https://github.com/line/centraldogma
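The routing behaviour described on this slide can be sketched as a small rule table that is reloadable at runtime. This is an illustration only and does not use the Armeria or Central Dogma APIs: the config layout, the collector names, and the `Router` class are hypothetical.

```python
import json

# Hypothetical config document; in the setup described, such data would be
# served by a configuration repository and pushed to the gateway on change.
CONFIG_JSON = """
{
  "default_collector": "collector-shared",
  "dedicated": {"big-customer": "collector-big-1"},
  "drop": ["noisy-customer"]
}
"""


class Router:
    def __init__(self, config):
        self.reload(config)

    def reload(self, config):
        """Swap in a new config at runtime, without restarting the gateway."""
        self.default = config["default_collector"]
        self.dedicated = dict(config.get("dedicated", {}))
        self.drop = set(config.get("drop", []))

    def route(self, customer):
        """Return the collector for a customer, or None if its requests are dropped."""
        if customer in self.drop:
            return None
        return self.dedicated.get(customer, self.default)


router = Router(json.loads(CONFIG_JSON))
before = router.route("noisy-customer")
router.reload({"default_collector": "collector-shared"})
after = router.route("noisy-customer")
```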
38. • Faster, more stable TSDB
• Wire everything together
• For every alert, see the big picture, with metrics/
logs/traces in the same place
• Autonomous alerting
• With the help of machine learning
FUTURE
39. FINALLY
• How you monitor reflects your engineering
culture
• Data-driven culture
• Stability-driven culture
• Monitoring is NOT only for DevOps engineers or
sysadmins, but for EVERY
ENGINEER