SlideShare a Scribd company logo
1 of 40
Download to read offline
Metrics driven development,
a observability perspective
Huy Do
LINE corp
Introduction
• Huy Do
• Software Engineer at Observability Team
• Founded kipalog.com & Ruby Vietnam group
Agenda
• Metrics driven culture at LINE
• Introduce our observability stack
LINE
• A lot of end users (~170M active)
• A lot of traffics
• A lot of services (delivery, taxi, games,
manga…)
What we care
• User Experience
• One important prospect of User Experience
is Reliability
RELIABILITY
• No Downtime
• Low MTTR (Mean Time To Repair)
• Fast Response
• Fair response time
• Fair percentile latency : p99, p95, p50
HOW
CULTURE
• EVERY Engineers MUST care about their application
statuses
• EVERY Engineers MUST do on-call rotate
• NO "application engineer" who write code only
• We have a dedicate team to provide them stable tools
to care about their application status at best
CULTURE
APPLICATION STATUS?
OBSERVABILITY
– Wikipedia
“observability is a measure of how
well internal states of a system can be
inferred from knowledge of its external
outputs”
METRICS
LOGGING
TRACING
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
METRICS
• Metrics
• Most simplest form is a triple
• (name, value, timestamp)
• Could be represent as graph
METRICS
• System Metrics
• CPU/Disk IO/Network/DiskUsage...
• MUST: have alert for critical metrics by default (users
don't know what to monitor, and don't know the good
threshold)
• Application Metrics
• Internal queue size, endpoint latency tail (p50, p95,
p99), request size, request count
METRICS
• In LINE we care A LOT about Application Metrics
• We try to instrument every single new added logic
• Some of our heavy servers exported over 10000
metrics per server
METRICS
LOGGING
Warn / Error / Fatal log
for alerting
• In LINE All error / warning logs MUST be
• Permanent stored (for trouble shooting later)
• Used for alerting
• Easy to query (you should not go to each host,
and do grep access log)
LOGGING
LOGGING
Real time error/warn log analysis with help of 

Elasticsearch / Kibana
LOGGING
Daily report for error trend
TRACING
• Not a common concept in normal service
• Very helpful in microservice or fully async
system , when a response could come from
multiple services or multiple async threads.
TRACING
TRACING
OpenZipkin
LINE OBSERVABILITY
STACK
• We call it IMON
• IMON could
• Aggregate metrics from dozen of thousands of hosts, and
do alert
• Aggregate warn/error logs from application and do alert
• (on going) Tracing requests across services
HOW BIG?
• ~ 20 millions metrics per minute
• And keep growing every day
• ~ 500k log received per minute (peak time could
up to few millions)
ARCHITECTURE
DETAILS
•Shard-ing MySQL cluster (~50 servers)
•Partition by “customers”
•Batching write for better throughput
METRICS DATABASE
• MySQL is not fit for time series database
• "Good TSDB"?
• Compression
• Optimize for write, but read MUST fast enough
• Flexible query (topK, rate, delta)
• Fast aggregate
• We're moving to OpenTSDB
METRICS DATABASE
• ElasticSearch to store warn/error log
• ElasticSearch is very good at writing (with support
of batching write from application layer)
• However, some bad read query will kill the server
LOGGING DATABASE
• Wrote our own in golang
• Similar architect with telegraf (but with buffer)
• Fully managed
• Monitor all agents CPU / memory usage..
• Monitor all agents error
• Automatically roll-out
TELEMETRY AGENT
• Flexbile routing rules
• Dedicated collector for big customer
• Drop request by dynamic configuration
• Written by armeria and centraldogma
ROUTING GATEWAY
https://github.com/line/armeria
https://github.com/line/centraldogma
• Faster, more stable TSDB
• Wire everything together
• For every alert, see the big image with metrics/
log/tracing in same place
• Autonomous alerting
• With help of Machine Learning
FUTURE
FINALLY
• How you monitor reflect your engineering
culture
• Data driven culture
• Stability driven culture
• Monitoring IS NOT for devops engineer or
sysadmin only, but for EVERY
ENGINEERS
Thank you for listening

More Related Content

What's hot

Modern APM Approaches
Modern APM ApproachesModern APM Approaches
Modern APM ApproachesKen Ahrens
 
Schemas Beyond The Edge
Schemas Beyond The EdgeSchemas Beyond The Edge
Schemas Beyond The Edgeconfluent
 
APidays Paris 2019 - Reason for Asynchronous APIs by John Carter, Software AG
APidays Paris 2019 - Reason for Asynchronous APIs by John Carter, Software AGAPidays Paris 2019 - Reason for Asynchronous APIs by John Carter, Software AG
APidays Paris 2019 - Reason for Asynchronous APIs by John Carter, Software AGapidays
 
LINEデリマでのElasticsearchの運用と監視の話
LINEデリマでのElasticsearchの運用と監視の話LINEデリマでのElasticsearchの運用と監視の話
LINEデリマでのElasticsearchの運用と監視の話LINE Corporation
 
OPEN'17_2_Customer Experience_Essent
OPEN'17_2_Customer Experience_EssentOPEN'17_2_Customer Experience_Essent
OPEN'17_2_Customer Experience_EssentKangaroot
 
Real User Monitoring (RUM)
Real User Monitoring (RUM)Real User Monitoring (RUM)
Real User Monitoring (RUM)Site24x7
 
[Old] Site24x7 Real Browser Monitoring
[Old] Site24x7 Real Browser Monitoring[Old] Site24x7 Real Browser Monitoring
[Old] Site24x7 Real Browser MonitoringSite24x7
 
ONAP Overview Webinar - Aarna Networks & Cloudify
ONAP Overview Webinar - Aarna Networks & CloudifyONAP Overview Webinar - Aarna Networks & Cloudify
ONAP Overview Webinar - Aarna Networks & CloudifyCloudify Community
 
Velocity - NxtGen Oxford
Velocity - NxtGen OxfordVelocity - NxtGen Oxford
Velocity - NxtGen OxfordPhil Pursglove
 
Project FiFo - Architecture
Project FiFo - ArchitectureProject FiFo - Architecture
Project FiFo - ArchitectureLicenser
 
Putting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsPutting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsGareth Rogers
 
Beyond POLB (Plain Old Load Balancing)
Beyond POLB (Plain Old Load Balancing) Beyond POLB (Plain Old Load Balancing)
Beyond POLB (Plain Old Load Balancing) Lori MacVittie
 
A Gentle Introduction to Functions-as-a-Service
A Gentle Introduction to Functions-as-a-ServiceA Gentle Introduction to Functions-as-a-Service
A Gentle Introduction to Functions-as-a-ServiceValeri Karpov
 
Apache Kafka : Monitoring vs Alerting
Apache Kafka : Monitoring vs AlertingApache Kafka : Monitoring vs Alerting
Apache Kafka : Monitoring vs AlertingRatish Ravindran
 
Rootconf 2017 - State of the Open Source monitoring landscape
Rootconf 2017 - State of the Open Source monitoring landscape Rootconf 2017 - State of the Open Source monitoring landscape
Rootconf 2017 - State of the Open Source monitoring landscape NETWAYS
 
[Webinar] End User Experience Monitoring with Site24x7
[Webinar] End User Experience Monitoring with Site24x7[Webinar] End User Experience Monitoring with Site24x7
[Webinar] End User Experience Monitoring with Site24x7Site24x7
 
Scale out magento 2 at aws
Scale out magento 2 at awsScale out magento 2 at aws
Scale out magento 2 at awsroot360 GmbH
 
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...Flink Forward
 
Shedding Light on LINE Token Economy You Won't Find in Our White Paper
Shedding Light on LINE Token Economy You Won't Find in Our White PaperShedding Light on LINE Token Economy You Won't Find in Our White Paper
Shedding Light on LINE Token Economy You Won't Find in Our White PaperLINE Corporation
 

What's hot (20)

Modern APM Approaches
Modern APM ApproachesModern APM Approaches
Modern APM Approaches
 
Schemas Beyond The Edge
Schemas Beyond The EdgeSchemas Beyond The Edge
Schemas Beyond The Edge
 
APidays Paris 2019 - Reason for Asynchronous APIs by John Carter, Software AG
APidays Paris 2019 - Reason for Asynchronous APIs by John Carter, Software AGAPidays Paris 2019 - Reason for Asynchronous APIs by John Carter, Software AG
APidays Paris 2019 - Reason for Asynchronous APIs by John Carter, Software AG
 
LINEデリマでのElasticsearchの運用と監視の話
LINEデリマでのElasticsearchの運用と監視の話LINEデリマでのElasticsearchの運用と監視の話
LINEデリマでのElasticsearchの運用と監視の話
 
OPEN'17_2_Customer Experience_Essent
OPEN'17_2_Customer Experience_EssentOPEN'17_2_Customer Experience_Essent
OPEN'17_2_Customer Experience_Essent
 
Real User Monitoring (RUM)
Real User Monitoring (RUM)Real User Monitoring (RUM)
Real User Monitoring (RUM)
 
[Old] Site24x7 Real Browser Monitoring
[Old] Site24x7 Real Browser Monitoring[Old] Site24x7 Real Browser Monitoring
[Old] Site24x7 Real Browser Monitoring
 
ONAP Overview Webinar - Aarna Networks & Cloudify
ONAP Overview Webinar - Aarna Networks & CloudifyONAP Overview Webinar - Aarna Networks & Cloudify
ONAP Overview Webinar - Aarna Networks & Cloudify
 
Velocity - NxtGen Oxford
Velocity - NxtGen OxfordVelocity - NxtGen Oxford
Velocity - NxtGen Oxford
 
Project FiFo - Architecture
Project FiFo - ArchitectureProject FiFo - Architecture
Project FiFo - Architecture
 
Putting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech AnalysticsPutting the Spark into Functional Fashion Tech Analystics
Putting the Spark into Functional Fashion Tech Analystics
 
Beyond POLB (Plain Old Load Balancing)
Beyond POLB (Plain Old Load Balancing) Beyond POLB (Plain Old Load Balancing)
Beyond POLB (Plain Old Load Balancing)
 
A Gentle Introduction to Functions-as-a-Service
A Gentle Introduction to Functions-as-a-ServiceA Gentle Introduction to Functions-as-a-Service
A Gentle Introduction to Functions-as-a-Service
 
Apache Kafka : Monitoring vs Alerting
Apache Kafka : Monitoring vs AlertingApache Kafka : Monitoring vs Alerting
Apache Kafka : Monitoring vs Alerting
 
Rootconf 2017 - State of the Open Source monitoring landscape
Rootconf 2017 - State of the Open Source monitoring landscape Rootconf 2017 - State of the Open Source monitoring landscape
Rootconf 2017 - State of the Open Source monitoring landscape
 
[Webinar] End User Experience Monitoring with Site24x7
[Webinar] End User Experience Monitoring with Site24x7[Webinar] End User Experience Monitoring with Site24x7
[Webinar] End User Experience Monitoring with Site24x7
 
Scale out magento 2 at aws
Scale out magento 2 at awsScale out magento 2 at aws
Scale out magento 2 at aws
 
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
 
ONAP on Vagrant
ONAP on VagrantONAP on Vagrant
ONAP on Vagrant
 
Shedding Light on LINE Token Economy You Won't Find in Our White Paper
Shedding Light on LINE Token Economy You Won't Find in Our White PaperShedding Light on LINE Token Economy You Won't Find in Our White Paper
Shedding Light on LINE Token Economy You Won't Find in Our White Paper
 

Similar to Metrics driven development with dedicated Observability Team

Micro Services Architecture
Micro Services ArchitectureMicro Services Architecture
Micro Services ArchitectureRanjan Baisak
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Startupfest
 
Mule Runtime: Performance Tuning
Mule Runtime: Performance Tuning Mule Runtime: Performance Tuning
Mule Runtime: Performance Tuning MuleSoft
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Tel...
Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Tel...Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Tel...
Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Tel...InfluxData
 
PayPal Risk Platform High Performance Practice
PayPal Risk Platform High Performance PracticePayPal Risk Platform High Performance Practice
PayPal Risk Platform High Performance PracticeBrian Ling
 
Kinesis @ lyft
Kinesis @ lyftKinesis @ lyft
Kinesis @ lyftMian Hamid
 
Production Ready Microservices at Scale
Production Ready Microservices at ScaleProduction Ready Microservices at Scale
Production Ready Microservices at ScaleRajeev Bharshetty
 
Kubernetes Infra 2.0
Kubernetes Infra 2.0Kubernetes Infra 2.0
Kubernetes Infra 2.0Deepak Sood
 
Agile infrastructure
Agile infrastructureAgile infrastructure
Agile infrastructureTarun Rajput
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Amazon Web Services
 
Debugging Microservices - key challenges and techniques - Microservices Odesa...
Debugging Microservices - key challenges and techniques - Microservices Odesa...Debugging Microservices - key challenges and techniques - Microservices Odesa...
Debugging Microservices - key challenges and techniques - Microservices Odesa...Lohika_Odessa_TechTalks
 
Tech talk microservices debugging
Tech talk microservices debuggingTech talk microservices debugging
Tech talk microservices debuggingAndrey Kolodnitsky
 
Moving to microservices – a technology and organisation transformational journey
Moving to microservices – a technology and organisation transformational journeyMoving to microservices – a technology and organisation transformational journey
Moving to microservices – a technology and organisation transformational journeyBoyan Dimitrov
 
Architecture for Scale [AppFirst]
Architecture for Scale [AppFirst]Architecture for Scale [AppFirst]
Architecture for Scale [AppFirst]AppFirst
 
Dubbo and Weidian's practice on micro-service architecture
Dubbo and Weidian's practice on micro-service architectureDubbo and Weidian's practice on micro-service architecture
Dubbo and Weidian's practice on micro-service architectureHuxing Zhang
 
Istio Mesh – Managing Container Deployments at Scale
Istio Mesh – Managing Container Deployments at ScaleIstio Mesh – Managing Container Deployments at Scale
Istio Mesh – Managing Container Deployments at ScaleMofizur Rahman
 

Similar to Metrics driven development with dedicated Observability Team (20)

Micro Services Architecture
Micro Services ArchitectureMicro Services Architecture
Micro Services Architecture
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
 
Mule Runtime: Performance Tuning
Mule Runtime: Performance Tuning Mule Runtime: Performance Tuning
Mule Runtime: Performance Tuning
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Tel...
Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Tel...Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Tel...
Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Tel...
 
Scaling tappsi
Scaling tappsiScaling tappsi
Scaling tappsi
 
PayPal Risk Platform High Performance Practice
PayPal Risk Platform High Performance PracticePayPal Risk Platform High Performance Practice
PayPal Risk Platform High Performance Practice
 
Kinesis @ lyft
Kinesis @ lyftKinesis @ lyft
Kinesis @ lyft
 
Production Ready Microservices at Scale
Production Ready Microservices at ScaleProduction Ready Microservices at Scale
Production Ready Microservices at Scale
 
Kubernetes Infra 2.0
Kubernetes Infra 2.0Kubernetes Infra 2.0
Kubernetes Infra 2.0
 
Cassandra in xPatterns
Cassandra in xPatternsCassandra in xPatterns
Cassandra in xPatterns
 
Agile infrastructure
Agile infrastructureAgile infrastructure
Agile infrastructure
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
 
Redundant devops
Redundant devopsRedundant devops
Redundant devops
 
Debugging Microservices - key challenges and techniques - Microservices Odesa...
Debugging Microservices - key challenges and techniques - Microservices Odesa...Debugging Microservices - key challenges and techniques - Microservices Odesa...
Debugging Microservices - key challenges and techniques - Microservices Odesa...
 
Tech talk microservices debugging
Tech talk microservices debuggingTech talk microservices debugging
Tech talk microservices debugging
 
Moving to microservices – a technology and organisation transformational journey
Moving to microservices – a technology and organisation transformational journeyMoving to microservices – a technology and organisation transformational journey
Moving to microservices – a technology and organisation transformational journey
 
Architecture for Scale [AppFirst]
Architecture for Scale [AppFirst]Architecture for Scale [AppFirst]
Architecture for Scale [AppFirst]
 
Dubbo and Weidian's practice on micro-service architecture
Dubbo and Weidian's practice on micro-service architectureDubbo and Weidian's practice on micro-service architecture
Dubbo and Weidian's practice on micro-service architecture
 
Istio Mesh – Managing Container Deployments at Scale
Istio Mesh – Managing Container Deployments at ScaleIstio Mesh – Managing Container Deployments at Scale
Istio Mesh – Managing Container Deployments at Scale
 

More from LINE Corporation

JJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LTJJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LTLINE Corporation
 
Reduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin CoroutinesReduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin CoroutinesLINE Corporation
 
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみたKotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみたLINE Corporation
 
Use Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extensionUse Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extensionLINE Corporation
 
The Magic of LINE 購物 Testing
The Magic of LINE 購物 TestingThe Magic of LINE 購物 Testing
The Magic of LINE 購物 TestingLINE Corporation
 
UI Automation Test with JUnit5
UI Automation Test with JUnit5UI Automation Test with JUnit5
UI Automation Test with JUnit5LINE Corporation
 
Feature Detection for UI Testing
Feature Detection for UI TestingFeature Detection for UI Testing
Feature Detection for UI TestingLINE Corporation
 
LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享LINE Corporation
 
​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享LINE Corporation
 
LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣LINE Corporation
 
日本開發者大會短講分享
日本開發者大會短講分享日本開發者大會短講分享
日本開發者大會短講分享LINE Corporation
 
LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享LINE Corporation
 
在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed KubernetesLINE Corporation
 
LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧LINE Corporation
 
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹LINE Corporation
 
LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享LINE Corporation
 
LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗LINE Corporation
 
LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務LINE Corporation
 
Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發LINE Corporation
 

More from LINE Corporation (20)

JJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LTJJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LT
 
Reduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin CoroutinesReduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin Coroutines
 
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみたKotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
 
Use Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extensionUse Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extension
 
The Magic of LINE 購物 Testing
The Magic of LINE 購物 TestingThe Magic of LINE 購物 Testing
The Magic of LINE 購物 Testing
 
GA Test Automation
GA Test AutomationGA Test Automation
GA Test Automation
 
UI Automation Test with JUnit5
UI Automation Test with JUnit5UI Automation Test with JUnit5
UI Automation Test with JUnit5
 
Feature Detection for UI Testing
Feature Detection for UI TestingFeature Detection for UI Testing
Feature Detection for UI Testing
 
LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享
 
​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享
 
LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣
 
日本開發者大會短講分享
日本開發者大會短講分享日本開發者大會短講分享
日本開發者大會短講分享
 
LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享
 
在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes
 
LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧
 
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
 
LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享
 
LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗
 
LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務
 
Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發
 

Recently uploaded

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Metrics driven development with dedicated Observability Team

  • 1. Metrics driven development, a observability perspective Huy Do LINE corp
  • 2. Introduction • Huy Do • Software Engineer at Observability Team • Founded kipalog.com & Ruby Vietnam group
  • 3. Agenda • Metrics driven culture at LINE • Introduce our observability stack
  • 4. LINE • A lot of end users (~170M active) • A lot of traffics • A lot of services (delivery, taxi, games, manga…)
  • 5. What we care • User Experience • One important prospect of User Experience is Reliability
  • 6. RELIABILITY • No Downtime • Low MTTR (Mean Time To Repair) • Fast Response • Fair response time • Fair percentile latency : p99, p95, p50
  • 7. HOW
  • 9. • EVERY Engineers MUST care about their application statuses • EVERY Engineers MUST do on-call rotate • NO "application engineer" who write code only • We have a dedicate team to provide them stable tools to care about their application status at best CULTURE
  • 12. – Wikipedia “observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs”
  • 15. • Metrics • Most simplest form is a triple • (name, value, timestamp) • Could be represent as graph METRICS
  • 16. • System Metrics • CPU/Disk IO/Network/DiskUsage... • MUST: have alert for critical metrics by default (users don't know what to monitor, and don't know the good threshold) • Application Metrics • Internal queue size, endpoint latency tail (p50, p95, p99), request size, request count METRICS
  • 17. • In LINE we care A LOT about Application Metrics • We try to instrument every single new added logic • Some of our heavy servers exported over 10000 metrics per server METRICS
  • 19. Warn / Error / Fatal log for alerting
  • 20. • In LINE All error / warning logs MUST be • Permanent stored (for trouble shooting later) • Used for alerting • Easy to query (you should not go to each host, and do grep access log) LOGGING
  • 21. LOGGING Real time error/warn log analysis with help of 
 Elasticsearch / Kibana
  • 24. • Not a common concept in normal service • Very helpful in microservice or fully async system , when a response could come from multiple services or multiple async threads. TRACING
  • 27. • We call it IMON • IMON could • Aggregate metrics from dozen of thousands of hosts, and do alert • Aggregate warn/error logs from application and do alert • (on going) Tracing requests across services
  • 29. • ~ 20 millions metrics per minute • And keep growing every day • ~ 500k log received per minute (peak time could up to few millions)
  • 31.
  • 33. •Shard-ing MySQL cluster (~50 servers) •Partition by “customers” •Batching write for better throughput METRICS DATABASE
  • 34. • MySQL is not fit for time series database • "Good TSDB"? • Compression • Optimize for write, but read MUST fast enough • Flexible query (topK, rate, delta) • Fast aggregate • We're moving to OpenTSDB METRICS DATABASE
  • 35. • ElasticSearch to store warn/error log • ElasticSearch is very good at writing (with support of batching write from application layer) • However, some bad read query will kill the server LOGGING DATABASE
  • 36. • Wrote our own in golang • Similar architect with telegraf (but with buffer) • Fully managed • Monitor all agents CPU / memory usage.. • Monitor all agents error • Automatically roll-out TELEMETRY AGENT
  • 37. • Flexbile routing rules • Dedicated collector for big customer • Drop request by dynamic configuration • Written by armeria and centraldogma ROUTING GATEWAY https://github.com/line/armeria https://github.com/line/centraldogma
  • 38. • Faster, more stable TSDB • Wire everything together • For every alert, see the big image with metrics/ log/tracing in same place • Autonomous alerting • With help of Machine Learning FUTURE
  • 39. FINALLY • How you monitor reflect your engineering culture • Data driven culture • Stability driven culture • Monitoring IS NOT for devops engineer or sysadmin only, but for EVERY ENGINEERS
  • 40. Thank you for listening