ABOUT ME
- 14y. in IT
- 13y. in Node.js dev
- RnD Team Lead at WalkMe
- working with highload services
What is
highload?
It is when 2 servers are not enough
Why 2+?
Redundancy
One service can shut
down or break
Zero-downtime updates
You can update one
service while the second
handles requests
2 is the minimum number of
servers even for
non-highload projects
When do you need
2+ servers?
- Customers are complaining about
performance
- Your metrics show performance
degradation
Maybe optimize your app?
- Yes
- Any code optimization has
its limits
- At some point you will
reach your CPU capacity
with more users
So adding more servers is the right
approach to handle more requests?
Yes, but how many servers?
9 or 10?
Status codes
the more 2xx, the better
the fewer 5xx, the better
Backend latency
Preferably respond in under 200ms
To satisfy business needs
Be cost effective
The less we spend, the more money
the business keeps.
How to achieve this?
CPU
~40-60% avg utilization
Memory
<50% max utilization
Traffic pattern
This can affect our auto scaling
parameters
Active handles
Spikes of active handles can block
requests from being processed
Active requests
Spikes of active requests can block
requests from being processed
Event loop lag
can be the reason we can't handle
requests in time
Monitoring & auto scaling
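The utilization targets above can feed a scaling rule. The sketch below is one way to do it, assuming a proportional rule of the kind documented for the Kubernetes Horizontal Pod Autoscaler; the talk does not say which autoscaler is used. `desiredReplicas`, its parameters and the limits are hypothetical names chosen for illustration.

```javascript
// Proportional scaling rule: scale the replica count by the ratio of
// observed CPU utilization to the target utilization, then clamp it.
// The default minimum of 2 follows the "2 is the minimum" advice above.
function desiredReplicas(currentReplicas, currentCpuPercent, targetCpuPercent, min = 2, max = 100) {
  const raw = Math.ceil(currentReplicas * (currentCpuPercent / targetCpuPercent));
  return Math.min(max, Math.max(min, raw));
}

// 4 servers at 75% CPU with a 50% target -> scale out to 6
console.log(desiredReplicas(4, 75, 50));
```

With a 50% target, the fleet grows when average CPU rises above 50% and shrinks when it falls below, never dropping under two servers.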
$$$$
Case 1: traffic increases and
decreases gradually
$$$$
Case 2: traffic and/or CPU
usage increases and decreases
sporadically
$$ - potential money saving
Such systems are hard to auto scale
when some requests are heavy.
Possible solution - offload CPU heavy
tasks to offline jobs (workers,
separate deployments)
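The slide suggests separate worker deployments. As a smaller in-process variation of the same idea (a swapped-in technique, not the setup described in the talk), the sketch below moves a CPU-heavy loop into a Node.js worker thread so the main event loop stays free. `sumInWorker` and the summing task are hypothetical placeholders for real heavy work.

```javascript
const { Worker } = require('node:worker_threads');

// Source of the worker, kept inline so the example fits in one file.
// In a real project this would usually live in its own module.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  let sum = 0;
  for (let i = 0; i < workerData.n; i++) sum += i; // stand-in for heavy CPU work
  parentPort.postMessage(sum);
`;

// Runs the heavy loop off the main thread and resolves with its result.
function sumInWorker(n) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { n } });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

sumInWorker(1000).then((sum) => console.log('sum:', sum));
```

A queue plus a separately deployed consumer follows the same shape: the request handler only enqueues the job and returns.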
Node.js metrics: event loop lag
Hundreds of these can cause high event loop lag and
lead to app unresponsiveness.
Mitigation: add setImmediate() to your loops
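The original slide showed code that is not preserved here. The sketch below is one way to apply this mitigation, assuming the work is a long synchronous loop over an array; `mapInChunks`, `yieldToEventLoop` and the chunk size are hypothetical names and values.

```javascript
// Resolves on the next iteration of the event loop, which lets pending
// I/O callbacks (for example incoming requests) run in between chunks.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

// Applies fn to every item, pausing after each chunk instead of
// holding the event loop for the whole array.
async function mapInChunks(items, fn, chunkSize = 1000) {
  const results = [];
  for (let start = 0; start < items.length; start += chunkSize) {
    for (const item of items.slice(start, start + chunkSize)) {
      results.push(fn(item));
    }
    await yieldToEventLoop();
  }
  return results;
}

mapInChunks([1, 2, 3, 4, 5], (x) => x * 2, 2).then((doubled) => console.log(doubled));
```

The chunk size is a trade-off: smaller chunks keep lag low but add scheduling overhead, so it is worth tuning against the measured event loop lag.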
event loop lag in sync methods
I hope you are not using sync methods of fs.
Use async variations of methods everywhere.
Do not use it
Use it
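The two code samples from this slide are not preserved. The sketch below contrasts the two styles under the assumption that the task is reading a JSON file; `readConfigSync` and `readConfig` are hypothetical helper names.

```javascript
const fs = require('node:fs');

// Do not use in request handlers: the event loop is blocked until the
// whole file has been read, so no other request can be served meanwhile.
function readConfigSync(file) {
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}

// Use this: the read runs in the libuv thread pool and the event loop
// keeps serving other requests until the data is ready.
async function readConfig(file) {
  return JSON.parse(await fs.promises.readFile(file, 'utf8'));
}
```

Sync variants remain reasonable for one-off work at process startup, before the server begins accepting traffic.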
How to capture these?
Default metrics can be collected in the registry
of prom-client and exposed by your
HTTP server, so Prometheus can scrape
them and Grafana can display them
Exploring event loop lag
Avg event loop lag > 100ms is a reason to investigate
Other default metrics that are collected with
“collectDefaultMetrics”
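For a quick local check without a metrics stack, event loop lag can also be sampled with Node's built-in `perf_hooks.monitorEventLoopDelay`. The sketch below is an illustration, not the setup from the talk; `startLagMonitor` and `LAG_THRESHOLD_MS` are hypothetical names, and the 100ms threshold is the one from the slide.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

const LAG_THRESHOLD_MS = 100; // investigation threshold from the slide

// Starts sampling event loop delay and returns small helpers around it.
function startLagMonitor(resolutionMs = 10) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  return {
    // The histogram reports nanoseconds; treat the value as approximate.
    meanMs: () => histogram.mean / 1e6,
    needsInvestigation: () => histogram.mean / 1e6 > LAG_THRESHOLD_MS,
    stop: () => histogram.disable(),
  };
}

const monitor = startLagMonitor();
setTimeout(() => {
  console.log('mean lag (ms):', monitor.meanMs());
  monitor.stop();
}, 100);
```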
https://github.com/siimon/prom-client/tree/master/lib/metrics
Debug specific pod and check types of handles
Incoming http requests
from load balancer
Outgoing connections to
3rd parties
'Number of active libuv handles grouped by handle type. Every handle type is C++ class
name.'
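When debugging a single pod by hand, a similar breakdown can be printed from inside the process. The sketch below assumes a Node.js version that provides `process.getActiveResourcesInfo()` (added in v16.14 / v17.3); it is a stand-in for the prom-client metric quoted above, and `activeResourcesByType` is a hypothetical helper name.

```javascript
// Counts the resources currently keeping the event loop alive,
// grouped by type (for example 'TCPSocketWrap' or 'Timeout').
function activeResourcesByType() {
  const counts = {};
  for (const type of process.getActiveResourcesInfo()) {
    counts[type] = (counts[type] || 0) + 1;
  }
  return counts;
}

const timer = setTimeout(() => {}, 10);
console.log(activeResourcesByType()); // includes a 'Timeout' entry for the timer above
clearTimeout(timer);
```

A steadily growing count for one socket type usually points at either the load balancer side or a specific 3rd-party client that is not releasing connections.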
But can we optimize
the app itself?
Code improvements: batch writes
Kafka write example.
Batch operations are also supported by Kinesis,
DynamoDB, Aerospike and many more
Batch writes example
Can be applied to any 3rd party that supports batch writes
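The example code from these slides is not preserved. The sketch below shows the general pattern, assuming the client library offers some batch call; `BatchWriter` and the `sendBatch` callback are hypothetical, and `sendBatch` would wrap whatever batch API the Kafka, Kinesis or DynamoDB client actually provides.

```javascript
// Buffers records and sends them in one call, either when the buffer
// is full or when the oldest record has waited maxWaitMs.
class BatchWriter {
  constructor(sendBatch, { maxSize = 100, maxWaitMs = 50 } = {}) {
    this.sendBatch = sendBatch; // async (records[]) => void, supplied by the caller
    this.maxSize = maxSize;
    this.maxWaitMs = maxWaitMs;
    this.buffer = [];
    this.timer = null;
  }

  write(record) {
    this.buffer.push(record);
    if (this.buffer.length >= this.maxSize) return this.flush();
    if (!this.timer) {
      this.timer = setTimeout(() => this.flush().catch(console.error), this.maxWaitMs);
    }
    return Promise.resolve();
  }

  async flush() {
    clearTimeout(this.timer);
    this.timer = null;
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = []; // swap first so writes during the send start a new batch
    await this.sendBatch(batch);
  }
}

const writer = new BatchWriter(async (records) => console.log('sending', records.length), { maxSize: 2 });
writer.write({ id: 1 });
writer.write({ id: 2 }); // buffer is full, one batch call is made
```

Records still in the buffer are lost if the process dies, so a real service would typically call `flush()` in its shutdown handler.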
Logs, what can go wrong?
100_000 * 3_600 = 0.36B/h
- How much would you pay
to DataDog for this?
- What network load will
this create?
- What CPU load will this
create?
- How would you navigate
through 0.36B logs per
hour?
In highload this can become
mitigation 1: sample errors
You don’t need all 100_000 errors in your logs
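A minimal sketch of error sampling, assuming a fixed sample rate; `createSampledLogger` is a hypothetical helper, and the `log` and `random` parameters exist only so the behaviour can be swapped or tested.

```javascript
// Returns a logger that forwards roughly sampleRate of the calls
// (0.01 = 1%) and silently drops the rest.
function createSampledLogger(sampleRate, log = console.error, random = Math.random) {
  return (message, details) => {
    if (random() < sampleRate) log(message, details);
  };
}

// About 1 in 100 of these errors reaches the log pipeline.
const logError = createSampledLogger(0.01);
logError('db timeout', { host: 'db-1' });
```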
mitigation 2: store statistics of errors
It’s important to know when errors happened and how many
you had
Custom metrics with Prometheus
Now combine these methods
Error messages should be
persistent
You will know exact
number of events that
happened
You can still find details
about the error and where it
happened
You should tune the log rate
to your load. It can be any
number from 0.00001% to 100%
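Combining both mitigations could look like the sketch below. It reads "persistent" as "a fixed message string with variable data passed separately", which is an assumption, and it keeps counts in a plain `Map` as a stand-in for a Prometheus counter; `createErrorReporter` is a hypothetical name.

```javascript
// Counts every error exactly, but writes only a sample of them to the logs.
function createErrorReporter({ sampleRate = 0.01, log = console.error, random = Math.random } = {}) {
  const counts = new Map();
  return {
    report(message, details) {
      // The message is a fixed string, so it can serve as the counter key.
      counts.set(message, (counts.get(message) || 0) + 1);
      // Variable data goes into details and is logged for sampled events only.
      if (random() < sampleRate) log(message, details);
    },
    counts: () => Object.fromEntries(counts),
  };
}

const reporter = createErrorReporter({ sampleRate: 0.001 });
reporter.report('payment provider timeout', { orderId: 42 });
console.log(reporter.counts());
```

In a real service the `Map` would be replaced by a metrics-library counter so the totals are scraped rather than held in memory.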
Conclusion
Horizontal scaling is the most effective way
to handle more requests
Use as few servers as possible
Use batch operations when possible
Log only the amount of logs you need
Offload heavy jobs to “offline workers”
Eliminate long blocking operations
Monitor everything
THANK YOU!
Time for questions!
Andrii Shumada
More talks:
https://eagleeye.github.io

"Surviving highload with Node.js", Andrii Shumada
