This document discusses highload systems and strategies for scaling Node.js applications to handle increased traffic. It recommends using multiple servers for redundancy and handling spikes in load. Key metrics for monitoring include status codes, backend latency, CPU and memory utilization, and event loop lag. Batching operations to third parties and sampling logs are suggested to reduce load. Offloading heavy tasks to workers can also help optimize performance. The document emphasizes monitoring systems closely and using as few servers as possible through optimization.
5. Why 2+?
Redundancy
One service can shut down or break
0-downtime updates
You can update one service while the second handles requests
Two is the minimum number of servers, even for non-highload projects
6. When do you need 2+ servers?
- Customers are complaining about performance
- Your metrics show performance degradation
7. Maybe optimize your app?
- Yes
- Any code optimization has its limits
- At some point you will reach your CPU capacity with more users
8. So adding more servers is the right approach to handle more requests?
10. Status codes
The more 2xx, the better
The fewer 5xx, the better
Backend latency
Preferably respond in under 200 ms
To satisfy business needs
Be cost effective
The less we spend, the more money the business keeps.
How to achieve this?
11. CPU
~40-60% avg utilization
Memory
<50% max utilization
Traffic pattern
This can affect our auto scaling parameters
Active handles
Spikes of active handles can block requests from being processed
Active requests
Spikes of active requests can block requests from being processed
Event loop lag
Can be the reason why we can't handle requests in time
Monitoring & auto scaling
13. Case 2: traffic and/or CPU usage increases and decreases sporadically
(chart: sporadic traffic/CPU curve; "$$" marks potential money saving)
Such systems are hard to auto scale when some requests are heavy.
Possible solution - offload CPU-heavy tasks to offline jobs (workers, separate deployments)
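A minimal sketch of that solution with Node's worker_threads (the task, iteration count, and single-file layout are made-up for the example; a real setup might use a job queue or a separate deployment instead):

// offload.js - single-file sketch: the main thread stays free to serve requests
// while the CPU-heavy loop runs in a worker thread.
const { Worker, isMainThread, parentPort, workerData } = require('node:worker_threads');

if (isMainThread) {
  function runHeavyTask(iterations) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: { iterations } });
      worker.once('message', resolve);
      worker.once('error', reject);
    });
  }
  runHeavyTask(1e9).then((result) => console.log('worker result:', result));
} else {
  // stand-in for real CPU-heavy work (image resizing, parsing, crypto, ...)
  let sum = 0;
  for (let i = 0; i < workerData.iterations; i++) sum += i;
  parentPort.postMessage(sum);
}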
14. Node.js metrics: event loop lag
Hundreds of such blocking iterations can cause high event loop lag and
lead to app unresponsiveness.
Mitigation: add setImmediate() to your loops
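A rough illustration of that mitigation (the chunk size and function names are invented for the example): a long loop periodically yields back to the event loop so pending I/O and incoming requests can run in between.

// Processing a huge array in one synchronous loop blocks the event loop.
// Yielding with setImmediate() between chunks keeps the app responsive;
// the chunk size of 1000 is arbitrary.
async function processAll(items, handleItem) {
  for (let i = 0; i < items.length; i++) {
    handleItem(items[i]);
    if (i % 1000 === 0) {
      await new Promise((resolve) => setImmediate(resolve));
    }
  }
}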
15. Event loop lag in sync methods
I hope you are not using the sync methods of fs.
Use the async variants of methods everywhere.
Do not use the sync version - use the async one.
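A small sketch of the contrast, assuming the fs example the slide refers to is a file read:

const fs = require('node:fs');

// Do not use it: fs.readFileSync blocks the event loop for the whole read,
// so no other request is processed in the meantime.
function readConfigSync(path) {
  return fs.readFileSync(path, 'utf8');
}

// Use it: the async variant runs the read in libuv's thread pool
// and keeps the event loop free for other requests.
async function readConfig(path) {
  return fs.promises.readFile(path, 'utf8');
}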
16. How to capture these?
Default metrics can be collected in the register of prom-client and later
exposed by your HTTP server, so Prometheus can scrape them and Grafana can
display them.
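A minimal sketch of that wiring (port and route are arbitrary choices for the example):

const http = require('node:http');
const client = require('prom-client');

// Registers event loop lag, heap, GC, active handles/requests, etc.
client.collectDefaultMetrics();

// Expose them so Prometheus can scrape /metrics and Grafana can chart them.
http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', client.register.contentType);
    res.end(await client.register.metrics());
    return;
  }
  res.end('ok');
}).listen(3000);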
17. Exploring event loop lag
Avg event loop lag > 100 ms is a case for investigation
18. Other default metrics that are collected with "collectDefaultMetrics":
https://github.com/siimon/prom-client/tree/master/lib/metrics
19. Debug a specific pod and check the types of handles
Incoming HTTP requests from the load balancer
Outgoing connections to 3rd parties
'Number of active libuv handles grouped by handle type. Every handle type is C++ class
name.'
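Besides the Grafana view, a quick way to see the same grouping from inside a single pod (process.getActiveResourcesInfo() exists since Node.js 17.3; the printed output is only illustrative):

const counts = {};
for (const type of process.getActiveResourcesInfo()) {
  counts[type] = (counts[type] ?? 0) + 1;
}
console.log(counts); // e.g. { TCPSocketWrap: 42, Timeout: 3 }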
23. Logs, what can go wrong?
In highload, logging can become 100,000 logs/sec * 3,600 s/h = 0.36B logs per hour.
- How much would you pay DataDog for this?
- What network load will this create?
- What CPU load will this create?
- How would you navigate through 0.36B logs per hour?
27. Now combine these methods
Error messages should be persistent
You will know the exact number of events that happened
You can still find details about the error and where it happened
You should tune the log rate to your load; it can be any number from 0.00001% to 100%
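A rough sketch of the combination (the metric name, sample rate, and logger are assumptions for the example):

const client = require('prom-client');

const errorCounter = new client.Counter({
  name: 'payment_errors_total', // hypothetical metric name
  help: 'Total number of payment errors',
});

const SAMPLE_RATE = 0.001; // 0.1% of errors carry full details; tune to your load

function reportError(err, context) {
  errorCounter.inc(); // the exact event count is always persisted
  if (Math.random() < SAMPLE_RATE) {
    console.error('payment error', { message: err.message, context }); // sampled detail log
  }
}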
28. Conclusion
Horizontal scaling is the most effective way to handle more requests
Use as few servers as possible
Use batch operations when possible
Log only the needed amount of logs
Offload heavy jobs to "offline workers"
Eliminate long blocking operations
Monitor everything
29. THANK YOU!
Time for questions!
Andrii Shumada
More talks:
https://eagleeye.github.io