In the enterprise there are rarely simple solutions to highly nuanced problems that satisfy all needs. Several customers might each ask "How do I make Jira/Confluence faster?" and each require a different answer. Using this example, this talk will pick apart the inputs, outputs, concerns, and realities of answering a short question with a long answer. We'll then discuss real-world examples from our own internal instances, to give you a taste of the process we've gone through to solve our own performance problems, and to show why there is no simple playbook; "it depends" on a lot! The key takeaways are:
* The importance of having a shared definition of performance
* The importance of having agreed-upon priorities, including what isn't important
* The importance of measuring (allthethings) and understanding them
* The thing you think is the problem might not be the problem, and vice versa
* The real world and the ideal world tend to look nothing alike!
11. My users say that JIRA takes too long to create an issue, and that opening their boards takes forever!
JIRA ADMIN
12. JIRA feels slower than it did last month, and way slower than it was earlier in the year.
JIRA USER
13. My developers say JIRA is too slow, but it seems fine to me…
PROJECT MANAGER
14. We just acquired this other company, they have Confluence as well and we want to merge them together. We’ll end up with 25,000 users and it’s not the fastest now, and then we want to start using Collaborative Editing, and open it up to outside contractors, and …
CONFLUENCE ADMIN
25. Status Quo
Is it ok today? It’s only a few seconds…
Be reasonable…
Don’t compare your internal Jira instance to a supercomputer…
Expectations
Latency
A little is ok, but a lot can be a big problem.
31. Urgency
Trying to fix what’s broken, or make it better?
Who cares?
Discover, discern, and prioritize!
Now vs Later
What’s most beneficial now?
What would be helpful down the road?
Priorities!
34. Staffing Levels
Customer 1
Customer 2
Profile
- 10,000 Users
- Jira Data Center, Confluence Data Center
- 1 Manager
- 2 FT Sys Admin
- 3 FT App Admins
- 1 FT Dev
- 1 FT SRE
- 1 Architect
Assessment
Well-staffed. Team runs all of our applications as well as other developer tools.
35. Staffing Levels
Customer 1
Customer 2
Profile
- 20,000 Users
- 2 Jira Data Center, 2 Confluence Data Center, 1 BB Data Center, FeCru, 3 Bamboo
- 1 Team Lead
- 2 FT/1 PT App Admins
- 1 FT/1 PT Sys Admins
Assessment
Under-staffed. Team runs all of our applications as well as at least 5 other tools across different geographies.
42. It depends…
Bitbucket
Make sure you monitor the CPU!
(But don’t forget about DB load, SCM jobs, or disk speed, or…)
Bamboo
Make sure you monitor the network!
(But don’t forget about CPU load, or page load time, or…)
Jira
Make sure you monitor disk I/O!
(But don’t forget about heap use, or CPU load, or page load time, or…)
Confluence
Make sure you monitor memory!
(But don’t forget about the DB connection pool, or disk I/O, or…)
46. Extranet
Long pauses
Garbage collection pauses of 10-20s
Nodes removed
Lack of response means nodes are aggressively removed from the pool
Garbage Collection
Network latency
Load balancers
48. Extranet
10s health check
Nodes are removed from the pool if there is no response for 10s
Idempotency
Failed requests replay across other nodes
No draining period
Requests fail immediately
Garbage Collection
Network latency
Load balancers
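The interaction between the two Extranet slides (garbage collection pauses of 10-20s and a 10s health check) can be sketched in a few lines. This is a simplified model and not the load balancer's actual logic: it assumes a node is unresponsive for the whole pause and is removed as soon as a pause lasts at least as long as the health check timeout. The `Pause` type, the node names, and the pause durations are hypothetical.

```python
from dataclasses import dataclass

HEALTH_CHECK_TIMEOUT_S = 10.0  # from the slide: removed if no response for 10s


@dataclass
class Pause:
    node: str          # hypothetical node name
    duration_s: float  # length of a stop-the-world GC pause, in seconds


def nodes_removed(pauses, timeout_s=HEALTH_CHECK_TIMEOUT_S):
    """Return nodes with at least one pause as long as the health check timeout."""
    removed = []
    for pause in pauses:
        if pause.duration_s >= timeout_s and pause.node not in removed:
            removed.append(pause.node)
    return removed


# Illustrative pauses only, not measurements from the talk.
pauses = [
    Pause("node1", 2.5),
    Pause("node2", 12.0),
    Pause("node3", 20.0),
    Pause("node2", 15.0),
]
print(nodes_removed(pauses))
```

Under this model, any pause in the 10-20s range quoted on the slide is enough to get a node pulled from the pool, which is why garbage collection behaviour and health check settings have to be tuned together.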
88. Be patient
Wait for peak and low load times
Go slow
Know when to stop
Is that last 100ms really worth it?
Isolate
Make one change at a time and benchmark
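One way to act on "make one change at a time and benchmark" is to summarize response times before and after a single change and compare them. The sketch below is an assumption about how that comparison might look, not the method used in the talk; the sample timings are invented for illustration, and `summarize` is a hypothetical helper.

```python
from statistics import median, quantiles


def summarize(samples_ms):
    """Return the median and 95th percentile of a list of response times (ms)."""
    p95 = quantiles(samples_ms, n=20)[-1]  # last of 19 cut points is the 95th percentile
    return {"median": median(samples_ms), "p95": p95}


# Hypothetical page load times (ms) for the same page, same load period.
before = [820, 790, 900, 1500, 860, 840, 2100, 810, 880, 830]
after = [700, 690, 760, 1300, 720, 710, 1900, 705, 740, 715]

before_stats = summarize(before)
after_stats = summarize(after)
print("median gain (ms):", before_stats["median"] - after_stats["median"])
print("p95 gain (ms):", before_stats["p95"] - after_stats["p95"])
```

Looking at both the median and a high percentile helps answer the "is that last 100ms really worth it?" question, because a change can improve typical requests while leaving the slowest ones untouched.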
96. Magic tricks
CPU
Add more cores, limit concurrency
Garbage Collection
Adding more memory != success
Database
App servers
Requests
97. Magic tricks
Threading
More database connections than HTTP threads
Load balancer
• Increase buffers
• Turn off idempotency
• Allow draining
• ‘Least connections’ over ‘round robin’
Database
App servers
Requests
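The preference for ‘least connections’ over ‘round robin’ can be illustrated with a small sketch. It is a toy model of the two selection strategies, not the configuration of any particular load balancer; the node names and in-flight request counts are hypothetical.

```python
from itertools import cycle


def round_robin(nodes):
    """Hand out nodes in a fixed rotation, ignoring how busy each one is."""
    return cycle(nodes)


def least_connections(active):
    """Pick the node with the fewest in-flight requests (first listed wins ties)."""
    return min(active, key=active.get)


# Hypothetical snapshot: node2 is busy with slow requests.
active = {"node1": 3, "node2": 40, "node3": 4}

rr = round_robin(list(active))
rr_picks = [next(rr) for _ in range(3)]

lc_picks = []
for _ in range(3):
    node = least_connections(active)
    active[node] += 1  # the new request is now in flight on that node
    lc_picks.append(node)

print("round robin:      ", rr_picks)
print("least connections:", lc_picks)
```

Round robin keeps sending work to the overloaded node on every rotation, while least connections routes around it until the other nodes catch up.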
98. Limit complexity
• Limit or combine custom fields
• Clean up unused plugins
• Keep general complexity of workflows low
Magic tricks
Jira
Confluence
Bitbucket Server
99. Tune for fault tolerance
• Extend the cluster safety interval
• Turn off idempotency at your load balancer
• Know your garbage collection behaviour
Magic tricks
Jira
Confluence
Bitbucket Server
100. Load comes from git
• Optimise for git over the JVM
• Scale vertically over horizontally
• Use Docker and mirrors
Magic tricks
Jira
Confluence
Bitbucket Server
103. Planning for the future
Capacity planning
Alerting
HTTP threads
Number of requests in highest load minute / 60 * average time to complete (in seconds) = threads in use at any moment
Requests in highest load minute = 8400
Average time to complete = 0.82s
8400 / 60 * 0.82 ≈ 115 threads, or 29 per node
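The HTTP thread estimate above is an application of Little's Law (requests in flight = arrival rate × time per request), and the arithmetic can be checked with a short sketch. The node count of 4 is an assumption: the slide does not state it, but it is what the "29 per node" figure implies. The function name is hypothetical.

```python
import math


def threads_in_use(requests_in_peak_minute, avg_seconds_per_request):
    """Average requests in flight: arrival rate per second x seconds per request."""
    return requests_in_peak_minute / 60 * avg_seconds_per_request


total = threads_in_use(8400, 0.82)  # figures from the slide

NODES = 4  # assumption: inferred from 115 threads giving 29 per node
per_node = math.ceil(total / NODES)

print(round(total), "threads in total,", per_node, "per node")
```

This gives an average for the busiest minute, so in practice some headroom on top of the per-node figure is a sensible starting point when sizing thread pools.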
104. Planning for the future
Capacity planning
Alerting
Our alerts
• More than 500 HTTP 500 errors in a minute
• More than 300 timeouts at the load balancer in an hour
• Garbage collection pauses > 10s
• Nodes being removed/re-added at the load balancer
• Cluster panics
• Out of memory errors
• Long running space exports
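The numeric alerts in the list above lend themselves to a simple threshold check. The sketch below is one possible shape for such a check, not the alerting setup used in the talk; the metric names and the sample values are hypothetical, and only the three thresholds come from the slide.

```python
# Thresholds from the alert list; the metric names are hypothetical.
THRESHOLDS = {
    "http_500_per_minute": 500,
    "lb_timeouts_per_hour": 300,
    "gc_pause_seconds": 10,
}


def breached(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )


# Illustrative readings only.
sample = {
    "http_500_per_minute": 12,
    "lb_timeouts_per_hour": 450,
    "gc_pause_seconds": 14.2,
}
print(breached(sample))
```

Event-style alerts from the list, such as cluster panics or out of memory errors, do not have a numeric threshold and would need to be matched on log entries or events instead.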
109. Accept Your Reality
There are limits to performance tuning. Be ok with what’s fast enough.
Data is Your Friend
…but having good data takes time. Move slowly and methodically.
Chill
Go slowly. Track Everything. Lather. Rinse. Repeat. (Always repeat.)
110. The Four Principles of Atlassian Performance Tuning
Dan Hardiker
CTO, Adaptavist
SUMMIT EUROPE 2017