A talk about the soft side of scalability, covering team management, process implementation and some solid technology-related principles. Based on 10 years of experience building scalable teams and scalable data platforms.
6. CULTURE
➤ Treat people as volunteers (*)
➤ Lead by living the values you
promote
➤ Respect, collaboration
➤ Promote fun in the workplace
➤ Culture of safety at work (**)
(*) Peter Drucker
(**) Google, Project Aristotle
7. EFFECTIVE TEAMS
PROJECT ARISTOTLE (2012)
Psychological safety: team climate characterised by interpersonal trust and mutual respect in which people are comfortable being themselves. Feeling free to share the things that scare us without fear of recriminations.
Behaviours: conversational turn-taking and empathy.
https://www.nytimes.com/2016/02/28/magazine/what-google-learned-from-its-quest-to-build-the-perfect-team.html
8. TEAMS VS. INDIVIDUAL CONTRIBUTORS
➤ Beware of toxic people
➤ Value communication and
team work over super-heroes
(*) Sunday afternoon test
10. TEAM SIZE
➤ Never underestimate the
power of a small team
➤ Small teams force alignment
and focus
➤ Bigger teams need an insane
amount of overhead
➤ Parkinson's Law: “Work expands to fill the time available for its completion”
➤ Busywork: work that keeps a person busy but has little value in itself
11. TEAM STRUCTURE
No artificial boundaries around languages or skills
Try cross-functional teams
(less friction, better end to end collaboration, project ownership)
12. MIDDLE-MANAGEMENT CURSE
Mistakes:
➤ Prematurely re-organise for scale (deep hierarchy, over-specialisation)
➤ Process managers (factory
mentality) vs Problem solvers
➤ Micromanagement
➤ Non-engineering culture
➤ 1-on-1s as calendar-filler
➤ Not being “on the ground”
➤ Over-confidence in tooling
➤ OTOH, coordination can be hard
14. WHY ARE PROCESSES CRITICAL?
Ease management
of teams/projects
Standardise actions
in repetitive tasks
Reduce mundane
decisions to focus
on grander ideas
Allow the team to
react quickly to crisis
➤ A process shouldn’t exist for the sake of it
➤ Introduce processes gradually, only keep what works
➤ Don’t put too much confidence in tools alone to fix issues
16. PROMOTING SYSTEMS TO PROD
➤ Code reviews
➤ Dev, Test, Stage and Live
environments
➤ Manual and automated QA
processes
➤ Performance and stress testing
➤ Release checklists (runbook)
➤ Instrumentation checks
➤ Testing roll-back capability
BARRIER CONDITIONS: protection from significant failures
17. DESIGN AND CODE REVIEWS
➤ Promote collaboration
➤ Validate ideas, assess risk, detect
flaws, simplify the solution
➤ Reason about behaviour before
coding
DAILY STAND-UPS
➤ Important for knowledge
sharing, collaboration,
alignment
18. CONTROLLING CHANGE: RISK ESTIMATION
http://dilbert.com/strips/comic/2008-05-08/
➤ Limit / log the impact of changes
➤ Risk assessment methodologies:
• Gut feeling / finger in the air
• Traffic light (semaphore) method
• Failure Mode and Effect Analysis
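As a rough illustration of the third method: FMEA is commonly scored by rating each failure mode for severity, likelihood of occurrence and difficulty of detection, then multiplying the three into a risk priority number (RPN). The sketch below assumes 1-10 scales; the `FailureMode` class and the failure modes listed are made up for the example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet (all scores on a 1-10 scale)."""
    name: str
    severity: int     # 10 = catastrophic impact
    occurrence: int   # 10 = almost certain to happen
    detection: int    # 10 = almost impossible to detect before release

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative data only.
modes = [
    FailureMode("schema migration locks table", severity=8, occurrence=3, detection=4),
    FailureMode("config typo in release", severity=5, occurrence=6, detection=2),
    FailureMode("cache stampede after deploy", severity=7, occurrence=4, detection=6),
]
ranked = rank_by_risk(modes)
for m in ranked:
    print(f"{m.rpn:4d}  {m.name}")
```

The ranking gives the team a shared order in which to mitigate risks, which is harder to get from gut feeling alone.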
19. RISK MANAGEMENT
➤ Risk is cumulative
➤ Determine limits and
tolerance
➤ Stress, long hours, peer
pressure can multiply risk
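One way to read "risk is cumulative": if every change in a release carries a small chance of causing an incident, the chance that at least one of them does grows quickly with the number of changes. The sketch below assumes the changes fail independently, which is a simplification; the probabilities and the 10% tolerance are invented for the example.

```python
import math


def cumulative_risk(probabilities):
    """Probability that at least one change fails, assuming independent changes."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


def within_tolerance(probabilities, limit):
    """True if the combined risk stays at or below the agreed limit."""
    return cumulative_risk(probabilities) <= limit


# Ten changes, each judged to carry a 2% chance of causing an incident.
release = [0.02] * 10
print(f"combined risk: {cumulative_risk(release):.3f}")
print("ok to ship together:", within_tolerance(release, limit=0.10))
```

Under these assumptions, ten individually "safe" changes already exceed a 10% tolerance, which is an argument for smaller releases.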
20. WHEN/WHAT TO SCALE: DETERMINING HEADROOM
Headroom: the gap between Capacity and Current Load
Why?
➤ Budget plan
➤ Prioritisation
➤ Hiring plan
Determine starting point, remaining capacity, expected demand
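A simplified headroom calculation, to make the three inputs (starting point, remaining capacity, expected demand) concrete. The function names, the 75% "ideal usage" ceiling and the linear growth model are assumptions for this sketch, not figures from the talk.

```python
def headroom(max_capacity, current_load, ideal_usage=0.75,
             expected_growth=0.0, planned_optimisation=0.0):
    """Remaining usable capacity after expected demand.

    max_capacity, current_load: same unit, e.g. requests per second.
    ideal_usage: fraction of capacity considered safe to use (assumed 75%).
    expected_growth: extra load expected over the planning period.
    planned_optimisation: load expected to be freed by optimisation work.
    """
    usable = ideal_usage * max_capacity
    return usable - current_load - (expected_growth - planned_optimisation)


def periods_until_exhausted(max_capacity, current_load, growth_per_period,
                            ideal_usage=0.75):
    """Whole periods of linear growth left before usable capacity runs out."""
    if growth_per_period <= 0:
        return None  # no growth: capacity is never exhausted
    remaining = ideal_usage * max_capacity - current_load
    return max(0, int(remaining // growth_per_period))


# Example: 10,000 req/s maximum, 4,000 req/s today.
print(headroom(10_000, 4_000, expected_growth=1_500, planned_optimisation=500))
print(periods_until_exhausted(10_000, 4_000, growth_per_period=500))
```

A negative result from `headroom` means demand is expected to exceed the safe ceiling within the planning period, which feeds directly into the budget, prioritisation and hiring plans.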
21. LOAD TESTING
➤ Identify, document and
eliminate bottlenecks through
a strictly controlled process of
measurement and analysis
➤ Measure system’s response
and stability
➤ Verify the app can meet the
desired performance
objectives (SLA)
➤ Establish success criteria, test
environment, tests, what
needs to be monitored, what
data needs to be collected
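A minimal load-test harness along these lines, using only the standard library. `handle_request` is a hypothetical stand-in for the system under test, and the success criterion (95th percentile latency under 250 ms) is an assumed SLA chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def handle_request(_):
    """Stand-in for the system under test (hypothetical)."""
    time.sleep(0.005)


def timed_call(target, arg):
    """Call target(arg) and return the elapsed time in seconds."""
    start = time.perf_counter()
    target(arg)
    return time.perf_counter() - start


def run_load_test(target, requests=200, concurrency=20):
    """Fire `requests` calls using `concurrency` workers; return latencies in seconds."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda i: timed_call(target, i), range(requests)))


def meets_sla(latencies, p95_limit_s):
    """Success criterion: 95th percentile latency at or below the limit."""
    p95 = quantiles(latencies, n=100)[94]
    return p95 <= p95_limit_s, p95


latencies = run_load_test(handle_request)
ok, p95 = meets_sla(latencies, p95_limit_s=0.25)
print(f"p95={p95 * 1000:.1f} ms, SLA met: {ok}")
```

In a real test the latencies would be recorded alongside system metrics (CPU, memory, connections) so that bottlenecks can be identified and documented, not just detected.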
22. STRESS TESTING
➤ Determine the app’s stability
when subjected to above-
normal loads
➤ Verify the app’s behaviour
when close to the breaking
point
➤ Positive testing: progressively
increase load to overwhelm
the system’s resources
➤ Negative testing: take away
resources (memory, threads,
connections) to test the
application's recoverability
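Both styles of stress test can be sketched against a toy service. `TinyService` is a hypothetical system whose only resource is a fixed pool of connections; positive testing ramps up simultaneous clients until requests start being rejected, and negative testing repeats the run with fewer connections.

```python
import threading


class TinyService:
    """Toy system under test with a fixed pool of connections (hypothetical)."""

    def __init__(self, connections):
        self._pool = threading.BoundedSemaphore(connections)

    def try_acquire(self):
        return self._pool.acquire(blocking=False)

    def release(self):
        self._pool.release()


def burst(service, clients):
    """Send `clients` simultaneous requests; return how many were rejected."""
    rejected = []
    all_ready = threading.Barrier(clients)
    all_tried = threading.Barrier(clients)

    def client():
        all_ready.wait()            # start together
        got = service.try_acquire()
        all_tried.wait()            # hold connections until everyone has tried
        if got:
            service.release()
        else:
            rejected.append(1)      # list.append is thread-safe in CPython

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(rejected)


def find_breaking_point(service, start=1, step=2, limit=64):
    """Positive testing: raise load until the first rejection appears."""
    for clients in range(start, limit + 1, step):
        if burst(service, clients) > 0:
            return clients
    return None


print("breaking point:", find_breaking_point(TinyService(connections=8)))
# Negative testing: take resources away and repeat the ramp.
print("with fewer connections:", find_breaking_point(TinyService(connections=4)))
```

Because every burst releases its connections before the next one starts, each step of the ramp also checks that the service recovers to full capacity after being overloaded.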
24. DO NOT SCALE UNTIL YOU CAN’T AVOID IT ANYMORE
➤ “Go meet your people. Do things that don’t scale.” (Paul
Graham to Airbnb’s founders)
➤ Solve for specific problems
➤ Don’t generalise until you have rebuilt something for the third time
➤ Don’t over-engineer the solution
➤ Automate repetitive and error-prone tasks
➤ Avoid complicating things
✴ Phone system
25. MVP APPROACH
➤ Test ideas before spending a
year building something you
haven’t proven in the market
first
➤ Fake it till you make it
➤ Example: Zappos
26. ARCHITECTURAL / DESIGN PRINCIPLES
Design:
➤ N + 1 nodes
➤ for rollback
➤ to be disabled (feature flags)
➤ to be monitored
➤ for multiple live systems/sites
Also:
➤ use mature technology
➤ asynchronous communications
➤ stateless systems
➤ buy when non-core
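"Design to be disabled" can be as simple as a flag check around a non-essential code path. The sketch below uses a hypothetical in-memory `FeatureFlags` store and a made-up `recommendations` feature; a real system would typically read flags from configuration or a flag service.

```python
class FeatureFlags:
    """In-memory flag store (assumption: real flags would come from config)."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name):
        # Unknown flags default to off, so new code paths ship dark.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


def recommendations(user_id, flags):
    """Return recommendations, or a cheap fallback when the feature is disabled."""
    if not flags.is_enabled("recommendations"):
        return []  # degraded but working page
    return [f"item-for-{user_id}"]  # placeholder for the expensive call


flags = FeatureFlags({"recommendations": True})
print(recommendations(42, flags))
flags.set("recommendations", False)  # switch off without a deploy or rollback
print(recommendations(42, flags))
```

The point of the pattern is that a misbehaving feature can be turned off in production faster than a rollback can be completed.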
27. FAULT-TOLERANT STRUCTURES
➤ Swim lanes: isolate and limit the
impacts of failure within the
system by segmenting pipelines
➤ Barrier and Guide (shard)
➤ Increase availability
➤ Make incidents easier to detect,
identify and resolve
➤ Favour the transactions making
the company money first
➤ Isolate functions causing repetitive
problems (or busy tenants)
➤ Consider the natural layout or
topology of the site
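Inside a single process, the swim-lane idea can be approximated by giving each lane its own worker pool, so a lane that is stuck cannot starve the others. The `SwimLanes` class and the lane names below are hypothetical; between services the same isolation would be applied to hosts, databases and queues.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SwimLanes:
    """One dedicated thread pool per lane, so failures stay inside their lane."""

    def __init__(self, lanes):
        # lanes maps lane name -> number of workers reserved for it.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=workers, thread_name_prefix=name)
            for name, workers in lanes.items()
        }

    def submit(self, lane, fn, *args):
        return self._pools[lane].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)


gate = threading.Event()  # report jobs block on this to simulate a stuck dependency
lanes = SwimLanes({"checkout": 2, "reports": 2})

# Saturate the reports lane with jobs that cannot finish yet.
slow = [lanes.submit("reports", gate.wait) for _ in range(10)]

# The revenue-generating lane still answers, because it has its own workers.
paid = lanes.submit("checkout", lambda: "paid").result(timeout=5)
stuck = sum(not f.done() for f in slow)
print(paid, "while", stuck, "report jobs are still waiting")

gate.set()        # unblock the reports lane
lanes.shutdown()
```

With a single shared pool of four workers, the ten blocked report jobs would have occupied every worker and the checkout request would have waited behind them.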
28. SCALING IN DIFFERENT DIRECTIONS
AKF Scaling Cube, “The Art of Scalability”, M.L. Abbott, M.T. Fisher
x: cloning of services and data without any bias
(e.g. more serving nodes in a worker pool where any node can do the work)
y: separation of work responsibility by type of data or type of work
(different specialised worker pools)
z: separation of work by customer or requestor
(dedicated highly specialised worker pools)
29. SCALING IN DIFFERENT DIRECTIONS - 1. SCALING WORK / APPS
x: cloning of entities or data, unbiased distribution of work
(e.g. web site mirror 1, web site mirror 2)
y: separation of work by activity or data
(e.g. search server, shopping cart server)
z: separation of work by person for whom the work is done
(e.g. premium site, standard site)
Diagram: a load balancer (LB) sits in front of these servers
30. SCALING IN DIFFERENT DIRECTIONS - 1. SCALING WORK / APPS
x mirroring
+ scale transactions
- scale data
y split by service
+ scale isolation
+ scale function data
- scale customer data
z split by need / location / value
+ scale isolation
+ scale customer data
- scale function data
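The three axes can be combined in one routing decision. In the sketch below, `CubeRouter` and every node name are invented for illustration: the service picks the pool (y), the customer tier picks the dedicated pool (z), and requests are spread round-robin over identical clones (x).

```python
import itertools


class CubeRouter:
    """Toy router combining the three axes (node names are made up)."""

    def __init__(self, pools):
        # pools maps (service, customer_tier) -> list of identical clones.
        self._clones = {key: itertools.cycle(nodes) for key, nodes in pools.items()}

    def route(self, service, customer_tier):
        # y axis: pick the pool by service; z axis: by customer tier.
        pool = self._clones[(service, customer_tier)]
        # x axis: any clone in the pool can do the work, so round-robin.
        return next(pool)


router = CubeRouter({
    ("search", "standard"): ["search-std-1", "search-std-2"],
    ("search", "premium"): ["search-prem-1"],
    ("cart", "standard"): ["cart-std-1", "cart-std-2"],
    ("cart", "premium"): ["cart-prem-1"],
})
for _ in range(3):
    print(router.route("search", "standard"))
print(router.route("cart", "premium"))
```

Adding capacity on the x axis only means appending a clone to a list, whereas adding a y or z split means introducing a new pool and deciding which requests belong to it.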
31. SCALING IN DIFFERENT DIRECTIONS - 2. SCALING DATA
x: data cloning (replication / clustering) + load balancer
y: split different things by service / resource / data affinity
z: split similar things by modulus / hash-based lookups
Diagram: copy 1, copy 2, copy 3 (clones); ABC, DEF, GHI (partitions)
32. SCALING IN DIFFERENT DIRECTIONS - 2. SCALING DATA
x: data cloning (replication / clustering) + load balancer
+ easy to implement
+ scale transaction volume
+ useful in case of high read to write ratio
- scale data size and growth
y: split different things by service / resource / data affinity
+ fault isolation
+ reduce query time
- more difficult
- data migration
z: split similar things by modulus / hash-based lookups
+ uniformly balanced demand
+ fault isolation
+ scale data and transactions
- more costly
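A minimal sketch of a z-axis split on data, using a hash-and-modulus lookup. `ShardedStore` is hypothetical and its shards are plain dictionaries standing in for separate databases. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index with a stable hash and a modulus."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Key-value store split across in-memory dicts (stand-ins for databases)."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=4)
for i in range(1000):
    store.put(f"customer-{i}", {"id": i})
print("rows per shard:", [len(s) for s in store.shards])
print(store.get("customer-42"))
```

One known drawback of a plain modulus: changing the number of shards remaps most keys, so growing the cluster implies a data migration.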
39. MONITORING
➤ Measure all the things!
➤ Think about what metrics to
track when you design your
app: system/app/user level
➤ Engage with Ops / QA early
on in the design phase
➤ Invest in a good monitoring
solution
➤ Data integrity checks (bucket
analysis, statistical analysis)
➤ Alerting and monitoring
dashboards should be intuitive
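One possible reading of a "bucket analysis" integrity check: count events per time bucket at the source and at the destination of a pipeline, and flag buckets where the counts diverge. The function names, the one-hour bucket and the 1% tolerance below are assumptions made for this sketch.

```python
from collections import Counter


def bucket_counts(timestamps, bucket_seconds=3600):
    """Count events per time bucket (bucket index = timestamp // bucket size)."""
    return Counter(int(ts) // bucket_seconds for ts in timestamps)


def mismatched_buckets(source, sink, tolerance=0.01):
    """Buckets where the sink count deviates from the source by more than `tolerance`."""
    bad = {}
    for bucket in source.keys() | sink.keys():
        expected, actual = source.get(bucket, 0), sink.get(bucket, 0)
        if abs(expected - actual) > tolerance * max(expected, 1):
            bad[bucket] = (expected, actual)
    return bad


# 100 events in hour 0 and 100 in hour 1 at the source.
source_ts = list(range(0, 100)) + list(range(3600, 3700))
# The sink lost ten events from hour 1.
sink_ts = list(range(0, 100)) + list(range(3600, 3690))

bad = mismatched_buckets(bucket_counts(source_ts), bucket_counts(sink_ts))
print(bad)
```

A check like this is cheap enough to run continuously and can feed an alert when any bucket falls outside the tolerance.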
43. OTHER SCALING TIPS
➤ Use caching aggressively (CDNs,
app & object caches)
➤ Design to scale out horizontally
➤ Simplify scope, design,
implementation: lean == fast
➤ Know latencies
➤ Relax temporal constraints
➤ Discuss and learn from mistakes
➤ Design for fault tolerance,
graceful failure, and resilience
➤ Avoid SPOFs
➤ Avoid or distribute state
➤ Be competent
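Two of the tips above, caching and relaxing temporal constraints, meet in a time-to-live cache: results are reused for a short period, accepting slightly stale data in exchange for fewer backend calls. The `ttl_cache` decorator and `load_profile` function are hypothetical names for this sketch, which handles positional, hashable arguments only.

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Cache results per argument tuple for `ttl_seconds` (clock is injectable for tests)."""
    def decorator(fn):
        store = {}  # args -> (time stored, value)

        @wraps(fn)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]          # fresh enough: skip the backend
            value = fn(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator


calls = []


@ttl_cache(ttl_seconds=30)
def load_profile(user_id):
    calls.append(user_id)              # stands in for a slow database query
    return {"id": user_id}


load_profile(7)
load_profile(7)                        # served from the cache
print("backend calls:", len(calls))
```

The TTL is the explicit statement of how stale the data is allowed to be, so it is worth agreeing on per use case and not as a global default.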