Scaling teams, processes and architectures

MANAGING GROWTH
SCALING TEAMS, PROCESSES, ARCHITECTURES
Lorenzo Alberton, CTO @ DataSift
MEST, Accra 10 December 2017

LORENZO ALBERTON
Chief Technology Officer, DataSift
http://alberton.info
@lorenzoalberton

SCALABLE ARCHITECTURES http://bit.ly/scaleds

SCALABILITY IS ABOUT…
People
Technology
Processes
TRUE FOUNDATION

PART 1.
PEOPLE
Staffing, Roles,
Management, Teams

CULTURE
➤ Treat people as volunteers (*)
➤ Lead by living the values you
promote
➤ Respect, collaboration
➤ Promote fun in the workplace
➤ Culture of safety at work (**)
(*) Peter Drucker
(**) Google, Project Aristotle

EFFECTIVE TEAMS
PROJECT ARISTOTLE (2012)
Psychological safety: team climate
characterised by interpersonal trust
and mutual respect in which people
are comfortable being themselves.
Feeling free to share the things that
scare us without fear of
recriminations.
Behaviours: conversational turn-taking
and empathy.
https://www.nytimes.com/2016/02/28/magazine/what-google-learned-from-its-quest-to-build-the-perfect-team.html

TEAMS VS. INDIVIDUAL CONTRIBUTORS
➤ Beware of toxic people
➤ Value communication and
team work over super-heroes
(*) Sunday afternoon test

STAFFING
Don’t hire
experts
Technologies come and go
Focus more on people with passion
and less on people with specific skills

TEAM SIZE
➤ Never underestimate the
power of a small team
➤ Small teams force alignment
and focus
➤ Bigger teams need an insane
amount of overhead
➤ Parkinson's Law: “Work
expands to fill the time available
for its completion”
Busywork: work that keeps a person busy
but has little value in itself

TEAM STRUCTURE
No artificial boundaries around languages or skills
Try cross-functional teams  
(less friction, better end to end collaboration, project ownership)

MIDDLE-MANAGEMENT CURSE
Mistakes:
➤ Prematurely re-organise for scale
(deep hierarchy, over-specialisation)
➤ Process managers (factory
mentality) vs Problem solvers
➤ Micromanagement
➤ Non-engineering culture
➤ 1-on-1s as calendar-filler
➤ Not being “on the ground”
➤ Over-confidence in tooling
➤ OTOH, coordination can be hard

PART 2.
PROCESSES
How to make day to day
operations smooth

WHY ARE PROCESSES CRITICAL?
Ease management
of teams/projects
Standardise actions
in repetitive tasks
Reduce mundane
decisions to focus
on grander ideas
Allow the team to
react quickly to crisis
➤ A process shouldn’t exist for the sake of it
➤ Introduce processes gradually, only keep what works
➤ Don’t put too much confidence in tools alone to fix issues

EXAMPLE PROCESSES
➤ Development methodology
➤ Risk / Benefit analysis
➤ Prioritisation / Planning
➤ Design and code reviews
➤ Evaluating headroom / scale
➤ Load / Stress testing
➤ Test automation
➤ Deployment automation
➤ Release checklists
➤ Risk assessment/management
➤ Blameless postmortems

PROMOTING SYSTEMS TO PROD
➤ Code reviews
➤ Dev, Test, Stage and Live
environments
➤ Manual and automated QA
processes
➤ Performance and stress testing
➤ Release checklists (runbook)
➤ Instrumentation checks
➤ Testing roll-back capability
BARRIER CONDITIONS
Protection from significant failures

DESIGN AND CODE REVIEWS
➤ Promote collaboration
➤ Validate ideas, assess risk, detect
flaws, simplify the solution
➤ Reason about behaviour before
coding

DAILY STAND-UPS
➤ Important for knowledge
sharing, collaboration,
alignment

CONTROLLING CHANGE: RISK ESTIMATION
http://dilbert.com/strips/comic/2008-05-08/
➤ Limit / log the impact of changes
➤ Risk assessment methodologies:
• Gut feeling / finger in the air
• Semaphore method
• Failure Mode and Effect Analysis
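
The Failure Mode and Effect Analysis bullet above can be sketched with the commonly used Risk Priority Number (severity x occurrence x detection). This is only an illustration: the 1 to 10 scales, the `FailureMode` class and the sample failure modes are assumptions, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    description: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always caught) .. 10 (practically undetectable)

    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Highest-risk failure modes first."""
    return sorted(modes, key=lambda mode: mode.rpn(), reverse=True)


# Hypothetical failure modes for one release.
modes = [
    FailureMode("typo in footer copy", 1, 4, 1),
    FailureMode("schema migration locks a table", 8, 3, 4),
    FailureMode("cache stampede after deploy", 5, 6, 2),
]
ranked = rank_by_risk(modes)
```

Ranking the list puts the schema migration first, which is where review and roll-back testing effort would go.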

RISK MANAGEMENT
➤ Risk is cumulative
➤ Determine limits and
tolerance
➤ Stress, long hours, peer
pressure can multiply risk
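
One possible reading of "risk is cumulative", assuming each change fails independently with a known probability: the chance that at least one change in a release fails grows quickly with the number of changes. The function name and the 5% figure are illustrative assumptions.

```python
import math


def cumulative_risk(failure_probabilities):
    """Probability that at least one of several independent changes fails."""
    return 1 - math.prod(1 - p for p in failure_probabilities)


# Ten changes, each with an assumed 5% chance of causing an incident.
release_risk = cumulative_risk([0.05] * 10)
```

Under these assumptions ten individually "safe" changes add up to roughly a 40% chance of an incident, which is why a tolerance limit per release is worth setting.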

WHEN/WHAT TO SCALE: DETERMINING HEADROOM
Capacity
Current Load
Why?
Budget plan
Prioritisation
Hiring plan
Determine starting point, remaining capacity, expected demand
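
A rough sketch of a headroom calculation from the three inputs named above (starting point, remaining capacity, expected demand). The 80% usable ceiling, the compound monthly growth model and the function names are assumptions made for illustration.

```python
import math


def headroom(capacity, current_load, usable_fraction=0.8):
    """Remaining capacity before load reaches the usable ceiling."""
    return capacity * usable_fraction - current_load


def months_until_exhausted(capacity, current_load, monthly_growth,
                           usable_fraction=0.8):
    """Months of compound growth until load reaches the usable ceiling."""
    ceiling = capacity * usable_fraction
    if current_load >= ceiling:
        return 0.0
    if monthly_growth <= 0:
        return math.inf
    return math.log(ceiling / current_load) / math.log(1 + monthly_growth)


# Hypothetical numbers: 10,000 req/s capacity, 4,000 req/s today, 10% growth.
remaining = headroom(10_000, 4_000)
runway = months_until_exhausted(10_000, 4_000, 0.10)
```

With these made-up numbers the runway is a little over seven months, the kind of figure that feeds the budget, prioritisation and hiring plans.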

LOAD TESTING
➤ Identify, document and
eliminate bottlenecks through
a strictly controlled process of
measurement and analysis
➤ Measure system’s response
and stability
➤ Verify the app can meet the
desired performance
objectives (SLA)
➤ Establish success criteria, test
environment, tests, what
needs to be monitored, what
data needs to be collected
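
A minimal sketch of the measurement loop behind a load test: call an operation repeatedly, collect latencies and compare the 95th percentile with a target. It is sequential and single-client, so it only illustrates the success-criteria idea; a real test would use concurrent clients and a dedicated tool. `run_load_test` and the SLA value are hypothetical.

```python
import statistics
import time


def run_load_test(operation, requests, sla_p95_seconds):
    """Call `operation` repeatedly and report latency against the SLA."""
    latencies = []
    errors = 0
    for _ in range(requests):
        start = time.perf_counter()
        try:
            operation()
        except Exception:
            errors += 1
        latencies.append(time.perf_counter() - start)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies, n=100)[94]
    return {
        "requests": requests,
        "errors": errors,
        "median": statistics.median(latencies),
        "p95": p95,
        "meets_sla": errors == 0 and p95 <= sla_p95_seconds,
    }


# A stand-in for the system under test.
report = run_load_test(lambda: None, requests=200, sla_p95_seconds=0.5)
```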

STRESS TESTING
➤ Determine the app’s stability
when subjected to above-normal
loads
➤ Verify the app’s behaviour
when close to the breaking
point
➤ Positive testing: progressively
increase load to overwhelm
the system’s resources
➤ Negative testing: take away
resources (memory, threads,
connections) to test the
application’s recoverability
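
The two styles of stress test can be sketched against a toy model. `ToySystem` is a hypothetical stand-in that fails once load exceeds its connection budget; a real test would drive an actual deployment and watch how it degrades and recovers.

```python
class ToySystem:
    """Hypothetical system that fails once load exceeds its connections."""

    def __init__(self, connections):
        self.connections = connections

    def handle(self, load):
        return load <= self.connections


def find_breaking_point(system, start=10, step=10, limit=10_000):
    """Positive test: raise load step by step; return the first load that fails."""
    load = start
    while load <= limit:
        if not system.handle(load):
            return load
        load += step
    return None


def survives_resource_loss(system, load, removed):
    """Negative test: remove resources, check the load is served, then restore."""
    original = system.connections
    system.connections = max(0, original - removed)
    try:
        return system.handle(load)
    finally:
        system.connections = original


system = ToySystem(connections=55)
breaking_point = find_breaking_point(system)
```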

PART 3.
TECHNOLOGY
Architecting Robust,
Scalable Solutions

DO NOT SCALE UNTIL YOU CAN’T AVOID IT ANYMORE
➤ “Go meet your people. Do things that don’t scale.” (Paul
Graham to Airbnb’s founders)
➤ Solve for specific problems
➤ Don’t generalise until you have rebuilt something for the 3rd time
➤ Don’t over-engineer the solution
➤ Automate repetitive and error-prone tasks
➤ Avoid complicating things
✴ Phone system

MVP APPROACH
➤ Test ideas before spending a
year building something you
haven’t proven in the market
first
➤ Fake it till you make it
➤ Example: Zappos

ARCHITECTURAL / DESIGN PRINCIPLES
➤ N + 1 nodes
➤ Design for rollback
➤ Design to be disabled (feature flags)
➤ Design to be monitored
➤ Design for multiple live systems/sites
➤ Use mature technology
➤ Asynchronous communications
➤ Stateless systems
➤ Buy when non core
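
"To be disabled (feature flags)" can be illustrated with a small in-memory flag store. This is a sketch under assumptions: `FeatureFlags`, `render_product_page` and the "recommendations" feature are hypothetical names, and a production system would typically read flags from shared configuration so they can be flipped without a deploy.

```python
class FeatureFlags:
    """Minimal in-memory flag store; unknown features default to off."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def is_enabled(self, name):
        return self._flags.get(name, False)

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False


def render_product_page(product, flags):
    """The core page always renders; the optional feature can be switched off."""
    page = {"product": product}
    if flags.is_enabled("recommendations"):
        page["recommendations"] = ["hypothetical-item-1", "hypothetical-item-2"]
    return page


flags = FeatureFlags({"recommendations": True})
```

Calling `flags.disable("recommendations")` during an incident removes the optional feature while the core page keeps serving.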

FAULT-TOLERANT STRUCTURES
➤ Swim lanes: isolate and limit the
impacts of failure within the
system by segmenting pipelines
➤ Barrier and Guide (shard)
➤ Increase availability
➤ Make incidents easier to detect,
identify and resolve
➤ Favour the transactions making
the company money first
➤ Isolate functions causing repetitive
problems (or busy tenants)
➤ Consider the natural layout or
topology of the site

SCALING IN DIFFERENT DIRECTIONS
AKF Scaling Cube, “The Art of Scalability”, M. L. Abbott, M. T. Fisher
x: cloning of services and data without any bias
(e.g. more serving nodes in a worker pool where any node can do the work)
y: separation of work responsibility by type of data or type of work
(different specialised worker pools)
z: separation of work by customer or requestor
(dedicated highly specialised worker pools)

SCALING IN DIFFERENT DIRECTIONS - 1. SCALING WORK / APPS
x: cloning of entities or data - unbiased distribution of work
(e.g. web site mirror 1, web site mirror 2, behind a LB)
y: separation of work by activity or data
(e.g. search server, shopping cart server)
z: separation of work by person for whom the work is done
(e.g. premium site, standard site)

SCALING IN DIFFERENT DIRECTIONS - 1. SCALING WORK / APPS
x: mirroring
+ scale transactions
- scale data
y: split by service
+ scale isolation
+ scale function data
- scale customer data
z: split by need / location / value
+ scale isolation
+ scale customer data
- scale function data
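
The three axes can be combined in one routing decision, sketched here: the y axis picks a pool by service, the z axis by customer tier, and the x axis rotates over identical clones inside the pool. Pool names, tiers and the `route` function are invented for the example.

```python
import itertools

# Hypothetical worker pools keyed by (service, customer tier).
POOLS = {
    ("search", "premium"): ["search-premium-1", "search-premium-2"],
    ("search", "standard"): ["search-standard-1", "search-standard-2"],
    ("cart", "premium"): ["cart-premium-1", "cart-premium-2"],
    ("cart", "standard"): ["cart-standard-1", "cart-standard-2"],
}

# One round-robin iterator per pool (x axis: any clone can do the work).
_round_robin = {key: itertools.cycle(nodes) for key, nodes in POOLS.items()}


def route(service, customer_tier):
    """y axis: choose by service; z axis: choose by customer; x axis: rotate."""
    return next(_round_robin[(service, customer_tier)])
```

Successive calls to `route("search", "premium")` alternate between the two premium search clones, while cart traffic and standard customers never touch that pool.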

SCALING IN DIFFERENT DIRECTIONS - 2. SCALING DATA
x: data cloning (replication / clustering) + load balancer
y: split different things by service / resource / data affinity
z: split similar things by modulus / hash-based lookups
copy 1 copy 2 copy 3
ABC DEF GHI

SCALING IN DIFFERENT DIRECTIONS - 2. SCALING DATA
x: data cloning (replication / clustering) + load balancer
+ easy to implement
+ scale transaction volume
+ useful in case of high read to write ratio
- scale data size and growth
y: split different things by service / resource / data affinity
+ fault isolation
+ reduce query time
- more difficult
- data migration
z: split similar things by modulus / hash-based lookups
+ uniformly balanced demand
+ fault isolation
+ scale data and transactions
- more costly
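
A minimal sketch of the z-axis "modulus / hash-based lookup", assuming string keys and a fixed list of shards. The shard names and `shard_for` are hypothetical; `zlib.crc32` is used because, unlike Python's built-in `hash()` for strings, it gives the same value in every process.

```python
import zlib

# Hypothetical shard names.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for(key, shards=SHARDS):
    """Map a key to a shard with a stable hash and a modulus."""
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]
```

One known trade-off of a plain modulus: changing the number of shards remaps most keys, so growth needs a data migration plan.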

QUEUES
➤ Asynchronous communication
➤ Workload distribution
➤ Failure isolation

MESSAGE QUEUES AS BUFFERS (ASYNC COMM - DECOUPLING)
Unpredictable load spikes
Load normalisation / smoothing
Batching ⇒ higher throughput
source / producer
sink / consumer
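
A small sketch of the buffering and batching idea using the standard library `queue` module: the producer enqueues a burst at its own pace, and the consumer drains whatever is waiting in batches. `drain_batch` and the batch size are illustrative assumptions.

```python
import queue


def drain_batch(buffer, max_batch):
    """Take up to max_batch items that are already waiting, without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    return batch


buffer = queue.Queue()
for event in range(10):  # a burst from the producer
    buffer.put(event)

batches = []
while True:
    batch = drain_batch(buffer, max_batch=4)
    if not batch:
        break
    batches.append(batch)  # the consumer would process each batch in one go
```

The burst of ten events is absorbed by the queue and handled as three batches instead of ten separate operations.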

WORKLOAD DISTRIBUTION - LOAD BALANCING
Producer (push)
Consumer 1 (pull)
Consumer 2 (pull)
Consumer 3 (pull)
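
The push/pull arrangement can be sketched with threads as competing consumers: the producer pushes jobs onto one shared queue, and each consumer pulls the next job when it is free, so faster consumers naturally take more work. The squaring "job" and the names are placeholders.

```python
import queue
import threading


def run_workers(jobs, worker_count=3):
    """Push jobs onto a shared queue; consumers pull until it is empty."""
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    results = []
    lock = threading.Lock()

    def consumer(name):
        while True:
            try:
                job = work.get_nowait()
            except queue.Empty:
                return  # nothing left to pull
            with lock:
                results.append((name, job * job))

    threads = [
        threading.Thread(target=consumer, args=(f"consumer-{i + 1}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_workers(range(20))
```

Which consumer handles which job varies from run to run; only the set of results is deterministic.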

MULTIPLEXING
pull
Consumer
fair-queuing:
R1, R4, R5,
R2, R6, R3
Producer 1
Producer 2
Producer 3
push R4
push R1, R2, R3
push R5, R6
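
The fair-queuing order on the slide (R1, R4, R5, R2, R6, R3) is what a simple round-robin over the producers yields. In this sketch, which producer pushed which requests is an assumption, since the slide text does not pair them explicitly.

```python
from collections import deque


def fair_queue(producers):
    """Round-robin over producers, taking one pending request from each in turn."""
    pending = [deque(requests) for requests in producers]
    order = []
    while any(pending):  # an empty deque is falsy
        for requests in pending:
            if requests:
                order.append(requests.popleft())
    return order


# Assumed assignment of requests to the three producers.
order = fair_queue([["R1", "R2", "R3"], ["R4"], ["R5", "R6"]])
```

No producer can starve the others: a producer with a long backlog still gets only one slot per round.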

HIGH AVAILABILITY (PUB-SUB / BROADCAST)
Listener 1
Listener 2
Listener 3
[Broadcast]
Publisher 1
Publisher 2
[Dynamic Subscriptions]
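
A bare-bones, in-process sketch of broadcast with dynamic subscriptions: every message goes to all current listeners, and listeners can join or leave at any time. `Broadcaster` is a hypothetical class; a real deployment would use a message broker rather than in-memory callbacks.

```python
class Broadcaster:
    """Deliver each published message to every current listener."""

    def __init__(self):
        self._listeners = {}

    def subscribe(self, name, callback):
        self._listeners[name] = callback

    def unsubscribe(self, name):
        self._listeners.pop(name, None)

    def publish(self, message):
        # Copy the callbacks so listeners may unsubscribe while being notified.
        for callback in list(self._listeners.values()):
            callback(message)
        return len(self._listeners)


bus = Broadcaster()
inbox_1, inbox_2 = [], []
bus.subscribe("listener-1", inbox_1.append)
bus.subscribe("listener-2", inbox_2.append)
bus.publish("m1")
bus.unsubscribe("listener-2")  # subscriptions can change at runtime
bus.publish("m2")
```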

BOUND YOUR QUEUE SIZE - APPLY BACK PRESSURE
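
A minimal sketch of a bounded queue that pushes back on the producer, using the standard library `queue.Queue` with `maxsize`. The tiny size, the timeout and the `offer` helper are assumptions for illustration; what the producer does on rejection (retry, slow down, shed load) is a design choice.

```python
import queue

buffer = queue.Queue(maxsize=2)  # bounded: never holds more than 2 items


def offer(item, timeout=0.01):
    """Try to enqueue; return False instead of letting the queue grow."""
    try:
        buffer.put(item, timeout=timeout)
        return True
    except queue.Full:
        return False  # back pressure: the producer must slow down or retry


# Nobody is consuming yet, so only the first two offers are accepted.
accepted = [offer(item) for item in range(4)]
```

Once a consumer takes an item with `buffer.get()`, the next `offer` succeeds again.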

MONITORING
➤ Measure all the things!
➤ Think about what metrics to
track when you design your
app: system/app/user level
➤ Engage with Ops / QA early
on in the design phase
➤ Invest in a good monitoring
solution
➤ Data integrity checks (bucket
analysis, statistical analysis)
➤ Alerting and monitoring
dashboards should be intuitive
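
Deciding what to track at design time can start as small as counters and timers wrapped around the interesting code paths. The `Metrics` class and metric names below are hypothetical; a real system would ship these values to a monitoring backend instead of keeping them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory counters and timings, keyed by metric name."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()
with metrics.timer("app.render_page"):   # app-level metric
    metrics.increment("user.page_views")  # user-level metric
```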

LOOK! RIB CAGES!
INTUITIVE MONITORING DASHBOARDS: LIVE HEAT-MAPS

LOOK! MONITORS!

OTHER SCALING TIPS
➤ Use caching aggressively (CDNs,
app & object caches)
➤ Design to scale out horizontally
➤ Simplify scope, design,
implementation: lean == fast
➤ Know latencies
➤ Relax temporal constraints
➤ Discuss and learn from mistakes
➤ Design for fault tolerance,
graceful failure, and resilience
➤ Avoid SPOFs
➤ Avoid or distribute state
➤ Be competent
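
As an illustration of the caching tip at the object level, the standard library `functools.lru_cache` can memoise an expensive lookup. `slow_lookup` is a made-up stand-in for a database or remote call, and the cache size is arbitrary.

```python
from functools import lru_cache

calls = {"count": 0}


def slow_lookup(product_id):
    """Stand-in for an expensive query (hypothetical)."""
    calls["count"] += 1
    return {"id": product_id, "name": f"product-{product_id}"}


@lru_cache(maxsize=1024)
def cached_name(product_id):
    """Only the first call per product_id reaches slow_lookup."""
    return slow_lookup(product_id)["name"]
```

Repeated calls with the same id are served from memory; the usual caveat is that cached values can go stale, so invalidation needs a plan.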

REFERENCES
http://www.slideshare.net/quipo/the-art-of-scalability-managing-growth
http://www.infoq.com/presentations/Simple-Made-Easy-QCon-London-2012
http://www.slideshare.net/postwait/scalable-internet-architecture
http://bit.ly/IJKwuc
http://agile.dzone.com/news/approaches-organizational
https://bitly.com/vCSd49
M. L. Abbott, M. T. Fisher, “The Art of Scalability”, Addison-Wesley
http://theartofscalability.com/

http://alberton.info/talks
@lorenzoalberton
lorenzo@datasift.com
THANK YOU!
/in/lorenzoalberton
