Lessons Learned Building Storm

Lessons learned building Storm
Nathan Marz
@nathanmarz

Watch the video with slide synchronization on InfoQ.com:
http://www.infoq.com/presentations/storm-lessons

Presented at QCon New York
www.qconnewyork.com

Software development in theory

Software development in practice

Storm
Widely used stream processing system

Fully distributed and scalable
Strong processing guarantees
High performance
Multi-tenant

Storm architecture
Nimbus: master node (similar to Hadoop JobTracker)
Zookeeper: used for cluster coordination
Supervisors: run worker processes

Lesson #1:
No such thing as a long-lived process

If JobTracker dies, all jobs die

Solution:
Design system to be fault-tolerant
to process restart

Implications
Program can be kill -9’d at any point
All state must be external to process
State modification might be aborted at any point
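
One common way to satisfy these implications, sketched here as an illustration and not as Storm's actual code: keep state in a file outside the process and replace it atomically, so a kill -9 in the middle of a write leaves either the old state or the new state, never a half-written one. The function names save_state and load_state and the JSON file layout are assumptions made for this example.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Atomically replace the file at `path` with `state` serialized as JSON."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        # The rename is the commit point: readers see the old file or the
        # new file, but never a partially written one.
        os.replace(tmp_path, path)
    except BaseException:
        # The write was aborted before the commit; drop the temporary file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path, default=None):
    """Read the last committed state, or `default` if none was ever saved."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A restarted process calls load_state first and continues from whatever was last committed.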

Other benefits for Storm
Easy to reconfigure
Can upgrade Nimbus without touching apps (e.g. with bug fixes)

Lesson #2:
Use state machines to express
intricate behavior

Killing a topology

1. Stop emitting new data into topology
2. Wait to let topology finish processing in-transit messages
3. Shut down workers
4. Clean up state

Killing a topology

Asynchronous
Must be process fault-tolerant
Don’t allow activate/deactivate/rebalance on a killed topology
Should be able to kill a killed topology with a smaller wait time

Originally lots of (buggy)
conditional logic

Rewrote Nimbus to be an asynchronous,
process fault-tolerant state machine

Example of general solution being easier
to understand than the specific solution
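
A minimal sketch of the idea, not Nimbus's actual implementation: the rules for killing a topology become entries in a transition table instead of scattered conditionals. The status names, event names, and record layout below are assumptions made for illustration; the two rules encoded (no activate/deactivate on a killed topology, and re-killing with a smaller wait time) come from the slides above.

```python
# Hypothetical transition table: (current status, event) -> next status.
# Any pair not listed here is rejected.
TRANSITIONS = {
    ("active", "deactivate"): "inactive",
    ("active", "kill"): "killed",
    ("inactive", "activate"): "active",
    ("inactive", "kill"): "killed",
    ("killed", "kill"): "killed",     # re-kill is allowed
    ("killed", "remove"): "removed",  # cleanup once the wait has elapsed
}


class InvalidTransition(Exception):
    pass


def transition(topology, event, wait_secs=30):
    """Return a new topology record with `event` applied.

    `topology` is a dict with at least a "status" key. The input is not
    mutated, so the caller can persist the result in a single write.
    """
    key = (topology["status"], event)
    if key not in TRANSITIONS:
        raise InvalidTransition(f"cannot {event} a topology that is {topology['status']}")
    updated = dict(topology, status=TRANSITIONS[key])
    if event == "kill":
        previous = topology.get("wait_secs")
        # A repeated kill can only shorten the remaining wait, never extend it.
        updated["wait_secs"] = wait_secs if previous is None else min(previous, wait_secs)
    return updated
```

Because each transition produces a complete new record, a process that dies between transitions can resume from the last stored status.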

Lesson #3:
Every feature will be abused

One rogue topology uses up all
disk space on the cluster

Solution:
Switch from log4j to logback so
that size of logs can be limited

Example:
Storm’s “reportError” method

Used to show errors in the Storm UI

Error info is stored in Zookeeper

What happens when a user deploys code like this?

Denial-of-service on Zookeeper
and cluster goes down

Solution:
Rate-limit how many errors/sec
can be written to Zookeeper
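
A rough sketch of such a limit, assuming a simple fixed-window counter; the talk does not say which algorithm Storm uses. ErrorRateLimiter and report_error are hypothetical names, and a plain list stands in for Zookeeper.

```python
import time


class ErrorRateLimiter:
    """Allow at most `max_per_interval` events per `interval_secs` window."""

    def __init__(self, max_per_interval, interval_secs=1.0, clock=time.monotonic):
        self.max_per_interval = max_per_interval
        self.interval_secs = interval_secs
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.interval_secs:
            self.window_start = now
            self.count = 0
        if self.count < self.max_per_interval:
            self.count += 1
            return True
        return False


def report_error(limiter, store, error):
    """Write `error` to `store` only if the limiter allows it.

    Returns True if the error was written and False if it was dropped.
    """
    if limiter.allow():
        store.append(error)
        return True
    return False
```

With this in place, a topology that reports errors in a tight loop costs the coordination store a bounded number of writes per second; the excess is dropped.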

Lesson #4:
Isolate to avoid cascading failure

Originally one giant shared Zookeeper
cluster for all services within Twitter

If one service abused ZK, that could
lead to cascading failure

Zookeeper is not a multi-tenant system

Solution:
Storm got a dedicated Zookeeper cluster

Lesson #5:
Minimize dependencies

Other people’s code is not correct

Fewer dependencies = less possibility for failure

Example:
Storm’s usage of Zookeeper

Worker locations stored in Zookeeper

All workers must know locations of
other workers to send messages

Two ways to get location updates

1. Poll Zookeeper

2. Use Zookeeper “watch”
feature to get push notifications

Method 2 is faster but
relies on another feature

If watch feature fails, locations
still propagate via polling

Eliminating dependence justified
by small amount of code required
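
The polling fallback can be sketched as follows. This is an illustration under assumptions, not Storm's code: LocationCache is a hypothetical name, and fetch_locations stands in for whatever call reads worker locations from Zookeeper.

```python
class LocationCache:
    """Keeps a local copy of worker locations.

    The watch callback and the periodic poll both go through the same
    refresh path, so the watch is only an optimization: if a notification
    is never delivered, the next poll still picks up the change.
    """

    def __init__(self, fetch_locations):
        # `fetch_locations` is any callable returning {worker_id: address}.
        self.fetch_locations = fetch_locations
        self.locations = {}

    def refresh(self):
        # Copy the result so later changes at the source do not leak in.
        self.locations = dict(self.fetch_locations())

    def on_watch_event(self):
        # Fast path: a push notification triggers an immediate refresh.
        self.refresh()

    def poll_once(self):
        # Slow path: called on a timer regardless of watch activity.
        self.refresh()
```

The extra code is one timer and one method, which is the small cost the slide refers to.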

Lesson #6:
Monitoring is a prerequisite to
operating robust software

Storm makes monitoring
topologies very easy

Lots of stats monitored automatically

Simple API to monitor custom stats

Automatically integrates with
visualization stack

Lesson #7:
How we solved the resource
management problem

Resource management

Multi-tenancy: topologies don’t affect each other

Capacity management: converting $$$ into topologies

Initially treated these as
separate problems

Multi-tenancy attempt #1
Resource isolation
using Mesos

[Diagram: Storm running on Mesos across four machines]

Each machine runs workers
from many topologies

Resource isolation is an
extraordinary claim

Extraordinary claims require
extraordinary evidence!

Ran into massive variance problems

“At least” resource model of Mesos (a worker gets at least
its requested resources, possibly more) made
capacity measurement impossible

Almost went down route of resource
isolation with resource capping

But what about hardware threading
and caching?

Conclusion:
Sharing a single machine for
independent applications is
fundamentally complex

Capacity management attempt #1
1. Provide shared Storm cluster
2. Measure capacity usage in aggregate
3. Always have some % of cluster free
4. Grow cluster as needed according to usage

People would deploy topologies with
more workers than slots on cluster

People only care about getting their
application working and will twist any
knob possible

Problems
Production topologies starved
Bloated resource usage because no incentive to optimize
No process for making $$$ decisions

Requirements
1. Production topologies get priority to resources
2. One topology cannot affect the performance of another topology
3. Incentives for people to optimize resource usage
4. Process for making $$$ decisions on machines
5. Ability to measure how much capacity a topology needs (required for 3 and 4)

The more complex the problem,
the simpler the solution must be

Isolation scheduler
Nimbus configuration
Configurable only by cluster administrator
Map from topology name to # of machines
Topologies listed are production topologies
These topologies guaranteed dedicated access to that # of machines
Remaining machines used for failover and for running development topologies
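
The allocation rule described above can be sketched in a few lines. This is an illustration of the idea only, not the scheduler shipped with Storm; assign_machines and the machine and topology names are hypothetical.

```python
def assign_machines(machines, isolated):
    """Split a cluster into dedicated machines and a shared pool.

    machines: list of machine names in the cluster.
    isolated: dict mapping production topology name -> number of machines.
    Returns (dedicated, shared), where `dedicated` maps each production
    topology to its own machines and `shared` is what remains for
    failover and development topologies.
    """
    needed = sum(isolated.values())
    if needed > len(machines):
        raise ValueError(f"need {needed} machines but the cluster has {len(machines)}")
    free = list(machines)
    dedicated = {}
    for name, count in isolated.items():
        # Hand out the next `count` free machines to this topology only.
        dedicated[name], free = free[:count], free[count:]
    return dedicated, free
```

Because no machine appears in two dedicated sets, measuring a production topology's capacity reduces to counting its machines.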

Benefits
Resource contention issue of Mesos completely avoided
Takes advantage of process fault-tolerance of Nimbus
Simple to use and understand
Easy to do capacity measurements
Distinguishes production from in-development

Topology productionization process
1. Test topology on cluster as a development topology
2. When ready, work with admins to do capacity measurement
3. Submit capacity proposal for approval by VP
4. Allocate machines immediately from failover machines
5. Backfill capacity when machines arrive 4-6 weeks later

Benefits
Incentives to optimize resource usage
Backfill allows immediate productionization
Human process integrated with technical solution
