Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/162zy81.
Nathan Marz shares lessons learned building Storm, an open-source, distributed, real-time computation system. Filmed at qconnewyork.com.
Nathan Marz is currently working on a new startup. He was the lead engineer at BackType before it was acquired by Twitter in 2011. At Twitter, he started the streaming compute team, which provides and develops shared infrastructure to support many critical real-time applications throughout the company. Nathan is the creator of many open source projects, including Cascalog and Storm.
Watch the video with slide synchronization on InfoQ.com:
http://www.infoq.com/presentations/storm-lessons
3. Presented at QCon New York
www.qconnewyork.com
25. Killing a topology
1. Stop emitting new data into topology
2. Wait to let topology finish processing in-transit messages
3. Shut down workers
4. Clean up state
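
The four steps above can be sketched as a small simulation. This is a minimal illustration, not Storm's implementation: the `Topology` class, its fields, and the `kill` function are hypothetical names, and "waiting" is modeled by draining a counter rather than by a real timer.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Hypothetical in-memory stand-in for a running topology."""
    name: str
    emitting: bool = True      # are sources still emitting new data?
    in_flight: int = 0         # messages still being processed
    workers: list = field(default_factory=list)
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


def kill(topology: Topology) -> list:
    """Run the four kill steps in order and return a log of what happened."""
    # 1. Stop emitting new data into the topology.
    topology.emitting = False
    topology.log.append("stopped emitting")

    # 2. Let in-transit messages finish (modeled as draining a counter).
    while topology.in_flight > 0:
        topology.in_flight -= 1
    topology.log.append("drained in-flight messages")

    # 3. Shut down workers.
    topology.workers.clear()
    topology.log.append("workers shut down")

    # 4. Clean up state.
    topology.state.clear()
    topology.log.append("state cleaned up")
    return topology.log


topo = Topology("demo", in_flight=3, workers=["w1", "w2"], state={"assignments": "..."})
steps = kill(topo)
```

The order matters: workers are only shut down after new input has stopped and in-transit messages have drained, so no data is dropped by the kill itself.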
26. Killing a topology
Asynchronous
Must be process fault-tolerant
Don’t allow activate/deactivate/rebalance on a killed topology
Should be able to kill a killed topology with a smaller wait time
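
One way to read these rules is as a small state machine. The sketch below is an assumption-laden illustration, not Storm's code: `TopologyRecord`, `Status`, and all method names are hypothetical, time is passed in explicitly as a number of seconds, and "a smaller wait time" is interpreted as "a repeated kill may bring the shutdown deadline forward but never push it back".

```python
import enum


class Status(enum.Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    KILLED = "killed"


class TopologyRecord:
    """Hypothetical persisted record; a restarted process would resume from it."""

    def __init__(self, name):
        self.name = name
        self.status = Status.ACTIVE
        self.kill_deadline = None  # time at which workers should be shut down

    def kill(self, now, wait_secs):
        # Asynchronous: only record the deadline and return immediately.
        deadline = now + wait_secs
        if self.status is Status.KILLED:
            # Re-killing may only shorten the remaining wait.
            self.kill_deadline = min(self.kill_deadline, deadline)
        else:
            self.status = Status.KILLED
            self.kill_deadline = deadline

    def _require_not_killed(self, action):
        if self.status is Status.KILLED:
            raise ValueError(f"cannot {action} killed topology {self.name!r}")

    def activate(self):
        self._require_not_killed("activate")
        self.status = Status.ACTIVE

    def deactivate(self):
        self._require_not_killed("deactivate")
        self.status = Status.INACTIVE

    def rebalance(self):
        self._require_not_killed("rebalance")
        # The rebalancing work itself is out of scope for this sketch.

    def ready_to_remove(self, now):
        # A background task (or a restarted process) polls this to finish the kill.
        return self.status is Status.KILLED and now >= self.kill_deadline
```

Because the kill is recorded as state plus a deadline rather than held in a running thread, a process that crashes mid-kill can pick the work back up after restart by re-reading the record.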
81. Capacity management attempt #1
1. Provide shared Storm cluster
2. Measure capacity usage in aggregate
3. Always have some % of cluster free
4. Grow cluster as needed according to usage
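
The arithmetic behind steps 2 to 4 can be sketched as follows. The function name, its parameters, and the 20% figure in the example are illustrative assumptions; the slides do not state a specific percentage.

```python
def machines_to_add(total_machines: int, used_machines: int, min_free_percent: int) -> int:
    """Return how many machines to add so that at least min_free_percent
    of the cluster stays free at the current aggregate usage."""
    if not 0 <= min_free_percent < 100:
        raise ValueError("min_free_percent must be in [0, 100)")
    # Requirement: used / new_total <= (100 - min_free_percent) / 100.
    # Integer ceiling division avoids floating-point rounding surprises.
    denom = 100 - min_free_percent
    required_total = (used_machines * 100 + denom - 1) // denom
    return max(0, required_total - total_machines)


# Example: 100 machines, 85 in use, keep at least 20% free.
extra = machines_to_add(total_machines=100, used_machines=85, min_free_percent=20)
```

Here 85 machines in use may occupy at most 80% of the cluster, so the cluster needs at least 107 machines and `extra` is 7.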
86. Requirements
1. Production topologies get priority to resources
2. One topology cannot affect the performance of another topology
3. Incentives for people to optimize resource usage
4. Process for making $$$ decisions on machines
5. Ability to measure how much capacity a topology needs (to support requirements 3 and 4)
96. Benefits
Resource contention issue of Mesos completely avoided
Takes advantage of process fault-tolerance of Nimbus
Simple to use and understand
Easy to do capacity measurements
Distinguishes production from in-development
97. Topology productionization process
1. Test topology on cluster as a development topology
2. When ready, work with admins to do capacity measurement
3. Submit capacity proposal for approval by VP
4. Allocate machines immediately from failover machines
5. Backfill capacity when machines arrive 4-6 weeks later
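
Steps 4 and 5 amount to borrowing from a pool of spare machines and repaying it later. The sketch below models only that bookkeeping; `FailoverPool` and its methods are hypothetical names, and the approval and measurement steps are assumed to have happened already.

```python
class FailoverPool:
    """Hypothetical accounting for a pool of spare (failover) machines."""

    def __init__(self, spare: int):
        self.spare = spare     # failover machines available right now
        self.owed = 0          # machines borrowed and not yet backfilled
        self.dedicated = {}    # topology name -> machines allocated to it

    def productionize(self, topology: str, approved_machines: int) -> None:
        # Step 4: allocate immediately from the failover machines.
        if approved_machines > self.spare:
            raise ValueError("not enough failover machines to allocate immediately")
        self.spare -= approved_machines
        self.owed += approved_machines
        self.dedicated[topology] = self.dedicated.get(topology, 0) + approved_machines

    def machines_arrived(self, count: int) -> None:
        # Step 5: newly delivered machines backfill the failover pool.
        self.spare += count
        self.owed = max(0, self.owed - count)


pool = FailoverPool(spare=10)
pool.productionize("demo-topology", approved_machines=4)
spare_while_waiting = pool.spare   # reduced until the ordered machines arrive
pool.machines_arrived(4)
```

The topology gets its machines on the day of approval; the cost is a temporarily smaller failover pool during the delivery window.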
98. Benefits
Incentives to optimize resource usage
Backfill allows immediate productionization
Human process integrated with technical solution