Synchronized video and slides, plus MP3 and slide downloads, are available at http://bit.ly/1nyhaC6.
Nathan Marz discusses building NoSQL-based data systems that are scalable and easy to reason about. Filmed at qconlondon.com.
Nathan Marz is the creator of many open source projects, including Cascalog and Storm, which are relied upon by over 50 companies around the world. Nathan is also working on a book for Manning Publications entitled "Big Data: principles and best practices of scalable realtime data systems". Nathan was previously the lead engineer at BackType before it was acquired by Twitter in 2011.
A Call for Sanity in NoSQL
1. A call for sanity in NoSQL
Nathan Marz
@nathanmarz
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/nosql-cons
3. Presented at QCon London
www.qconlondon.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
27. Normalized schema

ID  Name    Location ID
1   Sally   3
2   George  1
3   Bob     3

Location ID  City       State  Population
1            New York   NY     8.2M
2            San Diego  CA     1.3M
3            Chicago    IL     2.7M
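
To make the normalized schema concrete, here is a minimal Python sketch that holds the two tables from the slide as plain dictionaries and answers "which city does this person live in?" by following the Location ID reference. The dictionary layout and the function name `city_of` are assumptions made for illustration; they do not appear in the talk.

```python
# Normalized data from the slide: each fact is stored exactly once.
# The dict layout and the name city_of are illustrative assumptions.
people = {
    1: {"name": "Sally", "location_id": 3},
    2: {"name": "George", "location_id": 1},
    3: {"name": "Bob", "location_id": 3},
}

locations = {
    1: {"city": "New York", "state": "NY", "population": "8.2M"},
    2: {"city": "San Diego", "state": "CA", "population": "1.3M"},
    3: {"city": "Chicago", "state": "IL", "population": "2.7M"},
}


def city_of(person_id):
    """Join a person to their location through the Location ID reference."""
    person = people[person_id]
    return locations[person["location_id"]]["city"]


sally_city = city_of(1)
george_city = city_of(2)
```

Every read pays for the extra lookup, but a change to a location only has to be made in one place.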
29. Denormalized schema

ID  Name    Location ID  City      State
1   Sally   3            Chicago   IL
2   George  1            New York  NY
3   Bob     3            Chicago   IL

Location ID  City       State  Population
1            New York   NY     8.2M
2            San Diego  CA     1.3M
3            Chicago    IL     2.7M
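
The denormalized schema copies City and State into every person row. The sketch below derives those rows from the normalized tables, then shows how the copies go stale when only the location table is changed. This is an illustration under assumptions: the helper names `denormalize` and `stale_rows` and the renamed city are hypothetical and not taken from the slides.

```python
# Hypothetical helpers illustrating the redundancy in the denormalized schema.
locations = {
    1: {"city": "New York", "state": "NY", "population": "8.2M"},
    2: {"city": "San Diego", "state": "CA", "population": "1.3M"},
    3: {"city": "Chicago", "state": "IL", "population": "2.7M"},
}

people = [
    {"id": 1, "name": "Sally", "location_id": 3},
    {"id": 2, "name": "George", "location_id": 1},
    {"id": 3, "name": "Bob", "location_id": 3},
]


def denormalize(people, locations):
    """Copy city and state into each person row, as on the slide."""
    rows = []
    for person in people:
        loc = locations[person["location_id"]]
        rows.append({**person, "city": loc["city"], "state": loc["state"]})
    return rows


def stale_rows(rows, locations):
    """Return rows whose copied city/state no longer match the location table."""
    stale = []
    for row in rows:
        loc = locations[row["location_id"]]
        if (row["city"], row["state"]) != (loc["city"], loc["state"]):
            stale.append(row)
    return stale


rows = denormalize(people, locations)
stale_before = stale_rows(rows, locations)    # nothing is stale yet

locations[3]["city"] = "Chicagoland"          # made-up rename, location table only
stale_after = stale_rows(rows, locations)     # every copy of location 3 is now stale

rebuilt = denormalize(people, locations)      # recomputing from the source fixes it
```

Reads no longer need the join, but every write to a location now has to find and update each copy, or the rows must be recomputed from the normalized source.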
100. Batch workflow
- All data
- Normalize pageview URLs
- Equiv connected-component labeling
- Normalize pageview userids
- Aggregate HyperLogLog sets per bucket
- Index into batch view
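
The stages of the batch workflow can be sketched end to end in a few lines of Python. This is a simplified illustration, not the implementation from the talk: the URL normalization rules, the hourly bucket size, the sample pageviews, and the function names are all assumptions, and an exact Python `set` stands in for the HyperLogLog set so the result is easy to check by hand.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url):
    """Assumed normalization: lowercase host, drop scheme, query, and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.netloc.lower()}{path}"


def build_batch_view(pageviews, equiv_labels, bucket_seconds=3600):
    """Recompute the whole view from all data: (url, bucket) -> set of people."""
    view = defaultdict(set)
    for url, userid, timestamp in pageviews:
        key = (normalize_url(url), timestamp // bucket_seconds)
        # Normalize the userid to its equiv label; unknown ids stand for themselves.
        view[key].add(equiv_labels.get(userid, userid))
    return dict(view)


# Made-up sample data: (url, userid, timestamp in seconds).
pageviews = [
    ("http://Example.com/a/", 1, 10),
    ("http://example.com/a?ref=x", 2, 20),
    ("http://example.com/a", 6, 30),
    ("http://example.com/a", 6, 4000),
]
# Output of the equiv labeling stage: userid -> person label.
equiv_labels = {1: "A", 2: "A", 6: "B"}

batch_view = build_batch_view(pageviews, equiv_labels)
uniques = {key: len(users) for key, users in batch_view.items()}
```

Userids 1 and 2 collapse into the same person "A", so the first hourly bucket counts two unique visitors rather than three.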
103. Equiv graph
1 -> A
2 -> A
3 -> A
4 -> A
5 -> A
11 -> A
6 -> B
7 -> B
9 -> B
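
A mapping like the one above is what connected-component labeling produces when run over the equiv edges. The slide shows only the resulting labels, so the edge list below is hypothetical, chosen so that the output matches the slide. The sketch uses an in-memory union-find, which is one common way to label components; the talk does not say which algorithm its batch job uses.

```python
def label_components(edges):
    """Label each node with the smallest node id in its connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller so the label is the minimum id.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {node: find(node) for node in list(parent)}


# Hypothetical equiv edges (pairs of userids known to be the same person).
equiv_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 11), (6, 7), (7, 9)]

roots = label_components(equiv_edges)

# Rename the numeric component labels to letters to match the slide.
letters = {root: chr(ord("A") + i) for i, root in enumerate(sorted(set(roots.values())))}
equiv_graph = {node: letters[root] for node, root in roots.items()}
```

Once every userid maps to a single label, the userid normalization stage is a plain lookup.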
104. Batch workflow
- All data
- Normalize pageview URLs
- Equiv connected-component labeling
- Normalize pageview userids
- Aggregate HyperLogLog sets per bucket
- Index into batch view
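
The "Aggregate HyperLogLog sets per bucket" stage relies on two properties of HyperLogLog: a set of any size fits in a fixed number of small registers, and two sets merge by taking the element-wise maximum of those registers, so per-bucket sets can be combined to answer a query over a range of buckets. The class below is a minimal teaching sketch of the standard algorithm, not the library used in the talk. The choice of SHA-1 as the hash, the precision `p=12`, and the sample user labels are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (assumes p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sets: element-wise maximum of the registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two hourly buckets with overlapping visitors (made-up labels).
bucket_1 = HyperLogLog()
bucket_2 = HyperLogLog()
for i in range(0, 6000):
    bucket_1.add(f"user-{i}")
for i in range(4000, 10000):
    bucket_2.add(f"user-{i}")

# Uniques over both hours: merge the sets, then count. True answer is 10,000.
two_hour_estimate = bucket_1.merge(bucket_2).count()
```

The estimate is approximate by design. With 4,096 registers the typical error is on the order of a couple of percent, and visitors who appear in both buckets are not double counted.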
116. Conclusions from example
- Avoids every single insane complexity I talked about
- Powerful to be able to extract more out of a piece of data the longer you have it
- Eventual accuracy is a super useful technique
- Schemas can be useful and non-painful
- A Lambda Architecture is fundamentally easy to extend with new views/queries