3. Computer generated data
Application server logs (web sites, games)
Sensor data (weather, water, smart grids)
Images/videos (traffic, security cameras)
4. Human generated data
Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
Blogs/Reviews/Emails/Pictures
Social graphs
Facebook, LinkedIn, contacts
5. Big Data is full of valuable, unanswered questions!
8. Why is Big Data Hard (and Getting Harder)?
Data Structure
Need to consolidate data from multiple data sources
in multiple formats across multiple businesses
9. Why is Big Data Hard (and Getting Harder)?
Changing Data Requirements
Faster response time on fresher data
Sampling is not good enough and history is important
Increasing complexity of analytics
Users demand inexpensive experimentation
11. Innovation #1:
Apache Hadoop
The MapReduce computational paradigm
Open source, scalable, fault-tolerant, distributed system
Hadoop lowers the cost of developing a distributed
system for data processing
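
As a rough illustration of the MapReduce paradigm named on this slide, here is a single-process Python sketch of the map, shuffle, and reduce phases applied to word counting. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for this example. Hadoop runs the same three phases across many machines and handles failures along the way.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the talk.
    sample = ["big data is hard", "big data is valuable"]
    print(word_count(sample))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can in principle be split across nodes, which is what makes the paradigm a fit for a distributed system.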
12. Innovation #2:
Amazon Elastic Compute Cloud (EC2)
“provides resizable compute capacity in the cloud.”
Amazon EC2 lowers the cost of operating a
distributed system for data processing
14. Elastic MapReduce applications
Targeted advertising / Clickstream analysis
Security: anti-virus, fraud detection, image recognition
Pattern matching / Recommendations
Data warehousing / BI
Bio-informatics (Genome analysis)
Financial simulation (Monte Carlo simulation)
File processing (resize jpegs, video encoding)
Web indexing
15. Clickstream Analysis –
Big Box Retailer came to Razorfish
3.5 billion records
71 million unique cookies
1.7 million targeted ads required per day
Problem: Improve Return on Ad Spend (ROAS)
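
To get a feel for the scale of these figures, the short Python sketch below derives two averages from the numbers on this slide. It assumes that ads are spread evenly over a 24-hour day and that records are spread evenly across cookies; neither assumption comes from the talk, and real traffic is likely to be skewed.

```python
# Figures quoted on the slide.
RECORDS = 3_500_000_000
UNIQUE_COOKIES = 71_000_000
TARGETED_ADS_PER_DAY = 1_700_000

SECONDS_PER_DAY = 24 * 60 * 60

# Average number of clickstream records behind each cookie (roughly 49).
records_per_cookie = RECORDS / UNIQUE_COOKIES

# Average targeted-ad decisions per second, assuming an even spread (just under 20).
ads_per_second = TARGETED_ADS_PER_DAY / SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"records per cookie: {records_per_cookie:.1f}")
    print(f"targeted ads per second: {ads_per_second:.1f}")
```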
16. Clickstream Analysis –
Targeted Ad
User recently purchased a sports movie and is searching for video games (1.7 million per day)
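
The targeting rule described on this slide can be sketched as a group-by-cookie step followed by a per-user rule. The Python below is only an illustration: the event fields (`cookie`, `action`, `category`), the category labels, and the segment name are hypothetical, and the sketch ignores how recent the purchase was, which a real rule would need to check.

```python
from collections import defaultdict


def group_by_cookie(events):
    """Group clickstream events by cookie ID."""
    by_cookie = defaultdict(list)
    for event in events:
        by_cookie[event["cookie"]].append(event)
    return by_cookie


def pick_segment(user_events):
    """Return an ad segment for one user's events, or None if no rule matches."""
    purchased = {e["category"] for e in user_events if e["action"] == "purchase"}
    searched = {e["category"] for e in user_events if e["action"] == "search"}
    # Hypothetical rule mirroring the slide: bought a sports movie, searching for games.
    if "sports_movies" in purchased and "video_games" in searched:
        return "sports_video_games"
    return None


def target_ads(events):
    """Map each cookie to its ad segment, omitting cookies with no match."""
    targets = {}
    for cookie, user_events in group_by_cookie(events).items():
        segment = pick_segment(user_events)
        if segment is not None:
            targets[cookie] = segment
    return targets


if __name__ == "__main__":
    # Hypothetical sample events, not data from the talk.
    sample = [
        {"cookie": "c1", "action": "purchase", "category": "sports_movies"},
        {"cookie": "c1", "action": "search", "category": "video_games"},
        {"cookie": "c2", "action": "search", "category": "video_games"},
    ]
    print(target_ads(sample))
```

Grouping by cookie first means each user's rule can be evaluated independently, which is the shape of work that a MapReduce cluster parallelizes well.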
17. Clickstream Analysis –
Lots of experimentation, but the final design:
100-node on-demand Elastic MapReduce cluster running Hadoop
20. World’s largest handmade marketplace
8.9 million items
1 billion page views per month
$320MM 2010 GMS
21. • Easy to ‘backfill’ and run experiments: just boot up a cluster
with 100, 500, or 1000 nodes
[Diagram labels: Production DB snapshots; Web event logs; ETL – Step 1; ETL – Step 2; Job]
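
The diagram on this slide names two inputs, two ETL steps, and downstream jobs. One plausible reading, sketched below in Python, is that step 1 parses raw web event logs, step 2 enriches them with a production DB snapshot, and a job aggregates the result. The data flow, the tab-separated log format, the field names, and the example job are all assumptions made for illustration; the talk does not specify them.

```python
from collections import Counter


def etl_step_1(raw_log_lines):
    """Parse tab-separated log lines into dicts, skipping malformed lines."""
    records = []
    for line in raw_log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # Drop lines that do not match the assumed format.
        timestamp, item_id, event = parts
        records.append({"timestamp": timestamp, "item_id": item_id, "event": event})
    return records


def etl_step_2(records, db_snapshot):
    """Enrich each event with the item's category from the DB snapshot."""
    enriched = []
    for record in records:
        item = db_snapshot.get(record["item_id"])
        if item is None:
            continue  # Item is missing from this snapshot.
        enriched.append({**record, "category": item["category"]})
    return enriched


def views_per_category_job(enriched):
    """Example downstream job: count 'view' events per item category."""
    return Counter(r["category"] for r in enriched if r["event"] == "view")


if __name__ == "__main__":
    # Hypothetical inputs, not data from the talk.
    logs = ["2011-09-01T10:00:00\t42\tview", "bad line", "2011-09-01T10:05:00\t7\tview"]
    snapshot = {"42": {"category": "jewelry"}, "7": {"category": "ceramics"}}
    print(views_per_category_job(etl_step_2(etl_step_1(logs), snapshot)))
```

Keeping each step a pure function of its inputs is what makes a backfill cheap: the same code can be rerun over older logs and snapshots on a larger, temporary cluster.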
25. MapReduce at Yelp
• Yelp does not have a physical MapReduce cluster
• Running 250 production clusters per week
• All of those run on Elastic MapReduce
30. 9/23/2011 Amazon EMR Strata Justin Moore - @injust
How do we use EMR?
• Map-Reduce
– Run algorithms on our entire dataset
– Streaming jobs, complex analyses
• Hive
– Business intelligence
– Exploratory analyses
– Infographics!
31. How big is our data?
• Global reach (North Pole, Space)
• Native app for almost every smartphone, SMS,
web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
37. When do people go to a place?
[Chart panels: Thursday, Friday, Saturday, Sunday]
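
A question like the one on this slide can be answered by bucketing check-ins by day and time and counting each bucket. The Python sketch below assumes each check-in carries a venue-local ISO 8601 timestamp in a field named `local_time`, and it buckets by hour; both the field name and the hourly granularity are assumptions, since the slide only shows per-day panels.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def map_checkin(checkin):
    """Map: emit ((day name, hour), 1) for a single check-in."""
    ts = datetime.fromisoformat(checkin["local_time"])
    return (DAY_NAMES[ts.weekday()], ts.hour), 1


def checkins_by_day_and_hour(checkins):
    """Reduce: sum the emitted counts for each (day, hour) bucket."""
    counts = Counter()
    for checkin in checkins:
        key, value = map_checkin(checkin)
        counts[key] += value
    return counts


if __name__ == "__main__":
    # Hypothetical check-ins, not data from the talk.
    sample = [
        {"local_time": "2011-09-23T21:15:00"},
        {"local_time": "2011-09-23T21:40:00"},
        {"local_time": "2011-09-24T10:05:00"},
    ]
    for bucket, count in sorted(checkins_by_day_and_hour(sample).items()):
        print(bucket, count)
```

Filtering the input to one venue or venue category before counting would give the per-place view that the slide title asks about.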
38. Why are people checking in?
• Explore their city, discover new places
• Find friends, meet up
• Save with local deals
• Get insider tips on venues
• Personal analytics, diary
• Follow brands and celebrities
• Earn points, badges, gamification of life
• The list grows…
39. How can we leverage these insights?
40. Join us!
foursquare is hiring
www.foursquare.com/jobs
Justin Moore
@injust
justin@foursquare.com