2. Outline
● Some comments on what we're trying to show
○ high level cluster configuration
○ an example application that might use this config
■ based on a Gowalla data set
● Launch cluster nodes on EC2
● Launch/configure Cassandra on cluster
● Demonstrate use of Cassandra
○ cassandra-cli, pycassa scripts to interact with db
● Demonstrate use of Hadoop
● Demonstrate use of Pig on the real data
3. Cluster configuration
● Four EC2 nodes
○ m1.medium instances
■ realistically a bit small for real-world use
● 3 nodes form the Cassandra cluster
○ data can be written to the db dynamically via the Thrift API
● All nodes run Hadoop TaskTracker
● MapReduce runs close to (Cassandra) data
● JobTracker on separate node
4. Cluster config
● Diagram: four nodes
○ one node runs the JobTracker
○ three nodes each run Cassandra and a TaskTracker
● All nodes m1.small for demo
7. Application data
● Used Gowalla data in this test application
● Gowalla provide anonymized data for
test/research purposes:
○ Graph of UID connections
○ List of checkins - UID, LocID
● Size of data set:
○ 400MB checkins
■ 6.4 million checkins
○ ~200k users
● Also generated a simpler variant of this data for demonstration
○ more realistic user information
○ more realistic location information
8. Application data - User Graph
● Simple graph structure
○ unidirectional graph with UIDs as nodes
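A minimal sketch of this structure, assuming "unidirectional" means each edge is stored as a one-way link from one UID to another. The edge list and the `build_graph` helper are hypothetical, invented for illustration and not taken from the Gowalla data:

```python
from collections import defaultdict

# Hypothetical (uid, friend_uid) pairs, invented for illustration.
edges = [(1, 2), (1, 3), (2, 3), (4, 1)]

def build_graph(edge_list):
    """Return {uid: set of uids it links to}; each edge is stored one way only."""
    graph = defaultdict(set)
    for src, dst in edge_list:
        graph[src].add(dst)
        graph[dst]  # touch dst so every UID appears as a node, even with no outgoing edges
    return dict(graph)

graph = build_graph(edges)
```

Here `graph[1]` holds the UIDs that user 1 links to, while user 3, who only appears as a target, maps to an empty set.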
10. How this data can be used
● Application is interested in:
○ my checkins
○ list my friends
○ checkins at given location
○ my friends' checkins
● Analytics:
○ top ten most active users (by number of checkins)
○ aggregate checkins per week
○ aggregate checkins per week per city
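To make the analytics concrete, here is a small in-memory sketch of the first two queries. Plain Python stands in for the Hadoop/Pig jobs used in the demo; the sample rows, the timestamp field, and the function names are assumptions made for illustration only.

```python
from collections import Counter
from datetime import datetime

# Hypothetical checkins: (uid, loc_id, ISO timestamp). Values are invented.
checkins = [
    (1, 100, "2010-10-18T09:00:00"),
    (1, 101, "2010-10-19T12:30:00"),
    (2, 100, "2010-10-19T13:00:00"),
    (3, 102, "2010-10-26T08:15:00"),
    (1, 102, "2010-10-27T18:45:00"),
]

def top_users(rows, n=10):
    """Return the n users with the most checkins as (uid, count) pairs."""
    return Counter(uid for uid, _, _ in rows).most_common(n)

def checkins_per_week(rows):
    """Count checkins per ISO (year, week number)."""
    weeks = Counter()
    for _, _, ts in rows:
        iso = datetime.fromisoformat(ts).isocalendar()
        weeks[(iso[0], iso[1])] += 1
    return dict(weeks)

print(top_users(checkins, 1))       # [(1, 3)]
print(checkins_per_week(checkins))  # {(2010, 42): 3, (2010, 43): 2}
```

The per-city variant would group on a (week, city) key instead, which requires joining each checkin to its location record first.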
11. Cassandra data models
● The following data models were used:
○ User
○ Location
○ Checkin
○ FriendRels
■ graph of friend relationships
○ UserCheckins
■ checkins by user
○ LocationCheckins
■ checkins by location
○ FriendCheckins
■ checkins by friends
12. Cassandra data models
● Use of valueless columns
○ FriendRels, UserCheckins, LocationCheckins, FriendCheckins are just sets of valueless columns
● FriendRels:
○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...}
■ row_key is a uid
● UserCheckins:
○ row_key: {checkinid1: '', checkinid2: '', ...}
■ row_key is uid
● LocationCheckins uses LocID as row key
● FriendCheckins uses my UID as row key to get my friends' checkins
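The valueless-column pattern can be sketched with nested dicts standing in for column families. This is not pycassa code: the helper names are hypothetical, and the fan-out on write used to fill FriendCheckins is an assumption, since the slides do not say how that column family is populated.

```python
from collections import defaultdict

# Each column family is modelled as {row_key: {column_name: value}}.
# The column name carries the data; the value is always the empty string.
friend_rels = defaultdict(dict)        # row key: uid, columns: friend uids
user_checkins = defaultdict(dict)      # row key: uid, columns: checkin ids
location_checkins = defaultdict(dict)  # row key: loc id, columns: checkin ids
friend_checkins = defaultdict(dict)    # row key: uid, columns: friends' checkin ids

def add_friend(uid, friend_uid):
    """Record friend_uid as a valueless column in uid's FriendRels row."""
    friend_rels[uid][friend_uid] = ""

def add_checkin(checkin_id, uid, loc_id):
    """Index one checkin by user, by location, and for each user who lists uid as a friend."""
    user_checkins[uid][checkin_id] = ""
    location_checkins[loc_id][checkin_id] = ""
    # Assumed fan-out on write: copy the checkin id into the FriendCheckins
    # row of every user whose FriendRels row contains uid.
    for other_uid, friends in friend_rels.items():
        if uid in friends:
            friend_checkins[other_uid][checkin_id] = ""

add_friend("u1", "u2")
add_checkin("c1", "u2", "loc9")
add_checkin("c2", "u1", "loc9")

print(sorted(friend_checkins["u1"]))      # ['c1']
print(sorted(location_checkins["loc9"]))  # ['c1', 'c2']
```

Reading "my friends' checkins" then becomes a single row lookup by UID, at the cost of extra writes per checkin.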