1. By :: Jaideep Katkar
Under the Guidance of :: Dr. Tran Thanh
2. GraphLab Overview
A New Framework For Parallel Machine
Learning
– High-level abstractions for machine
learning problems
– Targets shared-memory multiprocessors
– Assumes no fault tolerance is needed
– Concurrent-access processing models with
sequential-consistency guarantees
3. How GraphLab Works
– The user's data is represented by a directed graph
– Each block of data is associated with a vertex
or a directed edge
– Shared data table
– User functions:
Update: modifies vertex and edge state; has read-only
access to the shared table
Fold: sequentially aggregates into a key entry in the
shared table; may modify vertex data
Merge: combines partial results so that Fold can run in parallel
Apply: finalizes the key entry in the shared table
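The Fold / Merge / Apply pattern above can be sketched in plain C++. This is only an illustration of the idea, not the GraphLab API: the `SumCount` type and all function names are hypothetical, vertex data is reduced to a single `double`, and the "parallel" folds are simply run one after the other over two halves of the data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accumulator for one shared-table entry (not a GraphLab type).
struct SumCount {
  double sum = 0.0;
  std::size_t count = 0;
};

// Fold: sequentially aggregate one vertex value into the accumulator.
void fold(SumCount& acc, double vertex_value) {
  acc.sum += vertex_value;
  acc.count += 1;
}

// Merge: combine two partial accumulators, so that folds over disjoint
// vertex subsets could run in parallel and be joined afterwards.
SumCount merge(const SumCount& a, const SumCount& b) {
  SumCount out;
  out.sum = a.sum + b.sum;
  out.count = a.count + b.count;
  return out;
}

// Apply: finalize the entry; here the running sum becomes a mean.
double apply(const SumCount& acc) {
  assert(acc.count > 0);
  return acc.sum / static_cast<double>(acc.count);
}

// Fold each half of the vertex values separately, then merge and apply.
double mean_of_vertices(const std::vector<double>& values) {
  SumCount left, right;
  const std::size_t mid = values.size() / 2;
  for (std::size_t i = 0; i < mid; ++i) fold(left, values[i]);
  for (std::size_t i = mid; i < values.size(); ++i) fold(right, values[i]);
  return apply(merge(left, right));
}
```

Calling `mean_of_vertices` on a vector of vertex values returns their average. Merge only works here because addition is associative, which is what lets the partial folds be combined in any order.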
5. GraphLab Toolkit
Topic Modeling contains applications like LDA, which can be used to
cluster documents and extract topical representations.
Graph Analytics contains applications like PageRank and triangle
counting, which can be applied to general graphs to estimate
community structure.
Clustering contains standard data clustering tools such as K-means.
Collaborative Filtering contains a collection of applications used to
make predictions about users' interests and to factorize large matrices.
Graphical Models contains tools for making joint predictions about
collections of related random variables.
Computer Vision contains a collection of tools for reasoning about
images.
6. Running GraphLab on EC2 Cluster
Requirements ::
• You should have an Amazon EC2 account eligible to run in the us-east-1a zone.
• From the Amazon AWS console, get your AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY (under your account name in the top right
corner -> Security Credentials -> Access Keys).
• You should have a keypair attached to the zone you are running in (in our
example, us-east-1a).
• Install boto, the AWS Python client. To install, run: ‘sudo pip install boto’.
• Download and install GraphLab as described on the next slides.
7. Satisfying Dependencies on Ubuntu
All the dependencies can be installed from the Ubuntu repositories.
The command below installs gcc and the JDK, which are needed to compile GraphLab programs:
Downloading GraphLab version 2.2
You can download GraphLab directly from the GitHub repository.
GitHub also offers a zip download of the repository if you do not have
git. The git command line for cloning the repository is:
8. Compiling and Running Graphlab
Running the configure script in the graphlabapi directory creates two
sub-directories, release/ and debug/. cd into either of these directories and
run make to build the release or the debug version respectively. Note that this
compiles all of GraphLab, including all toolkits.
9. Running Stochastic gradient descent (SGD) in
Collaborative Filtering toolkit
The collaborative filtering toolkit contains tools for computing a linear model
of the data and for predicting missing values based on this linear model. This is
useful when computing recommendations for users.
http://docs.graphlab.org/collaborative_filtering.html
10. Running SGD for Netflix Data to predict
User Rating
Input File (Training) for Netflix Data
[User] [item] [rating]
1000 2 5.0
3 7 12.0
6 2 2.1
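To make the input format concrete, here is a minimal sketch of reading such "[user] [item] [rating]" lines in plain C++. It is not the toolkit's own loader: the `Rating` struct and `parse_ratings` function are hypothetical names, and lines that do not contain three numeric fields (such as a header line) are simply skipped.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One training record: user id, item id, and the observed rating.
struct Rating {
  long user;
  long item;
  double value;
};

// Parse whitespace-separated "user item rating" lines from a text buffer.
// Lines that do not start with three numeric fields are skipped.
std::vector<Rating> parse_ratings(const std::string& text) {
  std::vector<Rating> out;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    Rating r;
    if (fields >> r.user >> r.item >> r.value) {
      out.push_back(r);
    }
  }
  return out;
}
```

Passing the three sample lines above to `parse_ratings` yields three records, the first being user 1000, item 2, rating 5.0.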
Creating Directory to load Netflix data
11. Command Line Arguments to Run SGD
--gamma=XX Gradient descent step size.
--lambda=XX Gradient descent regularization.
--step_dec=XX Multiplicative step decrease. Should be between 0.1
and 1. Default is 0.9.
--D=X Feature vector width. Common values are 20 - 150.
--max_iter=XX Maximum number of iterations.
--maxval=XX Maximum allowed rating.
--minval=XX Minimum allowed rating.
--predictions=XX File name to write predictions to. Note that you will
need a user/item pair input file named something.predict to enable
predictions (see section: ratings).
--tol=XX Stop computation when the absolute error of prediction is less
than the tolerance. Default is 1e-3.
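As a rough illustration of how --step_dec and --minval/--maxval might act, the sketch below assumes the step size is multiplied by step_dec once per iteration and that predictions are truncated to the allowed rating range. Both are assumptions for illustration; the function names are hypothetical and are not taken from the toolkit.

```cpp
#include <algorithm>
#include <cassert>

// Step size after `iterations` multiplicative decreases:
// gamma * step_dec^iterations (assumed schedule, one decrease per iteration).
double decayed_step(double gamma, double step_dec, int iterations) {
  assert(step_dec > 0.0 && step_dec <= 1.0);
  double g = gamma;
  for (int i = 0; i < iterations; ++i) {
    g *= step_dec;
  }
  return g;
}

// Truncate a raw prediction to the [minval, maxval] rating range.
double clamp_rating(double prediction, double minval, double maxval) {
  assert(minval <= maxval);
  return std::min(maxval, std::max(minval, prediction));
}
```

With a step_dec below 1, later iterations take smaller steps, which tends to stabilize the descent near a minimum.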
13. Output File
SGD is a simple gradient descent algorithm. Prediction in SGD is
computed as r_ui = p_u * q_i, where r_ui is the scalar rating of user u for
item i, p_u is the user feature vector of size D, q_i is the item
feature vector of size D, and the product is the inner (dot) product.
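The prediction rule can be sketched in a few lines of C++. The `predict` function below is the dot product described above; `sgd_step` is the textbook regularized update for matrix factorization, shown as an assumption about what a single training step looks like rather than the toolkit's actual code. All names are hypothetical; `gamma` and `lambda` play the roles of the step size and regularization parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Predicted rating r_ui: inner product of user and item feature vectors.
double predict(const std::vector<double>& p_u, const std::vector<double>& q_i) {
  assert(p_u.size() == q_i.size());
  double r = 0.0;
  for (std::size_t k = 0; k < p_u.size(); ++k) {
    r += p_u[k] * q_i[k];
  }
  return r;
}

// One textbook SGD step on a single observed rating.
// Updates both vectors in place and returns the error before the update.
double sgd_step(std::vector<double>& p_u, std::vector<double>& q_i,
                double rating, double gamma, double lambda) {
  const double err = rating - predict(p_u, q_i);
  for (std::size_t k = 0; k < p_u.size(); ++k) {
    const double pu = p_u[k];  // keep the old value for the item update
    p_u[k] += gamma * (err * q_i[k] - lambda * pu);
    q_i[k] += gamma * (err * pu - lambda * q_i[k]);
  }
  return err;
}
```

Each step moves both feature vectors in the direction that reduces the squared error on that rating, while the `lambda` term shrinks them toward zero to limit overfitting.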
14. Creating a GraphLab project
To create a GraphLab project, simply create a sub-directory
in the graphlab/apps/ folder with your
project name.
For instance,
graphlab/apps/my_first_GraphLabProject.
Then create a text file called CMakeLists.txt in that directory with the
following contents ::
project(My_GraphLabProject)
add_graphlab_executable(my_first_GraphLabProject <ProgramName>.cpp)
15. Hello World in GraphLab
#include <graphlab.hpp>
using namespace graphlab;

int main(int argc, char** argv)
{
  // Set up MPI before creating any distributed objects.
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;
  // dc.cout() is the output stream provided by the distributed control object.
  dc.cout() << "Hello World!\n";
  graphlab::mpi_tools::finalize();
  return 0;
}
• dc is the distributed communication layer, which is needed by a number of
the core GraphLab objects, whether you are running distributed or not.
• To build the program, run the configure script, then run "make" in the
debug/ or release/ build folder. When executed, the program prints "Hello
World!".